Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV


The nativeID is intended to refer to the closest-to-native format that 
can be interpreted by the machine. In your pipeline, the mgf is the 
closest-to-native format, so yes that nativeID would and should be 
preserved throughout the pipeline.  Your database use case cannot use 
mzML ids because xsd:IDs are unique within a file, not across files. You 
do not have any kind of guarantee that your ids will be distinct between 
two mzML files, not to mention the fact that non-mzML files don't even 
HAVE an id. Consider the pipeline: mgf --> SearchEngine --> analysisXML
What do you use for spectrumID? :)
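
The scoping argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, ids, and record layout are made up, and no real mzML parsing is involved. It shows that an id which is unique only within its file collides once records from two files share one table, while a key that also carries the source file does not.

```python
# Hypothetical spectrum records from two separate mzML files.
# File names and id values are invented for illustration only.
spectra = [
    {"file": "run1.mzML", "id": "S1"},
    {"file": "run1.mzML", "id": "S2"},
    {"file": "run2.mzML", "id": "S1"},  # same xsd:ID as in run1.mzML, which is legal
]

# Keyed on the file-local id alone: the run2 record overwrites the run1 record.
by_id = {s["id"]: s for s in spectra}

# Keyed on (file, id): every spectrum keeps its own entry.
by_file_and_id = {(s["file"], s["id"]): s for s in spectra}

print(len(spectra), len(by_id), len(by_file_and_id))
```

The same reasoning applies to whichever identifier is referenced: a database needs the input file as part of the key, because neither identifier is distinct across files by itself.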

-Matt

Jones, Andy wrote:
>  Hi Matt,
>
>  Consider the following pipeline mgf --> mzML --> SearchEngine -->
>  analysisXML
>
>  Having thought about this some more, I'm fairly sure that we want to
>  reference the ID attribute rather than nativeID. The nativeID is
>  intended to identify the source spectrum prior to conversion to mzML
>  format i.e. it does not strictly identify the data represented in the
>  file. The input to analysisXML is the mzML-formatted spectrum, not
>  the source mgf file. If we reference the nativeID, this implies that
>  the input to the SearchEngine was the mgf representation of the
>  spectrum. It's a minor point that makes no difference for most XML
>  implementations but the mgf formatted spectrum and the mzML formatted
>  spectrum are different objects. If a database implements this, it
>  will be much simpler to have a chain of inputs and outputs with
>  distinct IDs, reflecting the processing that has happened at each
>  stage. From a database/LIMS or file tracking point of view, this
>  could be significant I think.
>
>  Cheers Andy
> > -----Original Message-----
> > From: Matt Chambers [mailto:mat...@va...]
> > Sent: 01 December 2008 14:23
> > To: psi...@li...
> > Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >
> > Jones, Andy wrote:
> >> Hi all,
> >>
> >> The issues list is getting a bit messy with essentially a mailing
> >>  list discussion so I'll shift the discussion back here :-)
> >>
> >> There are two points up for discussion.
> >>
> >> 1) Use of identifiers for input spectra
> >> 2) CV terms shared between psi-ms and psi-pi
> >>
> >> In terms of 1) I've worked through Matt's argument and I'm in
> >> general agreement that we would like to use the same system for
> >> identifying the input spectrum - these CV terms have only been
> >> added relatively recently. I did not realise that the nativeID
> >> attribute had been specified to this level, since there is no
> >> documentation about this in the XSD or mzML specification
> >> document.
> >>
> >> I don't think we should change the name of the attribute though,
> >> since nativeID makes sense for an element called <Spectrum> in
> >> mzML but not for an element <SpectrumIdentificationResult> in
> >> analysisXML. For referencing mzML spectra, I'm still not sure
> >> which attribute we should choose to reference since the "true"
> >> (and guaranteed unique) spectrum identifier in mzML is actually
> >> the ID attribute. I can envisage a case where instruments output
> >> mzML directly and the nativeID is not implemented sensibly. The
> >> xs:ID datatype on "ID" guarantees that these will always be
> >> unique whatever changes happen to documentation in the future or
> >> whatever tools are used to create the file.
> > I contest the term "guaranteed unique" since the one doing the
> > guaranteeing is the schema and there is no guarantee that somebody
> > runs their output through a schema validator. :) If you take the
> > validation step to the semantic validator (which is what the
> > standard demands), the nativeID term is also guaranteed to be
> > unique (and must be "implemented sensibly"), and as David suggested
> > earlier, it should be possible to add a uniqueness constraint to
> > the nativeID attribute in the schema even though it is xsd:string
> > (but uniqueness is not so helpful when the actual form of a Thermo
> > RAW id must be: "controller=xsd:positiveInteger
> > scan=xsd:nonZeroInteger"). The name of the attribute doesn't bother
> > me, but I don't understand your reasoning for not changing it. :)
> >
> >
> >> So I agree with Matt but I don't want to change the schema :-)
> >> I'm happy to add something to the documentation specifying how
> >> different identifiers should be implemented, following the rules
> >> in the psi-ms CV.
> > If the attribute name doesn't change, only the xsd documentation
> > needs to be updated to reflect which attribute the spectrumID
> > points to and that it can be used even if the input spectra file is
> > not mzML!
> >
> >
> >> In terms of 2), we had made a decision in the past that we would
> >> simply create terms as we need them in PSI-PI, rather than
> >> worrying if they should be common between psi-ms and psi-pi and
> >> trying to coordinate updates across groups. If a term is present
> >> in psi-ms with the exact meaning that we want (taking into
> >> account its position in the hierarchy), I think we should just
> >> use it and update the mapping file to reference it. Are there
> >> many terms from psi-ms that we want to use?
> > It's looking like scan time (aka retention time) will be useful in
> > analysisXML as an "alternative identifier" for the special use case
> > of converting existing search results to analysisXML where a
> > reliable nativeID to the original vendor format has been lost.
> > Presumably, even in this use case a nativeID could be provided to
> > point back to a spectrum in the search engine's immediate spectra
> > input file (i.e. MGF).  If not even that is possible, either
> > spectrumID has to be optional or the use case is rather suspect. :)
> >
> >
> > Additionally, if your "spectrumID" attribute matches the "nativeID"
> >  attribute in mzML, the mapping file must require one of the
> > nativeID format terms in the file header: the specific place is TBD
> > in analysisXML, in mzML it's mapped to the fileDescription element.
> >  Remember, nativeID is always available from any input spectra
> > file, so there's no problem requiring it as long as decent
> > references to the input spectra are maintained.
> >
> > The scan time as an "alternative identifier" issue makes me wonder
> > if a "scan time native spectrum identifier" term is called for. It
> > still wouldn't solve all of the problems with David's use case
> > (i.e. if the MGF was missing RTINSECONDS attributes), but it seems
> > potentially useful.
> >
> > -Matt
> >
> >
> >> I am working on the spec document today and would like to get all
> >> issues tidied up ASAP...
> >>
> >> Cheers Andy
> >>
> >>> -----Original Message-----
> >>> From: cod...@go... [mailto:cod...@go...]
> >>> Sent: 30 November 2008 19:36
> >>> To: psi...@li...
> >>> Subject: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >>>
> >>>
> >>> Comment #56 on issue 42 by matthew....@vanderbilt.edu: Issues
> >>> with the CV http://code.google.com/p/psi-pi/issues/detail?id=42
> >>>
> >>>
> >>> Yes, I was at that meeting too. :) The one (important, IMO) use
> >>>  case we did not consider at that time is output of analysisXML
> >>>  without a corresponding mzML document. In such a case, the
> >>> mzML arbitrary id does not exist, but the nativeID does. This
> >>> fact convinces me that nativeID is a better reference than the
> >>> arbitrary id.
> >>>
> >>> The change of attribute name to nativeID is not so critical,
> >>> but I think the risk of confusing the spectrumID with the id
> >>> attribute when it actually points to the nativeID attribute is
> >>> worse than the risk of confusing the nativeID attribute with
> >>> some property of the search engine. I think the documentation
> >>> for the nativeID attribute can easily make it clear what it's
> >>> supposed to reference, especially since it's on a
> >>> spectrum-centric element; you can copy it from the mzML schema
> >>> (although I think this documentation could be improved upon):
> >>> <xs:documentation>The native identifier for the spectrum, used
> >>> by the acquisition software.</xs:documentation>
> >>>
> >>> It's good to know about the header information. The nativeID
> >>> (or whatever it's called in analysisXML) format term would go
> >>> in the spectra input definition as a CV Param required by the
> >>> mapping file.
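
The point made earlier in the thread, that uniqueness alone says little when a Thermo RAW nativeID must follow the form "controller=xsd:positiveInteger scan=xsd:nonZeroInteger", can be made concrete with a small format check. The following is a minimal Python sketch, not part of any PSI tool: the function name is hypothetical, the pattern is taken from the form quoted in this thread, and treating both fields as plain decimal digits is an assumption.

```python
import re

# Pattern for the Thermo RAW form quoted in the thread:
#   "controller=xsd:positiveInteger scan=xsd:nonZeroInteger"
# Assumption of this sketch: both values are written as plain decimal digits.
THERMO_NATIVE_ID = re.compile(r"controller=(\d+) scan=(\d+)")


def parse_thermo_native_id(native_id):
    """Return (controller, scan), or None if the string does not fit the form."""
    match = THERMO_NATIVE_ID.fullmatch(native_id)
    if match is None:
        return None
    controller, scan = int(match.group(1)), int(match.group(2))
    # Reject zero, since the quoted types are positive / non-zero integers.
    if controller == 0 or scan == 0:
        return None
    return controller, scan


print(parse_thermo_native_id("controller=1 scan=42"))
print(parse_thermo_native_id("scan=42"))
```

A semantic validator could apply a check of this kind once the nativeID format term in the file header says which form to expect.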