Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV

Hi Matt,

It's a question of identifying objects as they are traced throughout a process.
In AnalysisXML we are not representing the spectrum object, we are making an
explicit reference to the spectrum as it is represented in mzML. The nativeID is
preserved, since it is referenced as the input to the process that converts mgf
to mzML. 

> HAVE an id. Consider the pipeline: mgf --> SearchEngine --> analysisXML
> What do you use for spectrumID? :)

In this case, we use the identifier as specified by the "native ID" system
you've defined because the input to the search was the mgf spectrum object. I
agree that this system looks good and we will use it for each of the
vendor-specific formats - in effect I want to add one more mapping for mzML,
mapping to the mzML ID ;-) Remember, we're not trying to represent the spectrum
in analysisXML; all we are saying is which spectrum the search engine took as
input.

I view the conversion of an mgf spectrum to an mzML spectrum as a process that
has changed the spectrum object. As such, the nativeID in mzML references the
input to the conversion process and the mzML ID attribute references the
(output) spectrum as it is in the file. Correct use of the identifiers maintains
this trace.
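
To make the trace concrete, here is a minimal Python sketch of the idea, not
anything defined by mzML or analysisXML. The SpectrumRecord class, the trace
function and the identifier values ("index=3", "S4") are hypothetical and only
illustrate that each converted spectrum keeps its own identifier plus a pointer
to the spectrum it was derived from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpectrumRecord:
    """Hypothetical record for one representation of a spectrum."""
    file_format: str   # e.g. "mgf" or "mzML"
    identifier: str    # identifier of the spectrum within its own file
    derived_from: Optional["SpectrumRecord"] = None  # input to the conversion


def trace(record: SpectrumRecord) -> List[str]:
    """Return the chain of representations, oldest first."""
    chain = []
    current = record
    while current is not None:
        chain.append(f"{current.file_format}:{current.identifier}")
        current = current.derived_from
    return list(reversed(chain))


# Made-up identifiers: the mgf spectrum is the input to the conversion,
# the mzML spectrum is its output and the input to the search engine.
mgf_spectrum = SpectrumRecord("mgf", "index=3")
mzml_spectrum = SpectrumRecord("mzML", "S4", derived_from=mgf_spectrum)

print(trace(mzml_spectrum))
```

In this sketch the search result would point at mzml_spectrum, and the mgf
origin stays recoverable through derived_from.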

> Your database use case cannot use
> mzML ids because xsd:IDs are unique within a file, not across files.

This is true but is solved easily by prefixing all identifiers with a unique
string (e.g. the file URL). The problem is worse for nativeID because this
cannot be done - the mgf version of the spectrum and the mzML version of the
spectrum are fundamentally different (possibly even with different precisions)
so they need different identifiers. If we re-use the native identifier this
implies the input to the search engine was the mgf file, which was not the
case...
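
The prefixing approach mentioned above could look roughly like the following
Python sketch. The function name global_spectrum_id, the "#" separator and the
file URLs are assumptions made for illustration; nothing here is prescribed by
mzML or analysisXML.

```python
def global_spectrum_id(file_url: str, local_id: str) -> str:
    """Combine a file URL with a file-local ID to get a cross-file key.

    The "#" separator is an arbitrary choice for this sketch.
    """
    if not file_url or not local_id:
        raise ValueError("both file_url and local_id are required")
    return f"{file_url}#{local_id}"


# Two hypothetical mzML files that both contain a spectrum with id "S1".
id_a = global_spectrum_id("file:///data/run_a.mzML", "S1")
id_b = global_spectrum_id("file:///data/run_b.mzML", "S1")

print(id_a)
print(id_b)
print(id_a != id_b)  # the local IDs clash, the prefixed IDs do not
```

A database could then use the prefixed string as its key while the file itself
keeps the short xsd:ID.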

Cheers
Andy

> -----Original Message-----
> From: Matthew Chambers [mailto:mat...@va...]
> Sent: 01 December 2008 16:03
> To: psi...@li...
> Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> 
> The nativeID is intended to refer to the closest-to-native format that
> can be interpreted by the machine. In your pipeline, the mgf is the
> closest-to-native format, so yes that nativeID would and should be
> preserved throughout the pipeline.  Your database use case cannot use
> mzML ids because xsd:IDs are unique within a file, not across files. You
> do not have any kind of guarantee that your ids will be distinct between
> two mzML files, not to mention the fact that non-mzML files don't even
> HAVE an id. Consider the pipeline: mgf --> SearchEngine --> analysisXML
> What do you use for spectrumID? :)
> 
> -Matt
> 
> 
> Jones, Andy wrote:
> >  Hi Matt,
> >
> >  Consider the following pipeline mgf --> mzML --> SearchEngine -->
> >  analysisXML
> >
> >  Having thought about this some more, I'm fairly sure that we want to
> >  reference the ID attribute rather than nativeID. The nativeID is
> >  intended to identify the source spectrum prior to conversion to mzML
> >  format i.e. it does not strictly identify the data represented in the
> >  file. The input to analysisXML is the mzML-formatted spectrum, not
> >  the source mgf file. If we reference the nativeID, this implies that
> >  the input to the SearchEngine was the mgf representation of the
> >  spectrum. It's a minor point that makes no difference for most XML
> >  implementations but the mgf formatted spectrum and the mzML formatted
> >  spectrum are different objects. If a database implements this, it
> >  will be much simpler to have a chain of inputs and outputs with
> >  distinct IDs, reflecting the processing that has happened at each
> >  stage. From a database/LIMS or file tracking point of view, this
> >  could be significant I think.
> >
> >  Cheers
> >  Andy
> >
> > > -----Original Message-----
> > > From: Matt Chambers [mailto:mat...@va...]
> > > Sent: 01 December 2008 14:23
> > > To: psi...@li...
> > > Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> > >
> > > Jones, Andy wrote:
> > >> Hi all,
> > >>
> > >> The issues list is getting a bit messy with essentially a mailing
> > >>  list discussion so I'll shift the discussion back here :-)
> > >>
> > >> There are two points up for discussion.
> > >>
> > > >> 1) Use of identifiers for input spectra
> > > >> 2) CV terms shared between psi-ms and psi-pi
> > >>
> > >> In terms of 1) I've worked through Matt's argument and I'm in
> > >> general agreement that we would like to use the same system for
> > >> identifying the input spectrum - these CV terms have only been
> > >> added relatively recently. I did not realise that the nativeID
> > > >> attribute had been specified to this level, since there is no
> > > >> documentation about this in the XSD or mzML specification
> > > >> document.
> > >>
> > >> I don't think we should change the name of the attribute though,
> > >> since nativeID makes sense for an element called <Spectrum> in
> > >> mzML but not for an element <SpectrumIdentificationResult> in
> > >> analysisXML. For referencing mzML spectra, I'm still not sure
> > >> which attribute we should choose to reference since the "true"
> > >> (and guaranteed unique) spectrum identifier in mzML is actually
> > >> the ID attribute. I can envisage a case where instruments output
> > >> mzML directly and the nativeID is not implemented sensibly. The
> > >> xs:ID datatype on "ID" guarantees that these will always be
> > >> unique whatever changes happen to documentation in the future or
> > >> whatever tools are used to create the file.
> > > I contest the term "guaranteed unique" since the one doing the
> > > guaranteeing is the schema and there is no guarantee that somebody
> > > runs their output through a schema validator. :) If you take the
> > > validation step to the semantic validator (which is what the
> > > standard demands), the nativeID term is also guaranteed to be
> > > unique (and must be "implemented sensibly"), and as David suggested
> > > earlier, it should be possible to add a uniqueness constraint to
> > > the nativeID attribute in the schema even though it is xsd:string
> > > (but uniqueness is not so helpful when the actual form of a Thermo
> > > RAW id must be: "controller=xsd:positiveInteger
> > > scan=xsd:nonZeroInteger"). The name of the attribute doesn't bother
> > > me, but I don't understand your reasoning for not changing it. :)
> > >
> > >
> > >> So I agree with Matt but I don't want to change the schema :-)
> > >> I'm happy to add something to the documentation specifying how
> > >> different identifiers should be implemented, following the rules
> > >> in the psi-ms CV.
> > > If the attribute name doesn't change, only the xsd documentation
> > > needs to be updated to reflect which attribute the spectrumID
> > > points to and that it can be used even if the input spectra file is
> > > not mzML!
> > >
> > >
> > >> In terms of 2), we had made a decision in the past that we would
> > >> simply create terms as we need them in PSI-PI, rather than
> > >> worrying if they should be common between psi-ms and psi-pi and
> > >> trying to coordinate updates across groups. If a term is present
> > >> in psi-ms with the exact meaning that we want (taking into
> > >> account its position in the hierarchy), I think we should just
> > >> use it and update the mapping file to reference it. Are there
> > >> many terms from psi-ms that we want to use?
> > > It's looking like scan time (aka retention time) will be useful in
> > > analysisXML as an "alternative identifier" for the special use case
> > > of converting existing search results to analysisXML where a
> > > reliable nativeID to the original vendor format has been lost.
> > > Presumably, even in this use case a nativeID could be provided to
> > > point back to a spectrum in the search engine's immediate spectra
> > > input file (i.e. MGF).  If not even that is possible, either
> > > spectrumID has to be optional or the use case is rather suspect. :)
> > >
> > >
> > > Additionally, if your "spectrumID" attribute matches the "nativeID"
> > >  attribute in mzML, the mapping file must require one of the
> > > nativeID format terms in the file header: the specific place is TBD
> > > in analysisXML, in mzML it's mapped to the fileDescription element.
> > >  Remember, nativeID is always available from any input spectra
> > > file, so there's no problem requiring it as long as decent
> > > references to the input spectra are maintained.
> > >
> > > The scan time as an "alternative identifier" issue makes me wonder
> > > if a "scan time native spectrum identifier" term is called for. It
> > > still wouldn't solve all of the problems with David's use case
> > > (i.e. if the MGF was missing RTINSECONDS attributes), but it seems
> > > potentially useful.
> > >
> > > -Matt
> > >
> > >
> > > >> I am working on the spec document today and would like to get all
> > > >>  issues tidied up ASAP...
> > > >>
> > > >> Cheers
> > > >> Andy
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: cod...@go... [mailto:cod...@go...]
> > > >>> Sent: 30 November 2008 19:36
> > > >>> To: psi...@li...
> > > >>> Subject: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> > >>>
> > >>>
> > > >>> Comment #56 on issue 42 by matthew....@vanderbilt.edu: Issues with the CV
> > > >>> http://code.google.com/p/psi-pi/issues/detail?id=42
> > >>>
> > >>>
> > >>> Yes, I was at that meeting too. :) The one (important, IMO) use
> > >>>  case we did not consider at that time is output of analysisXML
> > >>>  without a corresponding mzML document. In such a case, the
> > >>> mzML arbitrary id does not exist, but the nativeID does. This
> > >>> fact convinces me that nativeID is a better reference than the
> > >>> arbitrary id.
> > >>>
> > >>> The change of attribute name to nativeID is not so critical,
> > >>> but I think the risk of confusing the spectrumID with the id
> > >>> attribute when it actually points to the nativeID attribute is
> > >>> worse than the risk of confusing the nativeID attribute with
> > >>> some property of the search engine. I think the documentation
> > >>> for the nativeID attribute can easily make it clear what it's
> > >>> supposed to reference, especially since it's on a
> > >>> spectrum-centric element; you can copy it from the mzML schema
> > > >>> (although I think this documentation could be improved upon):
> > > >>>
> > > >>> <xs:documentation>The native identifier for the spectrum, used by the acquisition software.</xs:documentation>
> > >>>
> > >>> It's good to know about the header information. The nativeID
> > >>> (or whatever it's called in analysisXML) format term would go
> > >>> in the spectra input definition as a CV Param required by the
> > >>> mapping file.