Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV

Hi Matt,

Consider the following pipeline: mgf --> mzML --> SearchEngine --> analysisXML

Having thought about this some more, I'm fairly sure that we want to reference
the ID attribute rather than nativeID. The nativeID is intended to identify the
source spectrum prior to conversion to mzML format, i.e. it does not strictly
identify the data represented in the file. The input to analysisXML is the
mzML-formatted spectrum, not the source mgf file. If we reference the nativeID,
this implies that the input to the SearchEngine was the mgf representation of
the spectrum. It's a minor point that makes no difference for most XML
implementations, but the mgf-formatted spectrum and the mzML-formatted spectrum
are different objects. If a database implements this, it will be much simpler to
have a chain of inputs and outputs with distinct IDs, reflecting the processing
that has happened at each stage. From a database/LIMS or file-tracking point of
view, I think this could be significant.
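
To make the ID versus nativeID distinction concrete, here is a minimal Python
sketch. The fragment is a simplified, hypothetical stand-in for mzML (no
namespace, only the two identifier attributes), and the helper names are
invented for illustration. It indexes spectra by ID and reports duplicates of
either attribute, which is the kind of check a consumer would need if it cannot
assume the file was validated.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, hypothetical fragment: real mzML uses a namespace and many more
# elements; only the two identifier attributes under discussion are kept.
MZML_FRAGMENT = """
<spectrumList>
  <spectrum id="S1" nativeID="controller=1 scan=1"/>
  <spectrum id="S2" nativeID="controller=1 scan=2"/>
  <spectrum id="S3" nativeID="controller=1 scan=2"/>
</spectrumList>
"""


def duplicates(values):
    """Return the sorted values that occur more than once."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def index_spectra(xml_text):
    """Map each id to its nativeID and list duplicated ids and nativeIDs."""
    root = ET.fromstring(xml_text)
    spectra = [(s.get("id"), s.get("nativeID")) for s in root.iter("spectrum")]
    by_id = dict(spectra)
    dup_ids = duplicates(i for i, _ in spectra)
    dup_native = duplicates(n for _, n in spectra)
    return by_id, dup_ids, dup_native


by_id, dup_ids, dup_native = index_spectra(MZML_FRAGMENT)
print(by_id["S2"])   # nativeID recorded for the spectrum with id "S2"
print(dup_ids)       # ids that appear more than once
print(dup_native)    # nativeIDs that appear more than once
```

In this made-up fragment the ID values are distinct while one nativeID is
repeated, so a reference by nativeID would be ambiguous where a reference by ID
would not.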

> If the attribute name doesn't change, only the xsd documentation needs
> to be updated to reflect which attribute the spectrumID points to and
> that it can be used even if the input spectra file is not mzML!

Agreed, the documentation of the attribute does need to be improved. I prefer to
have attribute names that reflect their relationship to the parent element, and
I think spectrumID is clear in what it refers to for SpectrumIdentificationResult.

> Additionally, if your "spectrumID" attribute matches the "nativeID"
> attribute in mzML, the mapping file must require one of the nativeID
> format terms in the file header: the specific place is TBD in
> analysisXML, in mzML it's mapped to the fileDescription element.
> Remember, nativeID is always available from any input spectra file, so
> there's no problem requiring it as long as decent references to the
> input spectra are maintained.

I'll take a look at the mzML mapping file and see what we need to do.
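
As a rough illustration of why declaring the nativeID format matters: a consumer
that knows the format can parse the identifier instead of treating it as an
opaque string. The sketch below handles only the Thermo RAW pattern quoted
further down in this thread ("controller=... scan=..."). The function name is
hypothetical, and "nonZeroInteger" is not a standard XSD type, so it is
interpreted here as any integer other than zero; that interpretation is an
assumption.

```python
import re

# Pattern for the Thermo RAW form quoted in the thread:
#   "controller=xsd:positiveInteger scan=xsd:nonZeroInteger"
_THERMO_NATIVE_ID = re.compile(r"controller=(\d+) scan=([+-]?\d+)")


def parse_thermo_native_id(native_id):
    """Return (controller, scan) if native_id fits the pattern, else None."""
    match = _THERMO_NATIVE_ID.fullmatch(native_id)
    if match is None:
        return None
    controller = int(match.group(1))
    scan = int(match.group(2))
    # Assumption: controller must be >= 1 and scan must not be zero.
    if controller < 1 or scan == 0:
        return None
    return controller, scan


print(parse_thermo_native_id("controller=1 scan=42"))
print(parse_thermo_native_id("scan=42"))
```

A string that does not fit the declared format is rejected rather than silently
accepted, which is the behaviour a semantic validator would need.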

Cheers
Andy

> -----Original Message-----
> From: Matt Chambers [mailto:mat...@va...]
> Sent: 01 December 2008 14:23
> To: psi...@li...
> Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> 
> Jones, Andy wrote:
> >  Hi all,
> >
> >  The issues list is getting a bit messy with essentially a mailing
> >  list discussion so I'll shift the discussion back here :-)
> >
> >  There are two points up for discussion.
> >
> >  1) Use of identifiers for input spectra
> >  2) CV terms shared between psi-ms and psi-pi
> >
> >  In terms of 1) I've worked through Matt's argument and I'm in general
> >  agreement that we would like to use the same system for identifying
> >  the input spectrum - these CV terms have only been added relatively
> >  recently. I did not realise that the nativeID attribute had been
> >  specified to this level, since there is no documentation about this
> >  in the XSD or mzML specification document.
> >
> >  I don't think we should change the name of the attribute though,
> >  since nativeID makes sense for an element called <Spectrum> in mzML
> >  but not for an element <SpectrumIdentificationResult> in analysisXML.
> >  For referencing mzML spectra, I'm still not sure which attribute we
> >  should choose to reference since the "true" (and guaranteed unique)
> >  spectrum identifier in mzML is actually the ID attribute. I can
> >  envisage a case where instruments output mzML directly and the
> >  nativeID is not implemented sensibly. The xs:ID datatype on "ID"
> >  guarantees that these will always be unique whatever changes happen
> >  to documentation in the future or whatever tools are used to create
> >  the file.
> I contest the term "guaranteed unique" since the one doing the
> guaranteeing is the schema and there is no guarantee that somebody runs
> their output through a schema validator. :) If you take the validation
> step to the semantic validator (which is what the standard demands), the
> nativeID term is also guaranteed to be unique (and must be "implemented
> sensibly"), and as David suggested earlier, it should be possible to add
> a uniqueness constraint to the nativeID attribute in the schema even
> though it is xsd:string (but uniqueness is not so helpful when the
> actual form of a Thermo RAW id must be: "controller=xsd:positiveInteger
> scan=xsd:nonZeroInteger"). The name of the attribute doesn't bother me,
> but I don't understand your reasoning for not changing it. :)
> 
> 
> >  So I agree with Matt but I don't want to change the schema :-) I'm
> >  happy to add something to the documentation specifying how different
> >  identifiers should be implemented, following the rules in the psi-ms
> >  CV.
> If the attribute name doesn't change, only the xsd documentation needs
> to be updated to reflect which attribute the spectrumID points to and
> that it can be used even if the input spectra file is not mzML!
> 
> 
> >  In terms of 2), we had made a decision in the past that we would
> >  simply create terms as we need them in PSI-PI, rather than worrying
> >  if they should be common between psi-ms and psi-pi and trying to
> >  coordinate updates across groups. If a term is present in psi-ms with
> >  the exact meaning that we want (taking into account its position in
> >  the hierarchy), I think we should just use it and update the mapping
> >  file to reference it. Are there many terms from psi-ms that we want
> >  to use?
> It's looking like scan time (aka retention time) will be useful in
> analysisXML as an "alternative identifier" for the special use case of
> converting existing search results to analysisXML where a reliable
> nativeID to the original vendor format has been lost. Presumably, even
> in this use case a nativeID could be provided to point back to a
> spectrum in the search engine's immediate spectra input file (i.e.
> MGF).  If not even that is possible, either spectrumID has to be
> optional or the use case is rather suspect. :)
> 
> Additionally, if your "spectrumID" attribute matches the "nativeID"
> attribute in mzML, the mapping file must require one of the nativeID
> format terms in the file header: the specific place is TBD in
> analysisXML, in mzML it's mapped to the fileDescription element.
> Remember, nativeID is always available from any input spectra file, so
> there's no problem requiring it as long as decent references to the
> input spectra are maintained.
> 
> The scan time as an "alternative identifier" issue makes me wonder if a
> "scan time native spectrum identifier" term is called for. It still
> wouldn't solve all of the problems with David's use case (i.e. if the
> MGF was missing RTINSECONDS attributes), but it seems potentially useful.
> 
> -Matt
> 
> 
> >  I am working on the spec document today and would like to get all
> >  issues tidied up ASAP...
> >
> >  Cheers
> >  Andy
> >
> > > -----Original Message-----
> > > From: cod...@go... [mailto:cod...@go...]
> > > Sent: 30 November 2008 19:36
> > > To: psi...@li...
> > > Subject: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> > >
> > >
> > > Comment #56 on issue 42 by matthew....@vanderbilt.edu: Issues with
> > > the CV http://code.google.com/p/psi-pi/issues/detail?id=42
> > >
> > > Yes, I was at that meeting too. :) The one (important, IMO) use
> > > case we did not consider at that time is output of analysisXML
> > > without a corresponding mzML document. In such a case, the mzML
> > > arbitrary id does not exist, but the nativeID does. This fact
> > > convinces me that nativeID is a better reference than the arbitrary
> > > id.
> > >
> > > The change of attribute name to nativeID is not so critical, but I
> > > think the risk of confusing the spectrumID with the id attribute
> > > when it actually points to the nativeID attribute is worse than the
> > > risk of confusing the nativeID attribute with some property of the
> > > search engine. I think the documentation for the nativeID attribute
> > > can easily make it clear what it's supposed to reference,
> > > especially since it's on a spectrum-centric element; you can copy
> > > it from the mzML schema (although I think this documentation could
> > > be improved upon): <xs:documentation>The native identifier for the
> > > spectrum, used by the acquisition software.</xs:documentation>
> > >
> > > It's good to know about the header information. The nativeID (or
> > > whatever it's called in analysisXML) format term would go in the
> > > spectra input definition as a CV Param required by the mapping
> > > file.
> >
> 
> 
> 