From: Matthew C. <mat...@va...> - 2008-12-01 16:04:02
The nativeID is intended to refer to the closest-to-native format that can be interpreted by the machine. In your pipeline, the mgf is the closest-to-native format, so yes that nativeID would and should be preserved throughout the pipeline.

Your database use case cannot use mzML ids because xsd:IDs are unique within a file, not across files. You do not have any kind of guarantee that your ids will be distinct between two mzML files, not to mention the fact that non-mzML files don't even HAVE an id. Consider the pipeline:

mgf --> SearchEngine --> analysisXML

What do you use for spectrumID? :)

-Matt

Jones, Andy wrote:
> Hi Matt,
>
> Consider the following pipeline mgf --> mzML --> SearchEngine -->
> analysisXML
>
> Having thought about this some more, I'm fairly sure that we want to
> reference the ID attribute rather than nativeID. The nativeID is
> intended to identify the source spectrum prior to conversion to mzML
> format i.e. it does not strictly identify the data represented in the
> file. The input to analysisXML is the mzML-formatted spectrum, not
> the source mgf file. If we reference the nativeID, this implies that
> the input to the SearchEngine was the mgf representation of the
> spectrum. It's a minor point that makes no difference for most XML
> implementations but the mgf formatted spectrum and the mzML formatted
> spectrum are different objects. If a database implements this, it
> will be much simpler to have a chain of inputs and outputs with
> distinct IDs, reflecting the processing that has happened at each
> stage. From a database/LIMS or file tracking point of view, this
> could be significant I think.
>
> Cheers
> Andy
>
> > -----Original Message-----
> > From: Matt Chambers [mailto:mat...@va...]
> > Sent: 01 December 2008 14:23
> > To: psi...@li...
> > Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >
> > Jones, Andy wrote:
> >> Hi all,
> >>
> >> The issues list is getting a bit messy with essentially a mailing
> >> list discussion so I'll shift the discussion back here :-)
> >>
> >> There are two points up for discussion.
> >>
> >> 1) Use of identifiers for input spectra
> >> 2) CV terms shared between psi-ms and psi-pi
> >>
> >> In terms of 1) I've worked through Matt's argument and I'm in
> >> general agreement that we would like to use the same system for
> >> identifying the input spectrum - these CV terms have only been
> >> added relatively recently. I did not realise that the nativeID
> >> attribute had been specified to this level, since there is no
> >> documentation about this in the XSD or mzML specification
> >> document.
> >>
> >> I don't think we should change the name of the attribute though,
> >> since nativeID makes sense for an element called <Spectrum> in
> >> mzML but not for an element <SpectrumIdentificationResult> in
> >> analysisXML. For referencing mzML spectra, I'm still not sure
> >> which attribute we should choose to reference since the "true"
> >> (and guaranteed unique) spectrum identifier in mzML is actually
> >> the ID attribute. I can envisage a case where instruments output
> >> mzML directly and the nativeID is not implemented sensibly. The
> >> xs:ID datatype on "ID" guarantees that these will always be
> >> unique whatever changes happen to documentation in the future or
> >> whatever tools are used to create the file.
> > I contest the term "guaranteed unique" since the one doing the
> > guaranteeing is the schema and there is no guarantee that somebody
> > runs their output through a schema validator.
> > :) If you take the
> > validation step to the semantic validator (which is what the
> > standard demands), the nativeID term is also guaranteed to be
> > unique (and must be "implemented sensibly"), and as David suggested
> > earlier, it should be possible to add a uniqueness constraint to
> > the nativeID attribute in the schema even though it is xsd:string
> > (but uniqueness is not so helpful when the actual form of a Thermo
> > RAW id must be: "controller=xsd:positiveInteger
> > scan=xsd:nonZeroInteger"). The name of the attribute doesn't bother
> > me, but I don't understand your reasoning for not changing it. :)
> >
> >
> >> So I agree with Matt but I don't want to change the schema :-)
> >> I'm happy to add something to the documentation specifying how
> >> different identifiers should be implemented, following the rules
> >> in the psi-ms CV.
> > If the attribute name doesn't change, only the xsd documentation
> > needs to be updated to reflect which attribute the spectrumID
> > points to and that it can be used even if the input spectra file is
> > not mzML!
> >
> >
> >> In terms of 2), we had made a decision in the past that we would
> >> simply create terms as we need them in PSI-PI, rather than
> >> worrying if they should be common between psi-ms and psi-pi and
> >> trying to coordinate updates across groups. If a term is present
> >> in psi-ms with the exact meaning that we want (taking into
> >> account its position in the hierarchy), I think we should just
> >> use it and update the mapping file to reference it. Are there
> >> many terms from psi-ms that we want to use?
> > It's looking like scan time (aka retention time) will be useful in
> > analysisXML as an "alternative identifier" for the special use case
> > of converting existing search results to analysisXML where a
> > reliable nativeID to the original vendor format has been lost.
> > Presumably, even in this use case a nativeID could be provided to
> > point back to a spectrum in the search engine's immediate spectra
> > input file (i.e. MGF). If not even that is possible, either
> > spectrumID has to be optional or the use case is rather suspect. :)
> >
> >
> > Additionally, if your "spectrumID" attribute matches the "nativeID"
> > attribute in mzML, the mapping file must require one of the
> > nativeID format terms in the file header: the specific place is TBD
> > in analysisXML, in mzML it's mapped to the fileDescription element.
> > Remember, nativeID is always available from any input spectra
> > file, so there's no problem requiring it as long as decent
> > references to the input spectra are maintained.
> >
> > The scan time as an "alternative identifier" issue makes me wonder
> > if a "scan time native spectrum identifier" term is called for. It
> > still wouldn't solve all of the problems with David's use case
> > (i.e. if the MGF was missing RTINSECONDS attributes), but it seems
> > potentially useful.
> >
> > -Matt
> >
> >
> >> I am working on the spec document today and would like to get all
> >> issues tidied up ASAP...
> >>
> >> Cheers
> >> Andy
> >>
> >>
> >>> -----Original Message-----
> >>> From: cod...@go... [mailto:cod...@go...]
> >>> Sent: 30 November 2008 19:36
> >>> To: psi...@li...
> >>> Subject: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >>>
> >>>
> >>> Comment #56 on issue 42 by matthew....@vanderbilt.edu: Issues
> >>> with the CV http://code.google.com/p/psi-pi/issues/detail?id=42
> >>>
> >>>
> >>> Yes, I was at that meeting too. :) The one (important, IMO) use
> >>> case we did not consider at that time is output of analysisXML
> >>> without a corresponding mzML document. In such a case, the
> >>> mzML arbitrary id does not exist, but the nativeID does. This
> >>> fact convinces me that nativeID is a better reference than the
> >>> arbitrary id.
> >>>
> >>> The change of attribute name to nativeID is not so critical,
> >>> but I think the risk of confusing the spectrumID with the id
> >>> attribute when it actually points to the nativeID attribute is
> >>> worse than the risk of confusing the nativeID attribute with
> >>> some property of the search engine. I think the documentation
> >>> for the nativeID attribute can easily make it clear what it's
> >>> supposed to reference, especially since it's on a
> >>> spectrum-centric element; you can copy it from the mzML schema
> >>> (although I think this documentation could be improved upon):
> >>> <xs:documentation>The native identifier for the spectrum, used
> >>> by the acquisition software.</xs:documentation>
> >>>
> >>> It's good to know about the header information. The nativeID
> >>> (or whatever it's called in analysisXML) format term would go
> >>> in the spectra input definition as a CV Param required by the
> >>> mapping file.
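
The point made at the top of this message (an id that is unique within one file says nothing about uniqueness across files) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of mzML, analysisXML, or any PSI tool: the names `SpectrumRef`, `parse_thermo_native_id`, and `duplicated`, and the file names, are hypothetical. The pattern follows the Thermo RAW form quoted in the thread ("controller=... scan=..."), and treating both numbers as positive integers is an assumption.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional, Tuple

# Hypothetical pattern for the Thermo RAW form quoted in the thread:
# "controller=<integer> scan=<integer>".
THERMO_NATIVE_ID = re.compile(r"controller=([0-9]+) scan=([0-9]+)")


class SpectrumRef(NamedTuple):
    """A cross-file spectrum reference: source file plus nativeID."""
    source_file: str
    native_id: str


def parse_thermo_native_id(native_id: str) -> Optional[Tuple[int, int]]:
    """Return (controller, scan), or None if the string does not fit the form."""
    match = THERMO_NATIVE_ID.fullmatch(native_id)
    if match is None:
        return None
    controller, scan = int(match.group(1)), int(match.group(2))
    # Assumption: both values must be positive.
    if controller < 1 or scan < 1:
        return None
    return controller, scan


def duplicated(keys):
    """Return the set of keys that occur more than once."""
    counts = Counter(keys)
    return {key for key, n in counts.items() if n > 1}


# The same nativeID legitimately appears in two different source files.
refs = [
    SpectrumRef("run_a.RAW", "controller=1 scan=42"),
    SpectrumRef("run_b.RAW", "controller=1 scan=42"),
    SpectrumRef("run_b.RAW", "controller=1 scan=43"),
]

# Keyed by id alone, the two runs collide; keyed by (file, id), they do not.
ids_only_collisions = duplicated(ref.native_id for ref in refs)
pair_collisions = duplicated(refs)
```

Here `ids_only_collisions` contains "controller=1 scan=42" while `pair_collisions` is empty, which is why a database spanning several input files would need the source file as part of the key, whichever id attribute is chosen.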