Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV

Hi Matt,

For now I've gone with the following documentation in the XSD:

"The locally unique id for the spectrum in the spectra data set specified by
SpectraData_ref. External guidelines are provided on the use of consistent
identifiers for spectra in different external formats."

In the draft of the spec document, I'll add a section in "Resolved Design and
scope issues" on the discussion of unique identifiers, including documentation
about using nativeID for mzML and systems for identifying spectra in other
formats. Do you know if there is any documentation about the system designed for
mzML?

I don't want to "hard-code" this guideline in the XSD, since I would like to get
a bit more input from mzML developers during the doc process. I looked at the
mzML XSD docs again, and the ID attribute does say: "A unique identifier for
this spectrum. It should be expected that external files may use this identifier
together with the mzML filename or accession to reference a particular
spectrum."

If we hard-code a conflicting guideline in the axml XSD, people will get
confused, so I would rather document this in more detail in the spec doc, where
we can actually discuss the reasoning for this.

Cheers
Andy
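
[Editorial sketch, not part of the original message.] The identification scheme discussed in this thread combines the location of the input spectra file with a spectrum identifier in that file's nativeID format. A minimal Python illustration follows, assuming a whitespace-separated name=integer layout such as "controller=0 scan=17". The names SpectrumKey, parse_native_id and make_key, the scan numbers, and the source1.mzML path are all hypothetical and do not come from the analysisXML or mzML specifications.

```python
from typing import Dict, NamedTuple


class SpectrumKey(NamedTuple):
    """Hypothetical composite key: input file location plus spectrum id."""
    source_url: str
    spectrum_id: str


def parse_native_id(native_id: str) -> Dict[str, int]:
    """Split a nativeID such as 'controller=0 scan=17' into integer fields.

    Assumes whitespace-separated name=integer tokens; anything else is rejected.
    """
    fields: Dict[str, int] = {}
    for token in native_id.split():
        name, sep, value = token.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"malformed nativeID token: {token!r}")
        fields[name] = int(value)  # int() raises ValueError on non-integers
    return fields


def make_key(source_url: str, spectrum_id: str) -> SpectrumKey:
    """Pair the spectrum id with the file it refers to.

    The id alone is only unique within one file, so the file location
    has to travel with it.
    """
    return SpectrumKey(source_url, spectrum_id)


# Same scan reached through two different processed files (made-up values).
mzxml_key = make_key("file:///processed/source1.mzXML", "scan=17")
mzml_key = make_key("file:///processed/source1.mzML", "controller=0 scan=17")
```

The two keys differ because the file locations differ, so a database can still tell the processing stages apart, while the parsed scan number stays available for looking the spectrum up in the raw data.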

> -----Original Message-----
> From: Matthew Chambers [mailto:mat...@va...]
> Sent: 01 December 2008 19:44
> To: psi...@li...
> Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> 
> Pfft, I'm such a lazy implementor I wait till the last minute to put my
> output through a schema validator, much less a semantic validator. ;)
> The truth is that spectrum::id was formulated before nativeID and AFAIK
> we kept it so if somebody wants an arbitrary xsd:ID for whatever reason
> they can have it.
> 
> This is a small documentation change indeed, but it's a pretty big
> semantic change (how writers/readers work and how much information they
> expect to have access to). How's this for spectrumID:
> "Uniquely identifies a spectrum in the input spectra using the nativeID
> format defined for that file type. For mzML input, the format is
> inherited from the input and spectrumID will match the nativeID of a
> spectrum in the mzML."
> 
> Thanks,
> -Matt
> 
> 
> Jones, Andy wrote:
> >
> > Okay, maybe I can be persuaded by this argument, since as you say the
> > actual identifier is created by the combination of id + file ref so it
> > is clear that we're referencing the mzML version of the spectrum.
> >
> > I'm still slightly wary that not all implementers will read the
> > documentation and put out files using the semantic validator though.
> > If the nativeID is genuinely always going to be unique within file
> > then fine, but why have the ID attribute at all in mzML...? It sounds
> > like it was created so that there is a guaranteed unique identifier
> > for each spectrum, that can be XSD validated.
> >
> > The bottom line is though that this is a tiny documentation change
> > either way, with pretty small pros/cons! Perhaps we've spent long
> > enough on it :-) I can make this change to the documentation (if mzML
> > input, give nativeID) then we can re-visit the whole identification
> > issue during the doc process. Seem reasonable?
> >
> > cheers
> > Andy
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Matthew Chambers [mailto:mat...@va...]
> > Sent: Mon 01/12/2008 17:39
> > To: psi...@li...
> > Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> >
> > As you say, the file URL in the analysisXML header must be combined with
> > the spectrum identifier to link back to the input spectrum. The nativeID
> > works no matter what the format of the input file was. The mzML id only
> > works for mzML input. Consider the following chain of input file
> > processing:
> > file:///raw/source1.raw
> > file:///processed/source1.mzXML
> > file:///searched/source1.analysisXML
> >
> > The analysisXML header would point to file:///processed/source1.mzXML as
> > the input file and would use mzXML's nativeID, which is:
> > "scan=xsd:nonNegativeInteger"
> >
> > Now substitute mzML instead of mzXML. The nativeID can now use the true
> > native format for thermo RAW: "controller=xsd:nonNegativeInteger
> > scan=xsd:positiveInteger"
> >
> > There is no difference in file tracking, it's only the fidelity of the
> > nativeID that has improved. The nativeID works transparently in mzML: if
> > the analysisXML header points to an mzML and uses the nativeID, that's
> > just as effective at finding a spectrum as the arbitrary id. It just so
> > happens that the preserved nativeID format allows machines to easily
> > look up the spectrum in the raw data as well as the processed data
> > despite the fact that - as you say - they're different "objects." As
> > long as your database maintains both a processing prefix/URL as well as
> > the nativeID, it'll be golden.
> >
> >
> > >  If we re-use the native identifier this implies the input to the
> > >  search engine was the mgf file, which was not the case...
> > As above, you must maintain an identifier to the file that the
> > spectrum identifier pertains to - which explicitly says what the input
> > to the search was. What implication is possible given that information?
> >
> > -Matt
> >
> >
> > Jones, Andy wrote:
> > >  Hi Matt,
> > >
> > >  It's a question of identifying objects as they are traced throughout
> > >  a process. In AnalysisXML we are not representing the spectrum
> > >  object, we are making an explicit reference to the spectrum as it is
> > >  represented in mzML. The nativeID is preserved, since it is
> > >  referenced as the input to the process that converts mgf to mzML.
> > >
> > > > HAVE an id. Consider the pipeline: mgf --> SearchEngine -->
> > > > analysisXML What do you use for spectrumID? :)
> > >
> > >  In this case, we use the identifier as specified by the "native ID"
> > >  system you've defined because the input to the search was the mgf
> > >  spectrum object. I agree that this system looks good and we will use
> > >  it for each of the vendor-specific formats - in effect I want to add
> > >  one more mapping for mzML, mapping to the mzML ID ;-) Remember, we're
> > >  not trying to represent the spectrum in analysisXML, all we are
> > >  saying is what spectrum did the search engine take as input.
> > >
> > >  I view the conversion of an mgf spectrum to an mzML spectrum as a
> > >  process that has changed the spectrum object. As such, the nativeID
> > >  in mzML references the input to the conversion process and the mzML
> > >  ID attribute references the (output) spectrum as it is in the file.
> > >  Correct use of the identifiers maintains this trace.
> > >
> > > > Your database use case cannot use mzML ids because xsd:IDs are
> > > > unique within a file, not across files.
> > >
> > >  This is true but is solved easily by prefixing all identifiers with a
> > >  unique string (e.g. the file URL). The problem is worse for nativeID
> > >  because this cannot be done - the mgf version of the spectrum and the
> > >  mzML version of the spectrum are fundamentally different (possibly
> > >  even have different precisions) so they need different identifiers.
> > >  If we re-use the native identifier this implies the input to the
> > >  search engine was the mgf file, which was not the case...
> > >
> > >  Cheers Andy
> > >
> > >
> > >
> > > > > -----Original Message-----
> > > > > From: Matthew Chambers [mailto:mat...@va...]
> > > > > Sent: 01 December 2008 16:03
> > > > > To: psi...@li...
> > > > > Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> > > >
> > > > The nativeID is intended to refer to the closest-to-native format
> > > > that can be interpreted by the machine. In your pipeline, the mgf
> > > > is the closest-to-native format, so yes that nativeID would and
> > > > should be preserved throughout the pipeline.  Your database use
> > > > case cannot use mzML ids because xsd:IDs are unique within a file,
> > > > not across files. You do not have any kind of guarantee that your
> > > > ids will be distinct between two mzML files, not to mention the
> > > > fact that non-mzML files don't even HAVE an id. Consider the
> > > > pipeline: mgf --> SearchEngine --> analysisXML What do you use for
> > > > spectrumID? :)
> > > >
> > > > -Matt
> > > >
> > > >
> > > > Jones, Andy wrote:
> > > >> Hi Matt,
> > > >>
> > > >> Consider the following pipeline mgf --> mzML --> SearchEngine -->
> > > >>  analysisXML
> > > >>
> > > >> Having thought about this some more, I'm fairly sure that we want
> > > >> to reference the ID attribute rather than nativeID. The nativeID
> > > >> is intended to identify the source spectrum prior to conversion
> > > >> to mzML format i.e. it does not strictly identify the data
> > > >> represented in the file. The input to analysisXML is the
> > > >> mzML-formatted spectrum, not the source mgf file. If we reference
> > > >> the nativeID, this implies that the input to the SearchEngine was
> > > >> the mgf representation of the spectrum. It's a minor point that
> > > >> makes no difference for most XML implementations but the mgf
> > > >> formatted spectrum and the mzML formatted spectrum are different
> > > >> objects. If a database implements this, it will be much simpler
> > > >> to have a chain of inputs and outputs with distinct IDs,
> > > > >> reflecting the processing that has happened at each stage. From a
> > > >> database/LIMS or file tracking point of view, this could be
> > > >> significant I think.
> > > >>
> > > >> Cheers Andy
> > > > >>> -----Original Message-----
> > > > >>> From: Matt Chambers [mailto:mat...@va...]
> > > > >>> Sent: 01 December 2008 14:23
> > > > >>> To: psi...@li...
> > > > >>> Subject: Re: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> > > >>>
> > > >>> Jones, Andy wrote:
> > > >>>> Hi all,
> > > >>>>
> > > >>>> The issues list is getting a bit messy with essentially a
> > > >>>> mailing list discussion so I'll shift the discussion back
> > > >>>> here :-)
> > > >>>>
> > > >>>> There are two points up for discussion.
> > > >>>>
> > > > >>>> 1) Use of identifiers for input spectra
> > > > >>>> 2) CV terms shared between psi-ms and psi-pi
> > > >>>>
> > > >>>> In terms of 1) I've worked through Matt's argument and I'm in
> > > >>>>  general agreement that we would like to use the same system
> > > >>>> for identifying the input spectrum - these CV terms have only
> > > >>>> been added relatively recently. I did not realise that the
> > > >>>> nativeID attribute had been specified to this level, since
> > > > >>>> there is no documentation about this in the XSD or mzML
> > > >>>> specification document.
> > > >>>>
> > > >>>> I don't think we should change the name of the attribute
> > > >>>> though, since nativeID makes sense for an element called
> > > >>>> <Spectrum> in mzML but not for an element
> > > >>>> <SpectrumIdentificationResult> in analysisXML. For
> > > >>>> referencing mzML spectra, I'm still not sure which attribute
> > > >>>> we should choose to reference since the "true" (and
> > > >>>> guaranteed unique) spectrum identifier in mzML is actually
> > > >>>> the ID attribute. I can envisage a case where instruments
> > > >>>> output mzML directly and the nativeID is not implemented
> > > >>>> sensibly. The xs:ID datatype on "ID" guarantees that these
> > > >>>> will always be unique whatever changes happen to
> > > >>>> documentation in the future or whatever tools are used to
> > > >>>> create the file.
> > > >>> I contest the term "guaranteed unique" since the one doing the
> > > >>> guaranteeing is the schema and there is no guarantee that
> > > >>> somebody runs their output through a schema validator. :) If
> > > >>> you take the validation step to the semantic validator (which
> > > >>> is what the standard demands), the nativeID term is also
> > > >>> guaranteed to be unique (and must be "implemented sensibly"),
> > > >>> and as David suggested earlier, it should be possible to add a
> > > >>> uniqueness constraint to the nativeID attribute in the schema
> > > >>> even though it is xsd:string (but uniqueness is not so helpful
> > > >>> when the actual form of a Thermo RAW id must be:
> > > >>> "controller=xsd:positiveInteger scan=xsd:nonZeroInteger"). The
> > > >>> name of the attribute doesn't bother me, but I don't understand
> > > >>> your reasoning for not changing it. :)
> > > >>>
> > > >>>
> > > >>>> So I agree with Matt but I don't want to change the schema
> > > >>>> :-) I'm happy to add something to the documentation
> > > >>>> specifying how different identifiers should be implemented,
> > > >>>> following the rules in the psi-ms CV.
> > > >>> If the attribute name doesn't change, only the xsd
> > > >>> documentation needs to be updated to reflect which attribute
> > > >>> the spectrumID points to and that it can be used even if the
> > > >>> input spectra file is not mzML!
> > > >>>
> > > >>>
> > > >>>> In terms of 2), we had made a decision in the past that we
> > > >>>> would simply create terms as we need them in PSI-PI, rather
> > > >>>> than worrying if they should be common between psi-ms and
> > > >>>> psi-pi and trying to coordinate updates across groups. If a
> > > >>>> term is present in psi-ms with the exact meaning that we want
> > > >>>> (taking into account its position in the hierarchy), I think
> > > >>>> we should just use it and update the mapping file to
> > > >>>> reference it. Are there many terms from psi-ms that we want
> > > >>>> to use?
> > > >>> It's looking like scan time (aka retention time) will be useful
> > > >>> in analysisXML as an "alternative identifier" for the special
> > > >>> use case of converting existing search results to analysisXML
> > > >>> where a reliable nativeID to the original vendor format has
> > > >>> been lost. Presumably, even in this use case a nativeID could
> > > >>> be provided to point back to a spectrum in the search engine's
> > > >>> immediate spectra input file (i.e. MGF).  If not even that is
> > > >>> possible, either spectrumID has to be optional or the use case
> > > >>> is rather suspect. :)
> > > >>>
> > > >>>
> > > >>> Additionally, if your "spectrumID" attribute matches the
> > > >>> "nativeID" attribute in mzML, the mapping file must require one
> > > >>> of the nativeID format terms in the file header: the specific
> > > >>> place is TBD in analysisXML, in mzML it's mapped to the
> > > >>> fileDescription element. Remember, nativeID is always available
> > > >>> from any input spectra file, so there's no problem requiring it
> > > >>> as long as decent references to the input spectra are
> > > >>> maintained.
> > > >>>
> > > >>> The scan time as an "alternative identifier" issue makes me
> > > >>> wonder if a "scan time native spectrum identifier" term is
> > > >>> called for. It still wouldn't solve all of the problems with
> > > >>> David's use case (i.e. if the MGF was missing RTINSECONDS
> > > >>> attributes), but it seems potentially useful.
> > > >>>
> > > >>> -Matt
> > > >>>
> > > >>>
> > > >>>> I am working on the spec document today and would like to get
> > > >>>> all issues tidied up ASAP... Cheers Andy
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: cod...@go... [mailto:cod...@go...]
> > > > >>>>> Sent: 30 November 2008 19:36
> > > > >>>>> To: psi...@li...
> > > > >>>>> Subject: [Psidev-pi-dev] Issue 42 in psi-pi: Issues with the CV
> > > >>>>>
> > > >>>>>
> > > >>>>> Comment #56 on issue 42 by matthew....@vanderbilt.edu:
> > > >>>>> Issues with the CV
> > > >>>>> http://code.google.com/p/psi-pi/issues/detail?id=42
> > > >>>>>
> > > >>>>>
> > > >>>>> Yes, I was at that meeting too. :) The one (important, IMO)
> > > >>>>> use case we did not consider at that time is output of
> > > >>>>> analysisXML without a corresponding mzML document. In such
> > > >>>>> a case, the mzML arbitrary id does not exist, but the
> > > >>>>> nativeID does. This fact convinces me that nativeID is a
> > > >>>>> better reference than the arbitrary id.
> > > >>>>>
> > > >>>>> The change of attribute name to nativeID is not so
> > > >>>>> critical, but I think the risk of confusing the spectrumID
> > > >>>>> with the id attribute when it actually points to the
> > > >>>>> nativeID attribute is worse than the risk of confusing the
> > > >>>>> nativeID attribute with some property of the search engine.
> > > >>>>> I think the documentation for the nativeID attribute can
> > > >>>>> easily make it clear what it's supposed to reference,
> > > >>>>> especially since it's on a spectrum-centric element; you
> > > >>>>> can copy it from the mzML schema (although I think this
> > > >>>>> documentation could be improved upon):
> > > >>>>> <xs:documentation>The native identifier for the spectrum,
> > > >>>>> used by the acquisition software.</xs:documentation>
> > > >>>>>
> > > >>>>> It's good to know about the header information. The
> > > >>>>> nativeID (or whatever it's called in analysisXML) format
> > > >>>>> term would go in the spectra input definition as a CV Param
> > > >>>>> required by the mapping file.
> > > >
> > > >
> > > > -------------------------------------------------------------------------
> > > > This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> > > > Build the coolest Linux based applications with Moblin SDK & win great prizes
> > > > Grand prize is a trip for two to an Open Source event anywhere in the world
> > > > http://moblin-contest.org/redirect.php?banner_id=100&url=/
> > > > _______________________________________________
> > > > Psidev-pi-dev mailing list
> > > > Psi...@li...
> > > > https://lists.sourceforge.net/lists/listinfo/psidev-pi-dev