From: Matthew C. <mat...@va...> - 2008-02-25 18:18:32
I am thinking that this externalID, or something like it, is a good idea as a replacement for scanNumber (in addition to 0-based index). The type and format of the identifier would change depending on the context, but that type and format would be well-defined by the specification (or possibly CV) for that context.

All contexts would share some common traits: certainly the identifier must be unique, and perhaps it should be sortable according to a context-dependent predicate. For the Thermo context, the identifier would be a positive integer and the predicate is trivial. For a WIFF-file context, the identifier could have a well-defined pattern like: "sample.period.cycle.experiment". The predicate to sort this is more complicated, but not difficult (a lexicographical sort won't cut it). For spectra from a 4000 Series source, the identifier could have a well-defined pattern like: "chromatogram_name.ms2_job_run.fraction.ms2_spectrum". The predicate for this would be similar to the previous one. I got the last two examples from the Protein Pilot documentation, I hope that's ok. :)

Is it reasonable to assume that many, if not most (or all) vendor formats have some unique (per-run) identifier like these? MALDI spots could probably be handled like this too.

We could define these patterns with a cvParam, but then the ID couldn't be used as an attribute in the index, and I would lobby for that as much as Darren. It is a reasonable use case to go looking for a spectrum based on its original source identifier (and expect equal performance to looking for it by 0-based index).

-Matt

Kessner, Darren E. wrote:
> Good question. Here are a couple choices:
>
> 1) Mandate that in the Thermo case, externalID has to be the scan number, and the cvParam "Thermo Scientific" has to be put in some particular location, so that readers may make this assumption.
>
> 2) Use scanNumber and make it optional.
>
> My feeling is that #1 may be dangerous, and #2 is ugly.
> At the moment I don't really have a strong feeling one way or the other (or maybe it's equally weak feelings...).
>
> Darren
>
> -----Original Message-----
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
> Sent: Friday, February 22, 2008 10:43 AM
> To: Mass spectrometry standard development
> Subject: Re: [Psidev-ms-dev] indexedmzML: scanNumber / acquisition number
>
> Going back to your externalID idea, it would need to be optional on a per-offset basis in case some spectra were sums/averages and others not.
>
> On top of that, how would this use case (which our software uses as well, so to some extent I'm playing devil's advocate here) handle a RAW->mzML converter that doesn't write the externalIDs or writes something else instead of the pure scan number?
>
> -Matt
>
> Kessner, Darren E. wrote:
>> It's not that we need to know the scan number at file open, it's that we *do* know the scan number.
>>
>> We have command line tools that will extract various things (scan metadata, full scan binary data) by scanNumber. Being able to do this is very important, since we use this in quality control scripts, debugging other tools, and automated testing of code modules.
>>
>> The general point is that in any particular case, we don't want it to be significantly less efficient to find and extract information from an mzML file than it is from a RAW or mzXML (or any other) file.
>>
>> Darren
>>
>> -----Original Message-----
>> From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
>> Sent: Friday, February 22, 2008 10:01 AM
>> To: Mass spectrometry standard development
>> Subject: Re: [Psidev-ms-dev] indexedmzML: scanNumber / acquisition number
>>
>> Why do you need to know the scan number at file open time? You don't even know if it's a Thermo spectrum (i.e. that the scan number actually corresponds to some element in the native instrument data).
>> What's wrong with using the 0-based index? Once you've accessed a scan to figure out what kind it is, you can also get the native scan number via the acquisition section (assuming you load all the spectrum-level metadata at the same time for a given spectrum, which is easy and fast).
>>
>> Your externalID is better, but still not really sensible if the spectrum being indexed is a sum/average of multiple native scans.
>>
>> -Matt
>>
>> Kessner, Darren E. wrote:
>>> Ooh -- I like that even less...
>>>
>>> Generating a new scanNumber map requires hitting the file at the location of each <spectrum>, which can take a significant amount of time when there are 10k scans in a file. The point of including an index is so that this doesn't have to be done on file open.
>>>
>>> We have existing useful tools that rely on scanNumber indexing in mzXML, so I don't think this facility should be lost in mzML.
>>>
>>> Darren
>>>
>>> -----Original Message-----
>>> From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
>>> Sent: Friday, February 22, 2008 9:44 AM
>>> To: Mass spectrometry standard development
>>> Subject: Re: [Psidev-ms-dev] indexedmzML: scanNumber / acquisition number
>>>
>>> Ugh. I don't think there should be any reference to acquisitions in the index; there is no 1:1 mapping and mapping to the first of the acquisitions is counter-intuitive. The scanNumber attribute in the index offsets should be replaced by the 0-based index (now that there is an index attribute on the spectrum) and if you want to refer to a Thermo spectrum by its original scan number, that will somehow have to be parsed from the spectrum id attribute or you can generate the index->scan mapping when reading and/or generating the file index.
>>>
>>> -Matt
>>>
>>> Kessner, Darren E. wrote:
>>>> Hi all,
>>>>
>>>> Please correct me if I'm wrong, but I believe the consensus now is to encode the Thermo scanNumber as the (first) acquisition number:
>>>>
>>>> <spectrum id="S17" index="0" msLevel="1" arrayLength="1313">
>>>>   ...
>>>>   <spectrumDescription>
>>>>     <acquisitionList count="1">
>>>>       <acquisition number="17" spectrumRef="?" sourceFileRef="?"/>
>>>>     </acquisitionList>
>>>>     ...
>>>>   </spectrumDescription>
>>>>   ...
>>>> </spectrum>
>>>>
>>>> However, when the mzML is indexed, we still have <offset> entries with attribute 'scanNumber':
>>>>
>>>> <index>
>>>>   <offset id="S17" scanNumber="17">4826</offset>
>>>> </index>
>>>>
>>>> Shall we make this:
>>>>
>>>> <offset id="S17" acquisitionNumber="17">4826</offset>
>>>>
>>>> and assume it refers to the *first* acquisition number in the <acquisitionList>?
>>>>
>>>> The use case is the same -- for efficient random access by Thermo scanNumber we need this in the <index>.
>>>>
>>>> Previously I had also proposed including the 0-based index, which was deemed unnecessary (and I agree), but someone may want it now for consistency and/or validation?
>>>>
>>>> <offset id="S17" index="0" acquisitionNumber="17">4826</offset>
>>>>
>>>> Darren
>>>>
>>>> Darren Kessner
>>>> Scientific Programmer
>>>> Dar...@cs...
>>>> <mailto:Dar...@cs...>
>>>> 310-423-9538
>>>>
>>>> Spielberg Family Center for Applied Proteomics
>>>> Cedars-Sinai Medical Center
>>>> http://www.sfcap.cshs.org/
>>>>
>>>> IMPORTANT WARNING: This message is intended for the use of the person or entity to which it is addressed and may contain information that is privileged and confidential, the disclosure of which is governed by applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this information is STRICTLY PROHIBITED.
>>>>
>>>> If you have received this message in error, please notify us immediately by calling (310) 423-6428 and destroy the related message. Thank You for your cooperation.
>>>>
>>>> ------------------------------------------------------------------------
>>>> This SF.net email is sponsored by: Microsoft
>>>> Defy all challenges. Microsoft(R) Visual Studio 2008.
>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>>> _______________________________________________
>>>> Psidev-ms-dev mailing list
>>>> Psi...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
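Editorial sketch for the top message in this thread: a minimal Python illustration of the context-dependent sort predicate and of looking up a spectrum offset by its source identifier. Everything here is an assumption made for illustration. The externalID attribute on <offset> is only the proposal under discussion, not part of any released schema; the helper names native_id_key and build_offset_map are hypothetical; and the sample identifiers assume that every dot-separated field is an integer, which may not hold for a field such as chromatogram_name.

```python
import xml.etree.ElementTree as ET


def native_id_key(identifier: str) -> tuple[int, ...]:
    """Sort key for dotted identifiers, comparing each field numerically.

    Assumes every field is an integer (e.g. "sample.period.cycle.experiment").
    """
    return tuple(int(part) for part in identifier.split("."))


def build_offset_map(index_xml: str) -> dict[str, int]:
    """Map each <offset>'s (hypothetical) externalID to its byte offset."""
    root = ET.fromstring(index_xml)
    return {o.attrib["externalID"]: int(o.text) for o in root.iter("offset")}


# Made-up index fragment; the ids and offsets are illustrative only.
INDEX_XML = """
<index>
  <offset id="S1" externalID="1.1.10.1">9120</offset>
  <offset id="S0" externalID="1.1.2.1">4826</offset>
</index>
"""

offsets = build_offset_map(INDEX_XML)

# Field-wise numeric order puts cycle 2 before cycle 10.
ordered = sorted(offsets, key=native_id_key)

# Plain string order compares "1" with "2" and so puts cycle 10 first.
lexicographic = sorted(offsets)

print(ordered)
print(lexicographic)
print(offsets["1.1.2.1"])
```

Because the identifier is an attribute in the index, the lookup in the last line is a dictionary access built once at file open, the same cost as a lookup by 0-based index.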