From: Kessner, D. E. <Dar...@cs...> - 2008-01-23 19:32:18
Matt, thanks for the clarification on the order guarantee for scan numbers.

If scanNumber is unique, I agree that I don't see the point of 'id'. But if we use 'id', and especially if we reference based on 'id', we should have it in the index.

As for "length": I'm using a stream-based parser that will read in a single element, so I won't be needing it. I could see it being useful for other parsers though, in particular if you want to memory-map that portion of the file before doing the text processing.

Darren

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Wednesday, January 23, 2008 11:18 AM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML indexing

Hi Darren,

Since scan numbers are guaranteed to be in ascending order in the spectrumList (established elsewhere in this mailing list), it makes sense to extend that guarantee to the index. Also, "spectrumRef" should refer to the scanNumber, not the "id" - the "id" can be any unique string and I don't see why referencing based on that is desirable. I acknowledge the consistency problem with having an "id" attribute that a "Ref" attribute ignores in favor of a non-"id" attribute, but if "id" is not simply the scan number, then it should be somewhat irrelevant (like the "title" attribute in MGF). So the <offset> can still function unambiguously with only a "scanNumber" attribute.

However, one thing I would like to see is not just an offset, but a size of the spectrum element to really make reading via the index easy and as fast as possible (instead of fumbling around with code and cpu cycles to figure out where the indexed spectrum element ends the entire block can be read with one call). Thus, I would like to see:

<offset scanNumber="19" byteOffset="3512" length="12705" />

At first glance, you might think that simply reading until the next offset would work, but that might include a bunch of unexpected elements if comments are allowed in the spectrumList, e.g.:

<spectrum>...</spectrum><comment>foo</comment><spectrum></spectrum>

If such comments aren't allowed in the list and the next element is guaranteed to be the next <spectrum> element, then the length attribute is unnecessary, so I'd like to get that clarified.

-Matt

Kessner, Darren E. wrote:

Hi all,

There are three ways to refer to a <spectrum> element -- by zero-based index into the <spectrumList>, by 'scanNumber', and by 'id'. However, the <index> currently only contains scanNumber. I would like to encode the zero-based index and the id as well in the <index> as follows:

<index name="spectrum" >
  <offset index="0" scanNumber="19" id="S19">3512</offset>
  <offset index="1" scanNumber="20" id="S20">16217</offset>
  ...
</index>

Including the zero-based index is important to enable random access to the mzML file when you don't know what scan numbers are contained in the file. The alternative is to require that the <index> entries are written in the same order as the <spectrumList> entries.

Including the 'id' in the <index> entries is necessary for efficiently dereferencing a "spectrumRef" (e.g. in <precursor> element). Without this, a dereference requires reading through the <spectrumList> to find the right 'id'. This info could be read once and cached, but this still defeats the purpose of indexing.

Darren

Darren Kessner
Scientific Programmer
Dar...@cs...
310-423-9538

Spielberg Family Center for Applied Proteomics
Cedars-Sinai Medical Center
http://www.sfcap.cshs.org/
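
For readers following the thread, here is a minimal Python sketch of the single-read access Matt describes. It assumes the scanNumber/byteOffset/length attributes exactly as proposed in his message; that layout is a proposal under discussion here, not a confirmed part of mzML. The helper names (parse_index, read_spectrum, build_demo) and the toy document are hypothetical and exist only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def parse_index(index_xml):
    """Map scanNumber -> (byteOffset, length) from <offset .../> entries.

    Assumes the attribute layout proposed in the thread (hypothetical).
    """
    root = ET.fromstring(index_xml)
    return {
        int(o.get("scanNumber")): (int(o.get("byteOffset")), int(o.get("length")))
        for o in root.iter("offset")
    }


def read_spectrum(stream, index, scan_number):
    """Fetch one <spectrum> element with a single seek and a single read."""
    byte_offset, length = index[scan_number]
    stream.seek(byte_offset)
    return stream.read(length)


def build_demo():
    """Build a tiny stand-in document and an index that matches it."""
    spectra = {
        19: b'<spectrum scanNumber="19">a</spectrum>',
        20: b'<spectrum scanNumber="20">b</spectrum>',
    }
    # A stray element between spectra, as in the <comment>foo</comment> case.
    body = (
        b"<spectrumList>"
        + spectra[19]
        + b"<comment>foo</comment>"
        + spectra[20]
        + b"</spectrumList>"
    )
    entries = "".join(
        '<offset scanNumber="%d" byteOffset="%d" length="%d" />'
        % (scan, body.index(chunk), len(chunk))
        for scan, chunk in spectra.items()
    )
    index_xml = '<index name="spectrum">' + entries + "</index>"
    return io.BytesIO(body), parse_index(index_xml), spectra


stream, index, spectra = build_demo()
block = read_spectrum(stream, index, 19)
```

With a length in each entry, the reader gets exactly the indexed element. Reading from one offset up to the next would instead pull in the intervening <comment> element, which is the ambiguity raised above.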