Re: [Psidev-ms-dev] index style in mzML (was misnamed subject about compressionType)

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Brian,

I'm glad to hear discussion is going on.  We know the index style has to 
change if multi-run support is kept, so it ought to be done properly.

When this was last discussed on this list (right after I subscribed), I 
thought the same as you: that a hierarchical index would work best for a 
multi-run file.  This is what I suggested in the earlier list discussion 
about the index:

<source name="someSourceName">
	<spectrum scan="15">
		...
	</spectrum>
</source>
<index>
	<indexedSource name="someSourceName" offset="0">
		<indexedSpectrum scan="15" offset="33">
			...
		</indexedSpectrum>
	</indexedSource>
</index>

It was exactly the same structure as yours, obviously more verbose than 
necessary.  I have changed my mind about the structure since then 
though. :)  I do not think the index should have a separate XML element 
for each offset.  That actually makes it HARDER to parse.  There is no 
question that a space delimited list is easier and faster to parse than 
fully structured XML.  You may be right about some XML parser libs not 
appreciating several kilobytes of characters for a single attribute.  
That could be solved by the following style of index:

<index run_id="runA" size="15">
	<scanNumberList>2 3 4 6 7 8 10 11 12 14 15 16</scanNumberList>
	<offsetList>42 142 242 342 442 542 642 742 842 942 1042 1142</offsetList>
</index>
<index run_id="runB" size="15">
	<scanNumberList>2 3 4 6 7 8 10 11 12</scanNumberList>
	<offsetList>1242 1342 1442 1542 1642 1742 1842 1942 2042 2142 2242 2342</offsetList>
</index>

Even if the rest of the parsing is done with an XML lib like expat or 
xerces, simple delimiter/token parsing (like RAMP uses) is the best way 
to read this style of index, and nothing could be faster (without going 
binary).  As we both agree, XML purists dislike any of these index 
styles, so we needn't try to please. :)  As I already mentioned, once 
the scan numbers are in an array (which is by definition sorted since 
scan numbers are guaranteed to be in ascending order), a binary search 
can find the index into the array for a queried scan number.  
Alternatively, the query could be done in a streaming fashion, without 
storing the index in memory, by incrementing a counter while tokenizing 
the scan number list, and then using that counter to know how many 
tokens to advance in the offset list.  But I don't see the point of not 
reading the entire index whenever an mzML file is opened (assuming the 
index exists and it's trusted to be correct), so I would opt for the 
binary search idea.

If we wanted to be even faster, save space, and reuse the base64 
functions, both lists could be binary and base64 encoded.  I don't think 
that binary encoding would be terribly useful here, but it's an option.  
My main idea is to get away from the overhead of having each offset be 
wrapped by an XML element.  It all depends how much you adhere to the 
motto: "If you're gonna go, go all out."

-Matt

Brian Pratt wrote:
> Matt -
>
> Thanks for the info.  Hopefully it shows up in the schema comments.
>
> Josh and I were kicking around the index idea off list (by accident, the
> reply-to default tripped us up...).  I agree that my initial idea is more
> verbose than it should be.  I worry though that those lists you propose
> could get very long very fast and become troublesome to some XML parser
> libs.  Josh also suggested a more structured approach, here's my tweak of
> that:  (same example as before, but with runs named "Bob" and "fizzle" to
> avoid implying any structured name conventions)
>
> <index run_id="Bob">
> ...
>     <offset scan="1041" 4212696/>
>     <offset scan="1042" 4218791/>
> </index>
> <index run_id="fizzle">  
>     <offset scan="1" 4221806/>
>     <offset scan="2" 4227580/>
>     <offset scan="3" 4231174/>
> ...
> </index> 
>
> Not nearly as tight as yours, but easier to write an XML handler for it, and
> less likely to upset XML purists (who are already outraged by the idea of an
> index, so maybe that's not a problem...).  We could use smaller names, too,
> "fpos" for "offset", etc.
>
> Brian
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Matt
> Chambers
> Sent: Friday, August 03, 2007 7:44 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] compressionType (RE: mzML 0.93 ready for first
> review)
>
> Brian Pratt wrote:
>   
>>  Oh, and to revisit the original question, a snippet of a multi-run  
>> index would look like something like this, I think, assuming the  
>> runList contained a couple of runs with id="runA" and id="runB"
>>  respectively:
>>
>>  <offset run_id="runA" spectrum_id="1041">4212696</offset><offset
>>  run_id="runA" spectrum_id="1042">4218791</offset> <offset  
>> run_id="runB" spectrum_id="1">4221806</offset> <offset run_id="runB"
>>  spectrum_id="2">4227580</offset> <offset run_id="runB"
>>  spectrum_id="3">4231174</offset>
>>
>>     
> Wouldn't a simpler, faster, and easier index be something like:
> <index run_id="runA" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14
> 15 16 18 19 20" offsetList="42 142 242 342 442 542 642 742 842 942 1042
> 1142 1242 1342 1442" />
> <index run_id="runB" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14
> 15 16 18 19 20" offsetList="1542 1642 1742 1842 1942 2042 2142 2242 2342
> 2442 2542 2642 2742 2842 2942" />
>
> These space delimited lists seem much easier to parse and deal with, unless
> there is some attribute length limit I'm ignorant of?  Certainly human
> readability of the index is not an issue because a human can just do a find
> on the scan number and jump straight to it (even with, heaven forbid it,
> Notepad.exe).  Adherence to XML principles needn't be a concern because the
> whole concept of the index is outside the realm of pure XML principles. :)
> I like the idea of reading in two sorted arrays, doing a binary search on
> the first one to find the index of the desired scan number, and then using
> that index to look up the offset in the second sorted array.  I feel the
> need... the need for speed.
>