Re: [Psidev-ms-dev] Indexing in mzML

> On 6/19/07, Matthew Chambers <mat...@va...> wrote:
> > On a related note, is there any guarantee in mzML (or mzData
> > for that matter) that the spectrum IDs or scan numbers are given in
> > ascending order?
> 
> This is a good question.  I haven't read the spec closely, but if the
> answer isn't in there, it ought to be.  Along those lines, are IDs and
> scan numbers even guaranteed to be unique within a file?  (I hope the
> answer will be "yes".)
> 

I think IDs are definitely unique within a file, and scan numbers will
almost always be unique within a spectrum source (although a single mzML
file can contain multiple spectrum sources).  In our software, I use
"SourceFileName.ScanNum.ChargeState" as a unique identifier so that spectra
from multiple sources can be loaded into the same data structure.
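
A minimal Python sketch of such a composite key, for illustration only.
The function names and the shape of the input are assumptions, not taken
from any existing software:

```python
def spectrum_key(source_file_name: str, scan_num: int, charge_state: int) -> str:
    """Build a 'SourceFileName.ScanNum.ChargeState' identifier."""
    return f"{source_file_name}.{scan_num}.{charge_state}"


def merge_sources(sources):
    """Load spectra from several sources into one dict keyed by composite ID.

    `sources` (hypothetical layout) maps a source file name to an iterable
    of (scan_num, charge_state, spectrum) tuples.
    """
    merged = {}
    for source_file_name, spectra in sources.items():
        for scan_num, charge_state, spectrum in spectra:
            key = spectrum_key(source_file_name, scan_num, charge_state)
            if key in merged:
                # Scan numbers are only "almost always" unique per source,
                # so fail loudly rather than silently overwrite.
                raise ValueError(f"duplicate spectrum key: {key}")
            merged[key] = spectrum
    return merged
```

Two sources that both contain scan 1 then end up under different keys, so
neither overwrites the other.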

> > But one thing I've missed a lot in mzData (even though I think it's a
> > better format because of the flat spectra list) is an index to quickly
> > access a given scan number.
> 
> I'm torn on this myself.  On the one hand, adding *any* redundant
> information seems to go against the basic idea of just representing
> the experimental data.  On the other hand, it *would* make some
> operations more convenient.  Random access reads become easier,
> altering the file becomes harder, and something like XSLT
> transformations probably become impossible (I'm not an XSLT fan
> anyway).

I'm not sure an index counts as redundant.  It's more like metadata (after
all, I don't think you could call the index at the end of a textbook
redundant!).
We already store plenty of metadata, because otherwise we'd have real
trouble reinterpreting the data's meaning.  In fact, XML is by definition
loaded with metadata. :)  Random access reads wouldn't just become easier -
ease of coding is not the issue to me.  I just want random access to not be
a computational nightmare due to an excess of XML parsing.
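
To make the cost argument concrete, here is a rough Python sketch of what
a byte-offset index buys.  It assumes each spectrum is a `<spectrum ...>`
element carrying an `id` attribute; the regular expression and both
function names are illustrative assumptions, not part of any spec.  With
the offsets in hand, fetching one spectrum is a seek plus a short read
instead of a parse of everything before it:

```python
import re

# Assumed element/attribute names; "<spectrumList" is not matched because
# the pattern requires whitespace right after "<spectrum".
SPECTRUM_OPEN = re.compile(rb'<spectrum\s[^>]*?\bid="([^"]*)"')
SPECTRUM_CLOSE = b"</spectrum>"


def build_offset_index(path):
    """Map each spectrum id to the byte offset of its opening tag.

    Reads the whole file at once for simplicity; a real indexer would
    stream it.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    return {m.group(1).decode(): m.start() for m in SPECTRUM_OPEN.finditer(data)}


def read_spectrum(path, offsets, spectrum_id, chunk_size=4096):
    """Seek straight to one spectrum and return its raw XML text."""
    with open(path, "rb") as fh:
        fh.seek(offsets[spectrum_id])
        buf = b""
        while SPECTRUM_CLOSE not in buf:
            chunk = fh.read(chunk_size)
            if not chunk:
                raise ValueError(f"unterminated spectrum: {spectrum_id}")
            buf += chunk
    end = buf.index(SPECTRUM_CLOSE) + len(SPECTRUM_CLOSE)
    return buf[:end].decode()
```

Only the returned fragment then needs to go through an XML parser.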

> One point to consider: do we think that all of the various producers
> (and transformers) of these files will be capable of producing correct
> (bug-free) indices?  If they're not *always* correct, or if you have
> to validate the file before you trust it, you're basically having to
> recreate the index anyway.  If that's so, maybe it should just be left
> out of the mzML file altogether.
> 
> It looks like indices are currently stored in a separate, optional
> file.  This seems like a good compromise.

From what I've read of the minutes, and I may not have gotten the full
picture on this discussion, the issue of indexes is set aside at the moment.
I have to say that I dislike the idea of having a separate index on
principle, but if it would really make consistency with the main file more
feasible, then it's an acceptable compromise.  For example, the optional
index file could store a SHA-1 hash of the main mzML file.  Software could
test whether to trust the optional index by whether its stored hash matches
a new hash on the main file.  I don't think it's unreasonable to
(re)generate the index whenever the main file is altered (and of course
store the new hash).
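
The hash check could look something like the following Python sketch.
The JSON layout of the index file, its `sha1` and `offsets` fields, and
the function names are all assumptions made for illustration; only
`hashlib.sha1` itself is standard:

```python
import hashlib
import json
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large mzML files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_index(mzml_path, index_path, offsets):
    """(Re)generate the side-car index together with the main file's hash."""
    payload = {"sha1": sha1_of_file(mzml_path), "offsets": offsets}
    Path(index_path).write_text(json.dumps(payload))


def load_trusted_index(mzml_path, index_path):
    """Return the stored offsets if the hash still matches, else None."""
    try:
        payload = json.loads(Path(index_path).read_text())
    except (OSError, ValueError):
        return None  # missing or unreadable index: caller must rebuild
    if payload.get("sha1") != sha1_of_file(mzml_path):
        return None  # main file changed since the index was written
    return payload["offsets"]
```

A reader would call `load_trusted_index` first and fall back to rebuilding
the index (and calling `write_index`) whenever it returns `None`.  Note
that the check itself reads the whole main file once, so it costs one
sequential pass but no XML parsing.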

The optional index file method also means that the software doesn't have to
skip to the end of the file to read the index.

Unfortunately, indexing is just a necessary evil in a world of large
datasets.

-Matt Chambers