Re: [Psidev-ms-dev] Indexing in mzML

On 6/19/07, Matthew Chambers <mat...@va...> wrote:
> On a related note, is there any guarantee in mzML (or mzData
> for that matter) that the spectrum IDs or scan numbers are given in
> ascending order?

This is a good question.  I haven't read the spec closely, but if the
answer isn't in there, it ought to be.  Along those lines, are IDs and
scan numbers even guaranteed to be unique within a file?  (I hope the
answer will be "yes".)
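
For what it's worth, both properties are cheap to check in a single
streaming pass.  The following is only a rough sketch, not anything
taken from the spec: it assumes a flat list of elements named
"spectrum" carrying an integer "id" attribute and no XML namespace,
and the function name check_ids is made up for illustration.

```python
import xml.etree.ElementTree as ET

def check_ids(source, tag="spectrum", attr="id"):
    """Return (unique, ascending) for the integer `attr` values on `tag` elements.

    `tag` and `attr` are assumptions about the schema, not spec names.
    """
    seen = set()
    unique = True
    ascending = True
    previous = None
    # Stream the document so large files never have to fit in memory.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        value = int(elem.get(attr))
        if value in seen:
            unique = False
        seen.add(value)
        if previous is not None and value <= previous:
            ascending = False
        previous = value
        elem.clear()  # release the element's children once it has been checked
    return unique, ascending
```

A reader could run this once per file and refuse (or fall back to a
slower path) when either flag comes back False.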

> But one thing I've missed a lot in mzData (even though I think it's a
> better format because of the flat spectra list) is an index to quickly
> access a given scan number.

I'm torn on this myself.  On the one hand, adding *any* redundant
information seems to go against the basic idea of just representing
the experimental data.  On the other hand, it *would* make some
operations more convenient.  Random access reads become easier,
altering the file becomes harder, and something like XSLT
transformations probably become impossible (I'm not an XSLT fan
anyway).

One point to consider: do we think that all of the various producers
(and transformers) of these files will be capable of producing correct
(bug-free) indices?  If they're not *always* correct, or if you have
to validate the file before you trust it, you're basically having to
recreate the index anyway.  If that's so, maybe it should just be left
out of the mzML file altogether.
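
To make that concrete, here is a minimal sketch of what "validating"
an index amounts to.  It assumes the index is a mapping from spectrum
id to the byte offset of the opening tag, and that spectra look like
<spectrum id="...">; both are assumptions for illustration, and
build_index and index_is_valid are hypothetical names.  Note that the
validator has to do a full scan and rebuild the index itself.

```python
import re

# Assumed layout: each spectrum opens with <spectrum ... id="...">.
# The \s after "spectrum" keeps tags such as <spectrumList> from matching.
SPECTRUM_START = re.compile(rb'<spectrum\s[^>]*?\bid="([^"]*)"')

def build_index(data: bytes) -> dict:
    """Scan the whole file and map each spectrum id to its byte offset."""
    return {m.group(1).decode(): m.start() for m in SPECTRUM_START.finditer(data)}

def index_is_valid(data: bytes, stored: dict) -> bool:
    """Check a stored index by rebuilding it from scratch and comparing."""
    return build_index(data) == stored
```

If readers end up calling something like index_is_valid before
trusting an index, they have already paid for build_index, which is
the cost the stored index was meant to avoid.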

It looks like indices are currently stored in a separate, optional
file.  This seems like a good compromise.

It's worth noting that these arguments also apply to the other
redundant information in the file (counts and checksums, for example).
I wouldn't mind seeing those also moved to a separate file.  If
they're left in, maybe something should be said about what's supposed
to happen when the redundant information is inconsistent.

Mike