Re: [Psidev-ms-dev] Indexing in mzML

Hi everyone, thank you for the good discussion.  Here is what I take away
from it (colored by my understanding of the prevailing opinion on the
various topics):

- Separate index/metadata files will be avoided
- The mzML index will be *optional* as a wrapper schema (with actual
index at the end of the file) as currently in mzXML
- The validator will enforce that scan numbers are in ascending order,
but not necessarily without gaps
- The validator will enforce that scan numbers and identifiers must be
unique within a run (but there could be multiple runs in a file)
- Regarding *always* correct indexes, users of mzXML have been using
indexes for years with no reports of problems that I'm aware of.  Obviously
if the file is altered in any way, the index should be regenerated.
There are (for mzXML) / will be (for mzML) index checkers to make sure
all is well along with reindexing functionality if the index is bad.
- It should be a requirement for any reading software that uses the
index (all readers are required to be tolerant of the presence of the
wrapper schema index, but are not required to use it) to do some basic
checking that the result is correct.  E.g. if scan number 17500 is
desired and the index is used to jump to that location, it is
straightforward and necessary to ensure that the first tag read is
indeed <spectrum scan_number="17500">. If it is not, the software is
free to do anything except continue as if it didn't know better (e.g.,
stop with error, revert to sequential read, or try to regenerate the
index and retry).
- While index/data mismatch is a potential source of problems, it has
been our experience that such problems are rare and the benefits huge.
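
To make the reader-side check above concrete, here is a rough Python
sketch.  It is only an illustration, not a reference implementation: the
helper names (scan_at_offset, rebuild_index, locate_spectrum) are made up,
the tag and attribute follow the <spectrum scan_number="17500"> example
above rather than any finalized schema, and the index is assumed to be
already loaded as a plain mapping from scan number to byte offset.

```python
import io
import re

# Assumption: spectra open with <spectrum ... scan_number="N">, as in the
# example in this message; this is not taken from a finalized mzML schema.
SPECTRUM_TAG = re.compile(rb'<spectrum\b[^>]*\bscan_number="(\d+)"')


def scan_at_offset(stream, offset, probe=512):
    """Return the scan number of the spectrum tag starting at offset, or None."""
    stream.seek(offset)
    match = SPECTRUM_TAG.match(stream.read(probe))
    return int(match.group(1)) if match else None


def rebuild_index(stream):
    """Read the stream sequentially and map scan number -> byte offset."""
    stream.seek(0)
    data = stream.read()  # acceptable for a sketch; a real reader would stream
    return {int(m.group(1)): m.start() for m in SPECTRUM_TAG.finditer(data)}


def locate_spectrum(stream, index, scan_number):
    """Trust the index only if the tag found at its offset is the right one."""
    offset = index.get(scan_number)
    if offset is not None and scan_at_offset(stream, offset) == scan_number:
        return offset
    # Index missing or wrong: revert to a sequential read and retry.
    fresh = rebuild_index(stream)
    if scan_number not in fresh:
        raise KeyError(f"scan {scan_number} not found")
    return fresh[scan_number]


# Usage: a deliberately stale index still resolves to the right spectrum.
demo = io.BytesIO(
    b'<run><spectrum scan_number="1">a</spectrum>'
    b'<spectrum scan_number="3">b</spectrum></run>'
)
stale_index = {3: 0}
found = locate_spectrum(demo, stale_index, 3)
```

This sketch takes the "revert to sequential read" option; stopping with an
error instead would be just as acceptable under the rule above.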

Regards,
Eric


> -----Original Message-----
> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Matthew Chambers
> Sent: Tuesday, June 19, 2007 8:46 AM
> To: 'Mike Coleman'
> Cc: psi...@li...
> Subject: Re: [Psidev-ms-dev] Indexing in mzML
>
> > On 6/19/07, Matthew Chambers <mat...@va...> wrote:
> > > On a related note, is there any guarantee in mzML (or mzData
> > > for that matter) that the spectrum IDs or scan numbers are given in
> > > ascending order?
> >
> > This is a good question.  I haven't read the spec closely, but if the
> > answer isn't in there, it ought to be.  Along those lines, are IDs and
> > scan numbers even guaranteed to be unique within a file?  (I hope the
> > answer will be "yes".)
> >
>
> I think IDs are definitely unique within a file, and scan numbers will
> almost always be unique within a spectra source (multiple spectra sources
> can be in a single mzML file though).  In our software, I use
> "SourceFileName.ScanNum.ChargeState" as a unique identifier so that spectra
> from multiple sources can be loaded into the same data structure.
>
> > > But one thing I've missed a lot in mzData (even though I think it's a
> > > better format because of the flat spectra list) is an index to quickly
> > > access a given scan number.
> >
> > I'm torn on this myself.  On the one hand, adding *any* redundant
> > information seems to go against the basic idea of just representing
> > the experimental data.  On the other hand, it *would* make some
> > operations more convenient.  Random access reads  become easier,
> > altering the file becomes harder, and something like XSLT
> > transformations probably become impossible (I'm not an XSLT fan
> > anyway).
>
> I'm not sure an index counts as redundant.  It's more like metadata (i.e.
> I don't think you could call the index at the end of a textbook redundant!).
> We already store plenty of metadata, because otherwise we'd have real
> trouble reinterpreting the data's meaning.  In fact, XML is by definition
> loaded with metadata. :)  Random access reads wouldn't just become easier -
> ease of coding is not the issue to me.  I just want random access to not be
> a computational nightmare due to an excess of XML parsing.
>
>
> > One point to consider: do we think that all of the various producers
> > (and transformers) of these files will be capable of producing correct
> > (bug-free) indices?  If they're not *always* correct, or if you have
> > to validate the file before you trust it, you're basically having to
> > recreate the index anyway.  If that's so, maybe it should just be left
> > out of the mzML file altogether.
> >
> > It looks like indices are currently stored in a separate, optional
> > file.  This seems like a good compromise.
>
> From what I've read of the minutes, and I may not have gotten the full
> picture on this discussion, the issue of indexes is set aside at the moment.
> I have to say that I dislike the idea of having a separate index on
> principle, but if it would really make consistency with the main file more
> feasible, then it's an acceptable compromise.  For example, the optional
> index file could store a SHA-1 hash of the main mzML file.  Software could
> test whether to trust the optional index by whether its stored hash matches
> a new hash on the main file.  I don't think it's unreasonable to
> (re)generate the index whenever the main file is altered (and of course
> store the new hash).
>
> The optional index file method also means that the software doesn't have to
> skip to the end of the file to read the index.
>
> Unfortunately, indexing is just a necessary evil in a world of large
> datasets.
>
> -Matt Chambers
>