From: Eric D. <ede...@sy...> - 2007-06-19 18:01:14
Hi everyone,

Thank you for the good discussion. Here is what I take away from it (colored by my understanding of the prevailing opinion on the various topics):

- Separate index/metadata files will be avoided.

- The mzML index will be *optional*, provided as a wrapper schema (with the actual index at the end of the file), as is currently done in mzXML.

- The validator will enforce that scan numbers are in ascending order, but not necessarily without gaps.

- The validator will enforce that scan numbers and identifiers are unique within a run (but there could be multiple runs in a file).

- Regarding *always* correct indexes: users of mzXML have been using indexes for years with no reports of problems that I'm aware of. Obviously, if the file is altered in any way, the index should be regenerated. There are (for mzXML) and will be (for mzML) index checkers to make sure all is well, along with reindexing functionality if the index is bad.

- All readers are required to tolerate the presence of the wrapper schema index, but are not required to use it. Any reading software that does use the index should be required to do some basic checking that the result is correct. E.g., if scan number 17500 is desired and the index is used to jump to that location, it is straightforward and necessary to ensure that the first tag read is indeed <spectrum scan_number="17500">. If it is not, the software is free to do anything except continue as if it didn't know better (e.g., stop with an error, revert to a sequential read, or try to regenerate the index and retry).

- While index/data mismatch is a potential source of problems, it has been our experience that problems are rare and the benefits huge.

Regards,
Eric

> -----Original Message-----
> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Matthew Chambers
> Sent: Tuesday, June 19, 2007 8:46 AM
> To: 'Mike Coleman'
> Cc: psi...@li...
> Subject: Re: [Psidev-ms-dev] Indexing in mzML
>
> > On 6/19/07, Matthew Chambers <mat...@va...> wrote:
> > > On a related note, is there any guarantee in mzML (or mzData
> > > for that matter) that the spectrum IDs or scan numbers are given in
> > > ascending order?
> >
> > This is a good question. I haven't read the spec closely, but if the
> > answer isn't in there, it ought to be. Along those lines, are IDs and
> > scan numbers even guaranteed to be unique within a file? (I hope the
> > answer will be "yes".)
>
> I think IDs are definitely unique within a file, and scan numbers will
> almost always be unique within a spectra source (multiple spectra sources
> can be in a single mzML file though). In our software, I use
> "SourceFileName.ScanNum.ChargeState" as a unique identifier so that spectra
> from multiple sources can be loaded into the same data structure.
>
> > > But one thing I've missed a lot in mzData (even though I think it's a
> > > better format because of the flat spectra list) is an index to quickly
> > > access a given scan number.
> >
> > I'm torn on this myself. On the one hand, adding *any* redundant
> > information seems to go against the basic idea of just representing
> > the experimental data. On the other hand, it *would* make some
> > operations more convenient. Random access reads become easier,
> > altering the file becomes harder, and something like XSLT
> > transformations probably become impossible (I'm not an XSLT fan
> > anyway).
>
> I'm not sure an index counts as redundant. It's more like metadata (i.e. I
> don't think you could call the index at the end of a textbook redundant!).
> We already store plenty of metadata, because otherwise we'd have real
> trouble reinterpreting the data's meaning. In fact, XML is by definition
> loaded with metadata. :) Random access reads wouldn't just become easier -
> ease of coding is not the issue to me. I just want random access to not be
> a computational nightmare due to an excess of XML parsing.
>
> > One point to consider: do we think that all of the various producers
> > (and transformers) of these files will be capable of producing correct
> > (bug-free) indices? If they're not *always* correct, or if you have
> > to validate the file before you trust it, you're basically having to
> > recreate the index anyway. If that's so, maybe it should just be left
> > out of the mzML file altogether.
> >
> > It looks like indices are currently stored in a separate, optional
> > file. This seems like a good compromise.
>
> From what I've read of the minutes, and I may not have gotten the full
> picture on this discussion, the issue of indexes is set aside at the moment.
> I have to say that I dislike the idea of having a separate index on
> principle, but if it would really make consistency with the main file more
> feasible, then it's an acceptable compromise. For example, the optional
> index file could store a SHA-1 hash of the main mzML file. Software could
> test whether to trust the optional index by whether its stored hash matches
> a new hash on the main file. I don't think it's unreasonable to
> (re)generate the index whenever the main file is altered (and of course
> store the new hash).
>
> The optional index file method also means that the software doesn't have to
> skip to the end of the file to read the index.
>
> Unfortunately, indexing is just a necessary evil in a world of large
> datasets.
>
> -Matt Chambers
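
The reader-side check described in the summary (jump to the indexed offset, confirm that the first tag is the wanted spectrum, otherwise fall back) could look roughly like the sketch below. It is only an illustration under stated assumptions: the attribute name `scan_number` is copied from the example in the message rather than from any published mzML schema, the index is modelled as a plain scan-number-to-byte-offset mapping, the file is assumed to be already loaded into a bytes buffer, and the function names are hypothetical.

```python
import re

# Opening tag as written in the message's example. The attribute name is an
# assumption taken from that example, not from a published schema.
SPECTRUM_TAG = re.compile(rb'<spectrum\b[^>]*\bscan_number="(\d+)"')


def tag_matches(data: bytes, offset: int, scan_number: int) -> bool:
    """Return True if the tag starting exactly at `offset` is the wanted scan."""
    if offset < 0 or offset >= len(data):
        return False
    m = SPECTRUM_TAG.match(data, offset)  # anchored at `offset`
    return m is not None and int(m.group(1)) == scan_number


def sequential_find(data: bytes, scan_number: int) -> int:
    """Fallback: walk every spectrum tag in the buffer until the scan is found."""
    for m in SPECTRUM_TAG.finditer(data):
        if int(m.group(1)) == scan_number:
            return m.start()
    raise LookupError(f"scan {scan_number} not found")


def locate_scan(data: bytes, index: dict, scan_number: int) -> int:
    """Trust the index only if it checks out; otherwise read sequentially."""
    offset = index.get(scan_number)
    if offset is not None and tag_matches(data, offset, scan_number):
        return offset
    return sequential_find(data, scan_number)
```

Reverting to a sequential read is only one of the acceptable reactions listed in the summary; stopping with an error or regenerating the index would fit the same structure.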
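
Matt's suggestion in the quoted message, storing a SHA-1 hash of the main file alongside a separate index and trusting the index only when the hash still matches, can be sketched as follows. This illustrates that proposal only, not the approach favoured in the summary (which avoids separate index files). The function names and the assumption that the stored hash is a hex string are hypothetical. Hashing requires one full read of the data file, though not any XML parsing.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_is_trustworthy(data_path: str, stored_hash: str) -> bool:
    """Compare the hash recorded with the index against a fresh hash."""
    return sha1_of_file(data_path) == stored_hash.strip().lower()
```

If the comparison fails, the index would be regenerated and the new hash stored, as the message proposes.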