From: Matthew C. <mat...@va...> - 2007-06-19 18:27:22
> Hi everyone, thank you for the good discussion, here is what I take away
> from this discussion (colored with my understanding of the prevailing
> opinion on the various topics):
>
> - Separate index/metadata files will be avoided
> - The mzML index will be *optional* as a wrapper schema (with actual
> index at the end of the file) as currently in mzXML

This works either way for me. It makes it tricky to get to the index,
but it's not any trickier than managing two files, especially if one is
optional. However, unless I am missing something, is it not much harder
to use a hash to check whether the index is valid (i.e. that the main
file has not been altered) if the index is included in the main file?
Even if the hash is written as the last thing, that change alone would
cause the next hash to be different, would it not?

> - The validator will enforce that scan numbers are in ascending order,
> but not necessarily without gaps
> - The validator will enforce that scan numbers and identifiers must be
> unique within a run (but there could be multiple runs in a file)

I'm confused about the difference between identifiers and scan numbers.
Since an mzML file can have more than one spectrum source (e.g. multiple
RAW files), scan numbers could only be unique within a run, as you say,
but I would expect that the "SpectrumID" identifier, if it is different
from the scan number, should be unique across the whole file. What is
the reasoning behind the SpectrumID identifier being unique only within
a run, or am I misunderstanding? What is the purpose of having a
separate SpectrumID identifier anyway?

> - Regarding *always* correct indexes, users of mzXML have been using
> indexes for years with no reports of problems that I'm aware. Obviously
> if the file is altered in any way, the index should be regenerated.
> There are (for mzXML) / will be (for mzML) index checkers to make sure
> all is well along with reindexing functionality if the index is bad.
> - It should be a requirement for any reading software that uses the
> index (all readers are required to be tolerant of the presence of the
> wrapper schema index, but are not required to use it) to do some basic
> checking that the result is correct. E.g. if scan number 17500 is
> desired and the index is used to jump to that location, it is
> straightforward and necessary to ensure that the first tag read is
> indeed <spectrum scan_number="17500">. If it is not, the software is
> free to do anything except continue as if it didn't know better (e.g.,
> stop with error, revert to sequential read, or try to regenerate the
> index and retry).

This is complicated by multiple sources being in a single mzML file.
Will the index follow the same structure as the main section so that
when you are looking for a scan number in some source, you first
traverse into the source, and then look for the indexed spectrum to get
its offset?

  <source name="someSourceName">
    <spectrum scan="15">
      ...
    </spectrum>
  </source>
  <index>
    <indexedSource name="someSourceName" offset="0">
      <indexedSpectrum scan="15" offset="33">
        ...
      </indexedSpectrum>
    </indexedSource>
  </index>

> - While index/data mismatch is a potential source of problem, it has
> been our experience that problems are rare and the benefits huge.

Agreed.

Regards,
Matt Chambers
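
To make the quoted reader requirement concrete, here is a minimal Python sketch of the check: jump to the indexed offset, confirm that the first tag there is the expected spectrum, and revert to a sequential read if it is not. The function names, the flat {scan number: byte offset} index, and the use of the scan_number attribute from the quoted example are all assumptions for illustration, not anything taken from a schema. A per-source index would need an extra level of lookup.

```python
import io
import re

# Assumed tag form, taken from the quoted example <spectrum scan_number="17500">.
SPECTRUM_TAG = re.compile(rb'<spectrum\s+scan_number="(\d+)"')


def scan_at_offset(stream, offset):
    """Return the scan number of the <spectrum> tag starting exactly at offset, or None."""
    stream.seek(offset)
    head = stream.read(256)  # enough bytes to hold the opening tag
    match = SPECTRUM_TAG.match(head)
    return int(match.group(1)) if match else None


def find_sequentially(stream, scan_number):
    """Fallback: search the whole stream and return the tag's byte offset, or None."""
    stream.seek(0)
    data = stream.read()  # simple but memory-hungry; a real reader would stream
    for match in SPECTRUM_TAG.finditer(data):
        if int(match.group(1)) == scan_number:
            return match.start()
    return None


def locate_spectrum(stream, index, scan_number):
    """Trust the index only if the tag at the indexed offset is the one requested."""
    offset = index.get(scan_number)
    if offset is not None and scan_at_offset(stream, offset) == scan_number:
        return offset
    # Index missing or stale: revert to a sequential read rather than carry on.
    offset = find_sequentially(stream, scan_number)
    if offset is None:
        raise LookupError(f"scan {scan_number} not found")
    return offset


# Usage: a stale index entry is detected and the correct offset is recovered.
demo = b'<run><spectrum scan_number="1">a</spectrum><spectrum scan_number="2">b</spectrum></run>'
stale_index = {1: 5, 2: 3}  # the offset recorded for scan 2 is wrong
recovered = locate_spectrum(io.BytesIO(demo), stale_index, 2)
print(recovered == demo.index(b'<spectrum scan_number="2"'))
```

Stopping with an error or regenerating the index would satisfy the quoted rule just as well. The sequential fallback is only one of the three options it lists.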
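
On the hash question raised above, one possible way around the self-reference problem is to hash only the bytes that precede the embedded index, so that writing the index and its hash does not change the hashed region. The Python sketch below illustrates the idea under stated assumptions: the <index> marker is borrowed from the example in this message, while the <sha1> element and all function names are hypothetical. It is not a description of what the mzML schema will do.

```python
import hashlib

INDEX_OPEN = b"<index>"  # assumed marker for the start of the embedded index
HASH_OPEN, HASH_CLOSE = b"<sha1>", b"</sha1>"  # hypothetical element for the stored digest


def append_index(blob: bytes, index_xml: bytes) -> bytes:
    """Append an index block that records the SHA-1 of the data that precedes it."""
    digest = hashlib.sha1(blob).hexdigest().encode()
    return blob + INDEX_OPEN + index_xml + HASH_OPEN + digest + HASH_CLOSE + b"</index>"


def index_is_current(blob: bytes) -> bool:
    """True if the stored digest still matches the bytes before the index."""
    cut = blob.rfind(INDEX_OPEN)
    if cut == -1:
        return False  # no embedded index at all
    start = blob.find(HASH_OPEN, cut)
    end = blob.find(HASH_CLOSE, cut)
    if start == -1 or end == -1:
        return False  # index present but no stored digest
    stored = blob[start + len(HASH_OPEN):end]
    # Only the prefix is hashed, so the index and digest never affect the result.
    return stored == hashlib.sha1(blob[:cut]).hexdigest().encode()


# Usage: the check passes for an untouched file and fails once the data is edited.
data = b'<run><spectrum scan="15">...</spectrum></run>'
indexed = append_index(data, b'<indexedSpectrum scan="15" offset="5"/>')
print(index_is_current(indexed))
print(index_is_current(indexed.replace(b">...<", b">xxx<")))
```

The trade-off is that such a digest covers only the data, not the index itself, so a corrupted index would still have to be caught by the reader-side tag check.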