Re: [Psidev-ms-dev] Indexing in mzML

> Hi everyone, thank you for the good discussion, here is what I take away
> from this discussion (colored with my understanding of the prevailing
> opinion on the various topics):
> 
> - Separate index/metadata files will be avoided
> - The mzML index will be *optional* as a wrapper schema (with actual
> index at the end of the file) as currently in mzXML

This works either way for me.  It makes it tricky to get to the index, but
it's not any trickier than managing two files, especially if one is
optional.  However, unless I am missing something, is it not much harder to
use a hash to check if the index is valid (i.e. that the main file has not
been altered) if the index is included in the main file?  Even if the hash
is written as the last thing, that change alone would cause the next hash to
be different, would it not?
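
One possible way around this, as a rough sketch only: hash just the bytes
that come before the element holding the hash, so that writing the hash
does not change the value being checked.  The <checksum> tag name and the
helper names below are made up for illustration and are not part of any
schema; SHA-1 is likewise just an assumption.

```python
import hashlib

# Hypothetical tag that would hold the stored hash near the end of the file.
CHECKSUM_OPEN = b"<checksum>"
CHECKSUM_CLOSE = b"</checksum>"


def hash_before_checksum(data: bytes) -> str:
    """SHA-1 of everything before the last checksum tag (all data if absent)."""
    end = data.rfind(CHECKSUM_OPEN)
    if end == -1:
        end = len(data)
    return hashlib.sha1(data[:end]).hexdigest()


def append_checksum(data: bytes) -> bytes:
    """Return data with a checksum element appended as the last thing."""
    digest = hash_before_checksum(data).encode("ascii")
    return data + CHECKSUM_OPEN + digest + CHECKSUM_CLOSE + b"\n"


def checksum_is_valid(data: bytes) -> bool:
    """Compare the stored hash with a fresh hash of the preceding bytes."""
    start = data.rfind(CHECKSUM_OPEN)
    if start == -1:
        return False
    stored_start = start + len(CHECKSUM_OPEN)
    stored_end = data.find(CHECKSUM_CLOSE, stored_start)
    if stored_end == -1:
        return False
    stored = data[stored_start:stored_end].decode("ascii", errors="replace")
    return stored == hash_before_checksum(data)
```

With this arrangement the index can sit in the main file and still be
covered by the hash, because the hash element itself is excluded from the
bytes being hashed.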

> - The validator will enforce that scan numbers are in ascending order,
> but not necessarily without gaps
> - The validator will enforce that scan numbers and identifiers must be
> unique within a run (but there could be multiple runs in a file)

I'm confused about the difference between identifiers and scan numbers.
Since an mzML file can have more than one spectrum source (e.g. multiple RAW
files), scan numbers could only be unique within a run, as you say, but I
would expect that the "SpectrumID" identifier, if it is different from the
scan number, should be unique to the whole file.  What is the reasoning
behind the SpectrumID identifier being unique only to a run, or am I
misunderstanding?  What is the purpose of having a separate SpectrumID
identifier anyway? 

> - Regarding *always* correct indexes, users of mzXML have been using
> indexes for years with no reports of problems that I'm aware of.  Obviously
> if the file is altered in any way, the index should be regenerated.
> There are (for mzXML) / will be (for mzML) index checkers to make sure
> all is well along with reindexing functionality if the index is bad.
> - It should be a requirement for any reading software that uses the
> index (all readers are required to be tolerant of the presence of the
> wrapper schema index, but are not required to use it) to do some basic
> checking that the result is correct.  E.g. if scan number 17500 is
> desired and the index is used to jump to that location, it is
> straightforward and necessary to ensure that the first tag read is
> indeed <spectrum scan_number="17500">. If it is not, the software is
> free to do anything except continue as if it didn't know better (e.g.,
> stop with error, revert to sequential read, or try to regenerate the
> index and retry).
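
The check described above could look roughly like this.  It is a minimal
sketch that assumes the scan_number attribute from the example tag, a
plain dict mapping scan numbers to byte offsets, and an opening tag that
fits within the first 256 bytes; tag_matches, seek_to_spectrum and
StaleIndexError are hypothetical names.

```python
import io
import re


class StaleIndexError(Exception):
    """Raised when the index points at something other than the wanted tag."""


def tag_matches(stream, offset: int, scan_number: int) -> bool:
    """True if the tag starting exactly at `offset` is the expected spectrum."""
    stream.seek(offset)
    head = stream.read(256)  # assumed to be enough to cover the opening tag
    match = re.match(rb'<spectrum\b[^>]*\bscan_number="(\d+)"', head)
    return match is not None and int(match.group(1)) == scan_number


def seek_to_spectrum(stream, index: dict, scan_number: int) -> int:
    """Jump to a spectrum via the index, refusing to continue on a mismatch."""
    offset = index[scan_number]
    if not tag_matches(stream, offset, scan_number):
        # The caller decides what to do next: stop, read sequentially,
        # or regenerate the index and retry.
        raise StaleIndexError(f"index entry for scan {scan_number} is wrong")
    stream.seek(offset)
    return offset
```

The one thing the reader never does here is carry on silently after a
mismatch; what happens after the exception is left to the caller.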

This is complicated by multiple sources being in a single mzML file.  Will
the index follow the same structure as the main section so that when you are
looking for a scan number in some source, you first traverse into the
source, and then look for the indexed spectrum to get its offset?

<source name="someSourceName">
	<spectrum scan="15">
		...
	</spectrum>
</source>
<index>
	<indexedSource name="someSourceName" offset="0">
		<indexedSpectrum scan="15" offset="33">
			...
		</indexedSpectrum>
	</indexedSource>
</index>
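
If the index were nested like that, a reader could flatten it into a
lookup keyed on (source name, scan number).  This is only a sketch of the
idea: the element and attribute names are the made-up ones from my
example above, not anything from the actual schema, and build_lookup is a
hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Same shape as the illustrative index above, with the elided parts left out.
INDEX_XML = """<index>
  <indexedSource name="someSourceName" offset="0">
    <indexedSpectrum scan="15" offset="33"/>
  </indexedSource>
</index>"""


def build_lookup(index_xml: str) -> dict:
    """Map (source name, scan number) to the spectrum's byte offset."""
    lookup = {}
    root = ET.fromstring(index_xml)
    for source in root.findall("indexedSource"):
        for spectrum in source.findall("indexedSpectrum"):
            key = (source.get("name"), int(spectrum.get("scan")))
            lookup[key] = int(spectrum.get("offset"))
    return lookup


offsets = build_lookup(INDEX_XML)
print(offsets[("someSourceName", 15)])
```

Keying on the pair means the same scan number can appear in two sources
without one entry overwriting the other.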

> - While index/data mismatch is a potential source of problems, it has
> been our experience that problems are rare and the benefits huge.

Agreed.

Regards,
Matt Chambers