Hi Wilfred, some comments on some of your comments:
Wilfred H Tang wrote:
> * When m/z vs. intensity data is written out in profile mode, it is
> pretty common to see a LARGE majority of the intensities to be zero.
> Given the preponderance of zero intensities, a space-efficient way to
> write the data out would be to specify a point spacing in the m/z
> dimension and then write out a (m/z, intensity) pair only if the
> intensity is non-zero. (Call this method 1.) The alternative, less
> space-efficient way would be to write out all of the (m/z, intensity)
> data pairs even though most of them have zero intensities and hence
> are not all that interesting. (Call this method 2.) For method 1 to
> work well, there must be a way to specify a m/z point spacing. Is
> there a way to do this currently? Furthermore, the program reading in
> the mzML must understand that the m/z point spacing implicitly
> requires reconstruction of all the zero-intensity data pairs;
> otherwise, for example, a mass spectrum plot would look funny. A
> further complication for method 1 is that the m/z point spacing may
> not necessarily be a constant. For example, for the AB/Sciex QSTAR
> instrument, the m/z spacing is proportional to the square root of m/z,
> and this is a natural consequence of this being a TOF instrument.
There is a method 3 which efficiently reduces space for profile spectra
containing many zeros: any data point with zero intensity whose
neighbours on both sides also have zero intensity can be left out. If
you have the following arrays:
int: 1 5 1 0 0 0 0 0 1 6
m/z: 1 2 3 4 5 6 7 8 9 10
These can be reduced to:
int: 1 5 1 0 0 1 6
m/z: 1 2 3 4 8 9 10
This is ok to do in mzML.
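As a concrete sketch of that reduction (the function name and implementation are mine, not part of any mzML tooling):

```python
def reduce_profile(mz, intensity):
    """Drop zero-intensity points whose neighbours are also zero,
    keeping the boundary zeros so the peak shape is preserved."""
    keep_mz, keep_int = [], []
    n = len(intensity)
    for i in range(n):
        left_zero = i > 0 and intensity[i - 1] == 0
        right_zero = i < n - 1 and intensity[i + 1] == 0
        # keep every non-zero point, and keep a zero only if at least
        # one of its neighbours is non-zero (or it sits at an edge)
        if intensity[i] != 0 or not (left_zero and right_zero):
            keep_mz.append(mz[i])
            keep_int.append(intensity[i])
    return keep_mz, keep_int

mz, inten = reduce_profile(list(range(1, 11)), [1, 5, 1, 0, 0, 0, 0, 0, 1, 6])
# mz    -> [1, 2, 3, 4, 8, 9, 10]
# inten -> [1, 5, 1, 0, 0, 1, 6]
```

Applied to the arrays above, this reproduces the reduced arrays shown.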
On the other hand, it would be very useful to have a way to specify the
m/z spacing, since it can be quite tricky to derive for TOF data,
especially when a calibration function has been applied over the
square-root-spaced m/z values, so that they are no longer spaced exactly
proportionally to the square root of m/z. Probably the initial spacing
and polynomial calibration functions could be specified using CV terms;
such terms are just not in the CV (yet). Suggestions for this would be
welcome.
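To illustrate why the spacing grows with sqrt(m/z) on a TOF: with an idealised calibration m/z = (c * (t - t0))^2 and uniform time bins, the m/z spacing between consecutive bins is asymptotically proportional to sqrt(m/z). A sketch (the parameter values are arbitrary, purely for illustration):

```python
import math

def tof_mz_axis(n_bins, dt, c, t0=0.0):
    """m/z axis for an idealised TOF: m/z = (c * (t - t0))**2,
    sampled at uniform time bins t_i = t0 + i*dt."""
    return [(c * (i * dt)) ** 2 for i in range(n_bins)]

mz = tof_mz_axis(n_bins=1000, dt=1e-9, c=2.5e7)
# spacing between consecutive points, divided by sqrt(m/z),
# approaches the constant 2 * c * dt for large i:
ratios = [(mz[i + 1] - mz[i]) / math.sqrt(mz[i]) for i in range(1, 999)]
```

A real calibration polynomial applied on top of this would perturb the spacing, which is exactly why a CV-term description of the spacing would help.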
> * The validator expects elements to appear in a certain order. This is
> due to the usage of xs:sequence in the XSD file. All deviations from
> the specified order are marked as errors, and I don't think that this
> is really the desired behavior. There's nothing intrinsic to XML that
> makes restricting order desirable, and in most cases for mzML, there
> is absolutely nothing to be gained by restricting order.
I think the order is quite important, since parsers will generally not
be able to load entire files into memory. SAX/StAX parsing is needed due
to the large size of mzML files. Parsing becomes easier if we know that
referenceableParamGroups and other referenceable things are found at the
beginning of the file.
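A minimal sketch of such streaming parsing, using Python's built-in iterparse on a schematic, mzML-like fragment (the fragment is heavily simplified and not schema-valid mzML):

```python
import io
import xml.etree.ElementTree as ET

# A minimal mzML-like fragment, just for demonstration.
doc = io.StringIO("""<mzML>
  <referenceableParamGroupList>
    <referenceableParamGroup id="common"/>
  </referenceableParamGroupList>
  <run>
    <spectrumList><spectrum index="0"/><spectrum index="1"/></spectrumList>
  </run>
</mzML>""")

groups = {}
spectra_seen = 0
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "referenceableParamGroup":
        # Because the groups appear before the spectra in the file,
        # they are already resolved when the spectra are streamed past.
        groups[elem.get("id")] = elem
    elif elem.tag == "spectrum":
        spectra_seen += 1
        elem.clear()  # free memory; the whole file is never held at once
```

If the groups could appear after the spectra, a streaming parser would need a second pass (or to buffer unresolved references), which is what the element ordering avoids.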
> * The validator doesn't appear to recognize <userParam> at all - i.e.,
> any time <userParam> is put into the mzML, the validator gives an
> error. This may possibly be related to the previous point, but I tried
> putting <userParam> in all possible locations, and nothing seemed to
> work.
The userParams are simply ignored by the semantic validator (or at least
they are supposed to be). On the other hand, the XSD specifies that, for
a given element, cvParams must come before userParams. I don't think
this is a problem: even if we were allowed to write a mixture of
cvParams and userParams in a block of data, we could not be sure which
ones are related anyway, due to the unordered nature of XML.
The tiny1 example and also the peak list example files contain
userParams and validate.
> * For the <sourceFile> element, the cvParam mapping rule "MUST supply
> a *child* term of MS:1000561 (data file checksum type) one or more
> times" should be deleted. The checksum of the SOURCE data file seems
> to be completely irrelevant.
I also agreed with this previously, but after discussion I was convinced
that it is important for the integrity of the data. The checksum of the
source file is irrelevant when looking at spectra in the file, but it is
very important for traceability of the data, and that is also a key role
of mzML.
But in some cases it is not workable to retrieve the checksum of a
source file, for example if it was several steps upstream in the
analysis and not available to a converter. I guess just specifying
'unknown' as the checksum value is OK; the requirement for the CV term
just points out that one really should try to specify the checksum value
if at all possible.
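When the source file is available, computing the checksum is cheap even for large raw files if it is done in a streaming fashion. A sketch using SHA-1, which I believe is one of the checksum types covered by the CV (the function name is mine):

```python
import hashlib

def source_file_sha1(path):
    """SHA-1 checksum of a source file, computed in chunks so that
    large raw files need not be loaded into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # read 1 MiB at a time until EOF
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

The resulting hex digest is what would go into the cvParam value for the sourceFile checksum.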
> * There is a mistake somewhere in the rules regarding the
> specification of mass analyzer. There are numerous instrument types
> that have multiple mass analyzers, but the validator rejects any
> instrument that contains more than one mass analyzer. Currently, only
> one <analyzer> subelement is allowed under <componentList>, and the
> <analyzer> element is only allowed to have one child mass analyzer
> type CV term.
You can indeed have several analyzer elements in your componentList;
see lines 29-44 of the example file.
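To make this concrete, here is a schematic componentList with two analyzer elements, as for a Q-TOF, parsed with Python's ElementTree (the fragment is simplified, and the CV accessions shown should be double-checked against the PSI-MS CV):

```python
import xml.etree.ElementTree as ET

# Schematic componentList with two mass analyzers (e.g. a Q-TOF).
# The accession values are illustrative; verify them against the PSI-MS CV.
fragment = """<componentList count="4">
  <source order="1"/>
  <analyzer order="2">
    <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
  </analyzer>
  <analyzer order="3">
    <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
  </analyzer>
  <detector order="4"/>
</componentList>"""

root = ET.fromstring(fragment)
analyzers = root.findall("analyzer")  # both analyzer elements are present
```

Each analyzer carries its own mass-analyzer-type cvParam, so a hybrid instrument is described by repeating the analyzer element rather than stacking CV terms inside one.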