Re: [Psidev-ms-dev] mzML comments

Hi Wilfred, some comments to some of your comments:

Wilfred H Tang wrote:
>
> * When m/z vs. intensity data is written out in profile mode, it is 
> pretty common to see a LARGE majority of the intensities to be zero. 
> Given the preponderance of zero intensities, a space-efficient way to 
> write the data out would be to specify a point spacing in the m/z 
> dimension and then write out a (m/z, intensity) pair only if the 
> intensity is non-zero. (Call this method 1.) The alternative, less 
> space-efficient way would be to write out all of the (m/z, intensity) 
> data pairs even though most of them have zero intensities and hence 
> are not all that interesting. (Call this method 2.) For method 1 to 
> work well, there must be a way to specify a m/z point spacing. Is 
> there a way to do this currently? Furthermore, the program reading in 
> the mzML must understand that the m/z point spacing implicitly 
> requires reconstruction of all the zero-intensity data pairs; 
> otherwise, for example, a mass spectrum plot would look funny. A 
> further complication for method 1 is that the m/z point spacing may 
> not necessarily be a constant. For example, for the AB/Sciex QSTAR 
> instrument, the m/z spacing is proportional to the square root of m/z, 
> and this is a natural consequence of this being a TOF instrument.
There is also a method 3 that saves space for profile spectra 
containing many zeros. Every zero-intensity data point whose 
neighbours on both sides also have zero intensity can be left out. If 
you have the following arrays:
int: 1 5 1 0 0 0 0 0 1 6
m/z: 1 2 3 4 5 6 7 8 9 10
These can be reduced to:
int: 1 5 1 0 0 1 6
m/z: 1 2 3 4 8 9 10
This is ok to do in mzML.
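
As a minimal sketch of method 3, the function below drops every 
zero-intensity point whose two neighbours are also zero. The function 
name is hypothetical, and keeping the first and last points of the 
spectrum unconditionally is an assumption, not something mzML mandates.

```python
def trim_zero_runs(mz, intensity):
    """Drop zero-intensity points whose neighbours on both sides are zero."""
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    n = len(intensity)
    kept_mz, kept_int = [], []
    for i in range(n):
        # Assumption: a missing neighbour (first/last point) counts as
        # non-zero, so the end points of the spectrum are always kept.
        prev_zero = i > 0 and intensity[i - 1] == 0
        next_zero = i < n - 1 and intensity[i + 1] == 0
        if intensity[i] == 0 and prev_zero and next_zero:
            continue  # interior of a zero run: safe to leave out
        kept_mz.append(mz[i])
        kept_int.append(intensity[i])
    return kept_mz, kept_int


mz = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
intensity = [1, 5, 1, 0, 0, 0, 0, 0, 1, 6]
print(trim_zero_runs(mz, intensity))
```

Applied to the arrays above, this keeps the zeros at m/z 4 and 8 that 
flank the run, so a plot still drops to the baseline on both sides.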

On the other hand, it would be very useful to have a way to specify the 
m/z spacing, since it can be quite tricky to recover this for TOF data, 
especially when a calibration function has been applied on top of the 
square-root-spaced m/z values, so that they are no longer spaced exactly 
in proportion to the square root of m/z. The initial spacing and the 
polynomial calibration function could probably be specified using CV 
terms, but such terms are not in the CV (yet). Suggestions for this 
would be welcome.
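
To make the idea concrete, here is a rough sketch of how a reader could 
rebuild an m/z axis from such parameters. It assumes the raw points are 
evenly spaced in sqrt(m/z) and that the calibration is a polynomial 
applied to the raw m/z. All parameter names are hypothetical, since no 
corresponding CV terms exist.

```python
def tof_mz_axis(sqrt_mz_start, sqrt_mz_step, n_points, calib_coeffs=(0.0, 1.0)):
    """Rebuild an m/z axis that is evenly spaced in sqrt(m/z).

    calib_coeffs are polynomial coefficients, lowest order first; the
    default (0, 1) is the identity, i.e. no calibration applied.
    All names here are illustrative, not mzML or CV terms.
    """
    axis = []
    for k in range(n_points):
        # Uniform steps in sqrt(m/z) give a spacing proportional to sqrt(m/z).
        raw_mz = (sqrt_mz_start + k * sqrt_mz_step) ** 2
        # Evaluate the calibration polynomial with Horner's rule.
        calibrated = 0.0
        for coeff in reversed(calib_coeffs):
            calibrated = calibrated * raw_mz + coeff
        axis.append(calibrated)
    return axis


print(tof_mz_axis(10.0, 1.0, 3))
```

With a non-trivial calibration the resulting spacing is no longer 
exactly proportional to sqrt(m/z), which is why both pieces of 
information would have to be stored.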

> * The validator expects elements to appear in a certain order. This is 
> due to the usage of xs:sequence in the XSD file. All deviations from 
> the specified order are marked as errors, and I don't think that this 
> is really the desired behavior. There's nothing intrinsic to XML that 
> makes restricting order desirable, and in most cases for mzML, there 
> is absolutely nothing to be gained by restricting order.
I think the order is quite important, since parsers will not be able to 
load most files into memory. SAX/StAX parsing is needed because mzML 
files are so large. Parsing gets easier if we know that 
referenceableParamGroups and other referenceable things are found at 
the beginning of the file.
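
A small sketch of why this helps a streaming reader, using Python's 
standard iterparse. The embedded document is a simplified, 
namespace-free stand-in for mzML, not a schema-valid file, and the 
function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for an mzML file (no namespace, not schema-valid).
SAMPLE = b"""<mzML>
  <referenceableParamGroupList>
    <referenceableParamGroup id="CommonMS1">
      <cvParam accession="MS:1000579" name="MS1 spectrum"/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <run>
    <spectrum id="scan=1">
      <referenceableParamGroupRef ref="CommonMS1"/>
    </spectrum>
  </run>
</mzML>"""


def stream_spectra(source):
    """Yield (spectrum id, resolved cvParam names) one spectrum at a time."""
    groups = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "referenceableParamGroup":
            groups[elem.get("id")] = [p.get("name") for p in elem.findall("cvParam")]
        elif elem.tag == "spectrum":
            names = []
            for ref in elem.findall("referenceableParamGroupRef"):
                # Raises KeyError if the group has not been seen yet, which
                # is exactly what a fixed element order rules out.
                names.extend(groups[ref.get("ref")])
            yield elem.get("id"), names
            elem.clear()  # release the spectrum's children once handled


print(list(stream_spectra(io.BytesIO(SAMPLE))))
```

If the group list were allowed to come after the spectra, the reader 
would have to buffer every spectrum or make a second pass over the file.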

>
> * The validator doesn't appear to recognize <userParam> at all - i.e., 
> any time <userParam> is put into the mzML, the validator gives an 
> error. This may possibly be related to the previous point, but I tried 
> putting <userParam> in all possible locations, and nothing seemed to 
> work.
The userParams are simply ignored by the semantic validator (or at least 
are supposed to be). On the other hand, the XSD specifies that, within 
a given element, cvParams must come before userParams. I don't think 
this is a problem. If we were allowed to write a mixture of cvParams and 
userParams in a block of data, we could not be sure which are related 
anyway due to the unordered nature of XML.
The tiny1 example and the peak list example files both contain 
userParams and validate.
>
> * For the <sourceFile> element, the cvParam mapping rule "MUST supply 
> a *child* term of MS:1000561 (data file checksum type) one or more 
> times" should be deleted. The checksum of the SOURCE data file seems 
> to be completely irrelevant.
I used to agree with this, but discussions convinced me that the 
checksum is important for the integrity of the data. The checksum of 
the source file is irrelevant when looking at spectra in the file, but 
very important for traceability of data, which is also a key role of 
mzML.
In some cases, though, it is not feasible to retrieve the checksum of a 
source file, for example if it was several steps upstream in the 
analysis and is not available to the converter. I guess just specifying 
'unknown' as the checksum value is OK. The requirement for the CV term 
just points out that one really should try to specify the checksum 
value if possible.
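
As an illustration, a converter could compute the checksum along these 
lines. This sketch assumes SHA-1 is the chosen checksum type and falls 
back to 'unknown' as suggested above. The function name is hypothetical.

```python
import hashlib
import os


def source_file_sha1(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a source file, or 'unknown'.

    Assumption: 'unknown' is an acceptable placeholder value when the
    original file is not available to the converter.
    """
    if path is None or not os.path.isfile(path):
        return "unknown"
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large raw files are never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string would then go into the value of the checksum 
cvParam under the sourceFile element.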

>
> * There is a mistake somewhere in the rules regarding the 
> specification of mass analyzer. There are numerous instrument types 
> that have multiple mass analyzers, but the validator rejects any 
> instrument that contains more than one mass analyzer.  Currently, only 
> one <analyzer> subelement is allowed under <componentList>, and the 
> <analyzer> element is only allowed to have one child mass analyzer 
> type CV term.
You can indeed have several analyzer elements in your componentList, see:
http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/plgs_example.mzML
at lines 29-44.

Regards

Fredrik