[Psidev-ms-dev] mzML reviews are in

Hi everyone, the mzML reviews are back. Thank you to Norman Paton and
the anonymous reviewers. I have pasted the reviews below for your
perusal if you are interested. I hope everyone can devote a little time
in the next month to making another (and hopefully final) push to get
mzML finished. I would like to propose a telephone conference in a week:

Tuesday February 5:

09:00 San Francisco

12:00 New York

17:00 London (GMT)

18:00 Europe

Let me know if there are concerns about the time.

Later this week I will draft up the list of things yet to do (including
items from reviews and other things that have emerged in the last two
months) and send out an agenda.

Thanks,

Eric

------------------------------------------------------------------------

Reviewer 1

Introduction

This review concerns the draft of version 1.0.0 of the specification for
the mzML data format developed by the HUPO Proteomics Standards Initiative.

mzML is intended as a replacement for the existing XML data formats used
for markup of mass spectrometry data, mzData and mzXML.

Implementation of the format

The philosophy behind the development of mzML is to combine the
flexibility of mzData with the robustness of mzXML. mzData allowed the
addition of new controlled vocabulary terms as technologies and
concepts develop, but has the disadvantage of allowing inconsistent use
of these terms. mzXML is less flexible in terms of allowing the use of
controlled vocabulary terms, and has the disadvantage of requiring full
schema revisions to keep pace with advancements in mass spectrometry.

mzML proposes solving this issue by releasing a semantic validator with
the data format, enforcing rules as to which controlled vocabularies may
(and must) be used within a given location in the document. This appears
a sensible approach, as the dependence on both an XML validator and a
supplied semantic validator, along with a managed centralised repository for
controlled vocabulary terms, is likely to prevent mzML from developing
into a number of diversifying dialects.
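
As a rough illustration of the semantic-validation idea described above,
the following Python sketch checks the CV accessions found at a document
location against "may" and "must" rules. The rule table, the location
names and the EX: accessions are hypothetical placeholders; they are not
taken from the mzML mapping file or from the PSI-MS controlled vocabulary.

```python
# Hypothetical rule table: for each document location, which CV accessions
# are allowed ("may") and which must appear at least once ("must").
RULES = {
    "instrument": {
        "may": {"EX:0001", "EX:0002", "EX:0003"},
        "must": {"EX:0001"},
    },
    "spectrum": {
        "may": {"EX:0010", "EX:0011"},
        "must": set(),
    },
}


def validate(location, accessions):
    """Return a list of human-readable rule violations (empty if valid)."""
    rule = RULES.get(location)
    if rule is None:
        return [f"no rules defined for location '{location}'"]
    found = set(accessions)
    errors = []
    # Terms that are present but not permitted at this location.
    for acc in sorted(found - rule["may"]):
        errors.append(f"{acc} is not allowed in '{location}'")
    # Terms that are mandatory at this location but absent.
    for acc in sorted(rule["must"] - found):
        errors.append(f"{acc} is required in '{location}' but missing")
    return errors
```

A real validator would presumably load such rules from a file released
alongside the schema rather than hard-coding them, so that the rules can
change without a schema revision.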

An existing open issue is how to support new CV terms. The most robust,
and therefore preferable, approach is the "ideal world" scenario, in
which a new term is suggested to the CV coordinator, who would verify
that the term is indeed novel and not a synonym of an existing term,
and if so add the term to the CV by updating the centralised repository
that is used by the existing semantic validator. It is claimed that this
approach may be objectionable to some, due to the fact that parsing
software must be connected to the internet to ensure that the semantic
validation is taking place according to the updated CV. This seems a
somewhat outdated concern, as software is increasingly written on the
assumption that an internet connection is present. Furthermore, there have
been concerns that implementing such an approach is non-trivial. This
too seems an improbable argument, as implementing a module to download a
centralised CV parameter file would appear to be far less complex than
the implementation of tools to process the data once it has been parsed.
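
To give a sense of scale, here is a minimal Python sketch of the check the
CV coordinator would make: is a proposed term already present, either as a
name or as a synonym? It assumes the CV is available as OBO-format text;
the download step is omitted, and the sample terms and EX: accessions are
invented for illustration only.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas from OBO text into {id: {"name", "synonyms"}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": "", "synonyms": []} if line == "[Term]" else None
        elif current is not None and line.startswith("id:"):
            terms[line[3:].strip()] = current
        elif current is not None and line.startswith("name:"):
            current["name"] = line[5:].strip()
        elif current is not None and line.startswith("synonym:"):
            # OBO synonyms are quoted, e.g.: synonym: "text" EXACT []
            parts = line.split('"')
            if len(parts) >= 3:
                current["synonyms"].append(parts[1])
    return terms


def find_existing(terms, proposed):
    """Return the id of a term whose name or synonym matches, else None."""
    wanted = proposed.strip().lower()
    for term_id, term in terms.items():
        labels = [term["name"]] + term["synonyms"]
        if wanted in (label.lower() for label in labels):
            return term_id
    return None


# Invented sample data standing in for a downloaded CV file.
SAMPLE_OBO = '''
[Term]
id: EX:0000001
name: example analyzer
synonym: "example analyser" EXACT []

[Term]
id: EX:0000002
name: example detector
'''
```

With this sample, find_existing(parse_obo_terms(SAMPLE_OBO), "Example
Analyser") reports the existing term rather than treating the proposal as
novel. Exact, case-insensitive matching is a simplification; a human
coordinator would also catch near-synonyms that string comparison misses.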

Many of these issues could be resolved by implementing a number of
parser libraries, such as have been developed for the parsing of Systems
Biology Markup Language (SBML) [1] in the form of libSBML [2]. libSBML is
implemented in a number of programming languages, and as such has been
widely adopted by the systems biology software development community,
removing the need to implement parsers. This approach allows the
software development community to concentrate on tool development, and
reduces some of the issues that are found objectionable here.

It is claimed that use of additional ontologies may be useful to
supplement the mzML controlled vocabulary. It is very unclear why ChEBI [3]
- an ontology describing chemical entities - may be considered
appropriate in these circumstances. mzML concerns itself with mass
spectrometry data. Any subsequent identifications of molecules from this
data, in which ChEBI terms may be appropriate, are considered separately
in other formats such as the forthcoming analysisXML.

The model XML schema

The concepts described in the XML schema are familiar to those who have
previously used mzData. Most of the schema appears to satisfactorily
cover the requirements for describing mass spectrometry data.

There are questions related to the concept of describing samples in the
schema as it currently stands.

*	The first regards the concept of multidimensional LC/MS/MS
experiments, in which an individual sample may be separated by a number
of LC steps, generating a number of fractions from a single sample. Each
of these subsamples is typically analysed by performing an
individual acquisition, thus generating a number of runs and sourceFiles
for the original sample. As the mzML schema specifies that there is a
1:1 relationship between mzML and run, it is not clear whether an
individual mzML file in this context would contain data from an
individual subsample. run has a single, optional sampleRef attribute,
but also may contain a sourceFileRefList, in which a number of
sourceFiles can be referenced. Furthermore, a CV term exists to describe
a 'sample batch' (MS:10000053), but it is unclear how the current schema
could be used to relate subsamples to an original, pre-fractionation
sample. It is thought then that the management of the relationship
between original sample and subsamples is not unambiguously catered for
in the schema as it stands. As such multidimensional LC/MS/MS
experiments are becoming increasingly commonplace, it is felt that this
is an issue that may need to be addressed.
*	The second question regards a related yet separate issue. In the
case of quantitative proteomics (and metabolomics) experiments, a given
sample that is analysed with MS is usually a mixture of two or more
samples, which are isotopically labelled to allow components to be
identified and quantified. Considering the example of a proteomic iTRAQ [4]
experiment, the sample that is analysed is a mixture of four (latterly,
eight) samples, each of which is labelled with an individual isotopic
component. The schema allows for multiple samples to be specified in the
sampleList, which is appropriate for such experiments. It is however
unclear as to how or if these samples can be annotated in such a way
that allows mzML to be used in quantitative analyses. In order to do
this, any analysis software would need to know both the type of
quantitative experiment that was performed (iTRAQ, SILAC [5], ICAT [6], etc.)
and how the individual samples were labelled (iTRAQ label 114,
C-terminal 18O, etc.). It may be that this meta-data is considered to be
outside of the scope of the mzML format. If this meta-data were absent,
then mzML files of this format would be incredibly difficult to analyse
by third parties. A similar problem occurs with digestion protocols used
to generate the sample. In the case of typical proteomics studies in
which the sample is digested before analysis, if meta-data regarding
which digestion enzyme was used were not present, this data would be
very difficult to analyse with database search engines.

References

[1] The systems biology markup language (SBML): a medium for representation
and exchange of biochemical network models. Hucka M, et al.
Bioinformatics 19(4):524-31 (2003).

[2] http://sbml.org/software/libsbml/

[3] ChEBI: a database and ontology for chemical entities of biological
interest. Degtyarenko K, et al. Nucleic Acids Res. 36:D344-50. (2008).

[4] Multiplexed protein quantitation in Saccharomyces cerevisiae using
amine-reactive isobaric tagging reagents. Ross PL, et al. Mol Cell
Proteomics 3(12):1154-69. (2004).

[5] Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a
Simple and Accurate Approach to Expression Proteomics. Ong SE, et al.
Mol. Cell. Proteom. 1:376-386. (2002).

[6] Proteome analysis of low-abundance proteins using multidimensional
chromatography and isotope-coded affinity tags. Gygi SP, et al. J
Proteome Res. 1(1):47-54. (2002).

Reviewer 2

The specification clearly fits the purpose, as there is a need for a
broadly adopted standard, both among the MS vendors and tool developers,
as well as among users.

The specification is clearly written and at this stage there is no need
for any major changes. One potential consideration: The authors have
decided to create a comprehensive and expandable standard, which means
that the specification itself is quite heavy, especially all the cvParam
material. This turns the usual problem of having too restrictive a
standard into potential future problems: maintaining CV terms and having
very incomplete implementations of parser/writer code in software tools.
It will thus be essential to have broad community support for mzML to
succeed and grow with the advancing field.

Reviewer 3

I looked at the Word document briefly, but I wasn't able to answer my
question about whether mzML supports the storage of MRM (multiple
reaction monitoring) data.  This is an alternative scanning strategy
that records intensities for a small set of specified m/z transitions.
