psidev-ms-dev Mailing List for Proteomics Standards Initiative (Page 104)


Hello All,

Well, I'm on record as being aghast at the OBO-centric approach, but I can
see that battle is all but lost.   So, on to pragmatics:

In a world lacking the kind of software development tools available for W3C
schema, Reviewer 1's suggestion of the libSBML approach is apt.  You'll want
to provide a single C/C++ codebase with SWIG-generated bindings to hook it
up to other languages like Java, Perl, Python, Ruby, Matlab, etc* so that
the various custom parsing and validation methods required don't have to be
implemented over and over again.  More importantly, in a system as
unnecessarily complex as this the risk of errors and ambiguities is sorely
elevated, but with a single read/write implementation everybody will at
least be responding to those issues the same way.   Darren Kessler's work
looks promising as the core for this.  Note that I am NOT saying that
Darren's work contains errors (it looks quite nice, actually), but he's
already finding all the inconsistencies and gaps that a fully specified W3C
schema would have largely avoided, and having a single set of responses to
these problems will create a more stable world for mzML.  

(*Reviewer 1 is incorrect in saying that multiple libSBML implementations
exist in multiple languages - there's just one C/C++ implementation with
automatically generated language bindings.  You still have to compile the C
code for the target platform and install the library so the SWIG-generated
language bindings can call into it at runtime.  Not the dream of drop-in
portability that Java and Python promise, but better than nothing.  See
http://sbml.org/software/libsbml/docs/cpp-api/libsbml-installation.html#other-lang
for details.)

Brian

  _____  

From: psi...@li...
[mailto:psi...@li...] On Behalf Of Eric
Deutsch
Sent: Tuesday, January 29, 2008 12:29 AM
To: Mass spectrometry standard development
Cc: Eric Deutsch
Subject: [Psidev-ms-dev] mzML reviews are in

Hi everyone, the mzML reviews are back. Thank you to Norman Paton and the
anonymous reviewers. I have pasted the reviews below for your perusal if you
are interested. I hope everyone can devote a little time in the next month
to making another (and hopefully final) push to get mzML finished. I would
like to propose a telephone conference in a week:

Tuesday February 5:

09:00 San Francisco

12:00 New York

17:00 London (GMT)

18:00 Europe

Let me know if there are concerns about the time.

Later this week I will draft up the list of things yet to do (including
items from reviews and other things that have emerged in the last two
months) and send out an agenda.

Thanks,

Eric

----------------------------------------------------------------------------

Reviewer 1

Introduction

This review concerns Draft of Version 1.0.0 of the specification for the
mzML data format developed by the HUPO Proteomics Standards Initiative.

mzML is intended as a replacement for the existing XML data formats used for
markup of mass spectrometry data, mzData and mzXML.

Implementation of the format

The philosophy behind the development of mzML is to combine the flexibility
of mzData with the robustness of mzXML. mzData allowed the addition of new
controlled vocabulary terms as technologies and concepts develop, but has
the disadvantage of allowing inconsistent use of these terms. mzXML is less
flexible in terms of allowing the use of controlled vocabulary terms, and
has the disadvantage of requiring full schema revisions to keep pace with
advancements in mass spectrometry.

mzML proposes solving this issue by releasing a semantic validator with the
data format, enforcing rules as to which controlled vocabularies may (and
must) be used within a given location in the document. This appears a
sensible approach, as the dependence on both an XML and supplied semantic
validator, along with a managed centralised repository for controlled
vocabulary terms, is likely to prevent mzML from developing into a number of
diversifying dialects.
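
As a rough illustration of the kind of check such a semantic validator performs, the Python sketch below tests that the cvParam children of an element use only permitted accessions and include every required one. The rule table, the sample document and the choice of accessions are assumptions made for illustration; they are not the actual mzML mapping rules, nor the PSI validator's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element name -> (required accessions, allowed accessions).
# These are placeholders, not the real mzML mapping file.
RULES = {
    "sample": ({"MS:1000002"}, {"MS:1000002", "MS:1000053"}),
}

def validate(xml_text, rules=RULES):
    """Return a list of human-readable rule violations."""
    problems = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag not in rules:
            continue
        required, allowed = rules[elem.tag]
        # Collect the accessions used by direct cvParam children.
        used = {cv.get("accession") for cv in elem.findall("cvParam")}
        for acc in sorted(required - used):
            problems.append(f"<{elem.tag}> is missing required term {acc}")
        for acc in sorted(used - allowed):
            problems.append(f"<{elem.tag}> uses term {acc}, which is not allowed here")
    return problems

# Invented sample document: the second sample breaks both kinds of rule.
DOC = """
<sampleList>
  <sample id="s1">
    <cvParam accession="MS:1000002" name="sample name" value="yeast lysate"/>
  </sample>
  <sample id="s2">
    <cvParam accession="MS:1000031" name="instrument model"/>
  </sample>
</sampleList>
"""

for line in validate(DOC):
    print(line)
```

Keeping the rules in a table separate from the checking code is what would let the permitted terms change without a schema revision.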

An existing open issue is how to support new CV terms. The most robust, and
therefore preferable, approach is the "ideal world" scenario, in which a new
term is suggested to the CV coordinator, who would verify that the term is
indeed novel and not a synonym of an existing term, and if so add the term
to the CV by updating the centralised repository that is used by the
existing semantic validator. It is claimed that this approach may be
objectionable to some, due to the fact that parsing software must be
connected to the internet to ensure that the semantic validation is taking
place according to the updated CV. This seems a somewhat outdated concern,
as software is increasingly written on the assumption that an internet connection
is present. Furthermore, there have been concerns that implementing such an
approach is non-trivial. This too seems an improbable argument, as
implementing a module to download a centralised CV parameter file would
appear to be far less complex than the implementation of tools to process
the data once it has been parsed.
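
To give a sense of scale for the claim above, here is a minimal Python sketch of the novelty check a CV coordinator (or a tool) might run against the centralised CV file. It assumes the CV is distributed as an OBO-style flat file; the inline excerpt stands in for the downloaded text, and the synonym in it is invented for the example. The function names are hypothetical.

```python
import re

# Tiny inline excerpt in OBO flat-file style. In practice this text would be
# fetched from the central CV repository (no location is assumed here).
# The synonym line is invented for illustration.
OBO_TEXT = """\
[Term]
id: MS:1000002
name: sample name
synonym: "sample label" EXACT []

[Term]
id: MS:1000053
name: sample batch
"""

def load_terms(obo_text):
    """Map every lower-cased name or synonym to the accession that owns it."""
    index = {}
    current_id = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current_id = None
        elif line.startswith("id:"):
            current_id = line[3:].strip()
        elif line.startswith("name:") and current_id:
            index[line[5:].strip().lower()] = current_id
        elif line.startswith("synonym:") and current_id:
            match = re.search(r'"([^"]*)"', line)
            if match:
                index[match.group(1).lower()] = current_id
    return index

def check_proposal(term, index):
    """Return the accession that already covers `term`, or None if it is novel."""
    return index.get(term.strip().lower())

index = load_terms(OBO_TEXT)
for proposal in ("Sample Label", "fraction number"):
    owner = check_proposal(proposal, index)
    print(proposal, "->", owner if owner else "novel; forward to the CV coordinator")
```

A real check would also need fuzzy matching and human judgement, since synonyms are rarely exact string matches.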

Many of these issues could be resolved by implementing a number of parser
libraries, such as have been developed for the parsing of systems biology
markup language (SBML)1 in the form of libSBML2. libSBML is implemented in a
number of programming languages, and as such has been widely adopted by the
systems biology software development community, removing the need to
implement parsers. This approach allows the software development community
to concentrate on tool development, and reduces some of the issues that are
found objectionable here.

It is claimed that use of additional ontologies may be useful to supplement
the mzML controlled vocabulary. It is very unclear why ChEBI3 - an ontology
describing chemical entities - may be considered appropriate in these
circumstances. mzML concerns itself with mass spectrometry data. Any
subsequent identifications of molecules from this data, in which ChEBI terms
may be appropriate, is considered separately in other formats such as the
forthcoming analysisXML.

The model XML schema

The concepts described in the XML schema are familiar to those who have
previously used mzData. Most of the schema appears to satisfactorily cover
the requirements for describing mass spectrometry data.

There are questions related to the concept of describing samples in the
schema as it currently stands.

*	The first regards the concept of multidimensional LC/MS/MS
experiments, in which an individual sample may be separated by a number of
LC steps, generating a number of fractions from a single sample. Each of
these subsamples is typically analysed by performing an individual
acquisition, thus generating a number of runs and sourceFiles for the
original sample. As the mzML schema specifies that there is a 1:1
relationship between mzML and run, it is not clear whether an individual
mzML file in this context would contain data from an individual subsample.
run has a single, optional sampleRef attribute, but also may contain a
sourceFileRefList, in which a number of sourceFiles can be referenced.
Furthermore, a CV term exists to describe a 'sample batch' (MS:1000053),
but it is unclear how the current schema could be used to relate subsamples
to an original, pre-fractionation sample. It is thought then that the
management of the relationship between original sample and subsamples is not
unambiguously catered for in the schema as it stands. As such
multidimensional LC/MS/MS experiments are becoming increasingly commonplace,
it is felt that this is an issue that may need to be addressed.
*	The second question regards a related yet separate issue. In the
case of quantitative proteomics (and metabolomics) experiments, a given
sample that is analysed with MS is usually a mixture of two or more samples,
which are isotopically labelled to allow components to be identified and
quantified. Considering the example of a proteomic iTRAQ4 experiment, the
sample that is analysed is a mixture of four (latterly, eight) samples, each
of which is labelled with an individual isotopic component. The schema
allows for multiple samples to be specified in the sampleList, which is
appropriate for such experiments. It is however unclear as to how or if
these samples can be annotated in such a way that allows mzML to be used in
quantitative analyses. In order to do this, any analysis software would need
to know both the type of quantitative experiment that was performed (iTRAQ,
SILAC5, ICAT6, etc.) and how the individual samples were labelled (iTRAQ
label 114, C-terminal 18O, etc.). It may be that this meta-data is
considered to be outside of the scope of the mzML format. If this meta-data
were absent, then mzML files of this format would be incredibly difficult to
analyse by third parties. A similar problem occurs with digestion protocols
used to generate the sample. In the case of typical proteomics studies in
which the sample is digested before analysis, if meta-data regarding which
digestion enzyme was used were not present, this data would be very
difficult to analyse with database search engines.
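
To make the second point concrete, the Python sketch below shows the kind of lookup a quantitative tool would need to perform on an iTRAQ spectrum. The sample identifiers, the `LABELS` table, the tolerance and the peak values are all assumptions for illustration; nothing here reflects a structure defined in the mzML schema.

```python
# Hypothetical per-sample labelling metadata for a 4-plex iTRAQ run.
# Keys are invented sample ids; values are nominal reporter ion masses.
LABELS = {"control": 114, "treated_1h": 115, "treated_4h": 116, "treated_24h": 117}

def quantify(peaks, labels, tolerance=0.5):
    """Sum intensity within `tolerance` of each sample's reporter mass.

    `peaks` is a list of (m/z, intensity) pairs from one MS/MS spectrum.
    """
    totals = {}
    for sample, reporter_mz in labels.items():
        totals[sample] = sum(
            intensity for mz, intensity in peaks if abs(mz - reporter_mz) <= tolerance
        )
    return totals

# Invented peak list: four reporter ions plus one unrelated fragment.
spectrum = [(114.1, 1200.0), (115.1, 900.0), (116.1, 300.0), (117.1, 50.0), (175.1, 4000.0)]
print(quantify(spectrum, LABELS))
```

Without something equivalent to the `LABELS` table, the four reporter intensities can still be read from the spectrum but cannot be attributed to the original samples.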

References

1. The systems biology markup language (SBML): a medium for representation and
exchange of biochemical network models. Hucka M, et al. Bioinformatics
19(4):524-31 (2003).

2. http://sbml.org/software/libsbml/

3. ChEBI: a database and ontology for chemical entities of biological
interest. Degtyarenko K, et al. Nucleic Acids Res. 36:D344-50 (2008).

4. Multiplexed protein quantitation in Saccharomyces cerevisiae using
amine-reactive isobaric tagging reagents. Ross PL, et al. Mol Cell
Proteomics 3(12):1154-69 (2004).

5. Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple
and Accurate Approach to Expression Proteomics. Ong SE, et al. Mol. Cell.
Proteom. 1:376-386 (2002).

6. Proteome analysis of low-abundance proteins using multidimensional
chromatography and isotope-coded affinity tags. Gygi SP, et al. J Proteome
Res. 1(1):47-54 (2002).

Reviewer 2

The specification clearly fits the purpose, as there is a need for a broadly
adopted standard, both among the MS vendors and tool developers, as well as
among users.

The specification is clearly written and at this stage there is no need for
any major changes. One potential consideration: The authors have decided to
create a comprehensive and expandable standard, which means that the
specification itself is quite heavy, especially all the cvParam material.
This trades the usual problem of having too restrictive a standard for
potential future problems: maintaining CV terms and having very incomplete
implementations of parser/writer code in software tools. It will thus be
essential to have broad community support for mzML to succeed and grow
with the advancing field.

Reviewer 3

I looked at the Word document briefly, but I wasn't able to answer my
question about whether mzML supports the storage of MRM (multiple reaction
monitoring) data.  This is an alternative scanning strategy that records
intensities for a small set of specified m/z transitions.
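
For readers unfamiliar with MRM, the data can be pictured as one intensity trace per transition rather than a series of full spectra. The Python sketch below is only a conceptual model with invented values and names; it says nothing about how mzML does or does not encode such data.

```python
from collections import defaultdict

# Each MRM reading: (retention time in s, precursor m/z, product m/z, intensity).
# Values are invented for illustration.
readings = [
    (10.0, 500.3, 620.4, 150.0),
    (10.0, 500.3, 733.5, 90.0),
    (10.5, 500.3, 620.4, 480.0),
    (10.5, 500.3, 733.5, 310.0),
]

def traces_by_transition(readings):
    """Group readings into one (time, intensity) trace per transition."""
    traces = defaultdict(list)
    for time, precursor_mz, product_mz, intensity in readings:
        traces[(precursor_mz, product_mz)].append((time, intensity))
    return dict(traces)

for transition, trace in traces_by_transition(readings).items():
    print(transition, trace)
```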

