From: Angel P. <an...@ma...> - 2008-01-29 16:08:38
Hi Eric,

Thanks for the forward. I won't be able to attend the call as I will be in transit. My comments below.

On 1/29/08, Eric Deutsch <ede...@sy...> wrote:
>
> *Reviewer 1*
> *Implementation of the format*
>
> The philosophy behind the development of mzML is to combine the flexibility of mzData with the robustness of mzXML. mzData allowed the addition of new controlled vocabulary terms as technologies and concepts develop, but has the disadvantage of allowing inconsistent use of these terms. mzXML is less flexible in terms of allowing the use of controlled vocabulary terms, and has the disadvantage of requiring full schema revisions to keep pace with advancements in mass spectrometry.
>
> mzML proposes solving this issue by releasing a semantic validator with the data format, enforcing rules as to which controlled vocabulary terms may (and must) be used within a given location in the document. This appears a sensible approach, as the dependence on both an XML validator and a supplied semantic validator, along with a managed centralised repository for controlled vocabulary terms, is likely to prevent mzML from developing into a number of diverging dialects.

This is what folks were complaining about in previous threads, and it is a contentious issue. The main criticism, to summarize, is that the current use of cvParams coupled to a special-purpose, custom-built semantic validator is operationally no different from a "hard-coded" and quickly evolving XML schema. Personally I am of the opinion that the cvParam usage has advantages over a quickly evolving schema, but I don't have a good answer for semantic validation of those terms. The only things I can propose (which are answers, but not good ones) are that we: (1) use RDF for the CV instead of OBO, hence dropping the non-standard validator; or (2) move closer to current mzXML practice, putting the very important and slow-changing terms in the schema and leaving terms that are non-essential to capturing mass-spec data as cvParams. The problem with #2 is that "non-essential" means widely different things to different people, so that is a BIG AND LONG conversation. For that reason I am inclined to move to option #1, or at least try it out for size. I am aware of how ridiculous RDF is, BTW. Not a fan, but there seems to be a lot of momentum behind the idea in the W3C and other standards groups. Read that as "we won't be deserted on an island" if we develop around RDF.
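To make the trade-off concrete, here is a rough sketch of the same piece of metadata expressed both ways. The element layout and attribute names are only approximated from the draft, and the accession is illustrative, so read this as a sketch rather than actual 0.9 syntax:

<!-- Option 1: CV-driven, as in the current draft. The schema only sees a
     generic cvParam; the semantic validator has to check that a term of the
     right class appears at this location and that the accession exists in
     the CV. (Layout approximated; accession illustrative.) -->
<scan>
  <cvParam cvRef="MS" accession="MS:1000130" name="positive scan" value=""/>
</scan>

<!-- Option 2: hard-coded, mzXML-style. A plain XML validator is enough
     because the XSD itself enumerates the legal values, but adding any new
     concept means revising and re-releasing the schema. -->
<scan polarity="+"/>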
> An existing open issue is how to support new CV terms. The most robust, and therefore preferable, approach is the "ideal world" scenario, in which a new term is suggested to the CV coordinator, who would verify that the term is indeed novel and not a synonym of an existing term, and if so add the term to the CV by updating the centralised repository that is used by the existing semantic validator. It is claimed that this approach may be objectionable to some, due to the fact that parsing software must be connected to the internet to ensure that the semantic validation is taking place according to the updated CV. This seems a somewhat outdated concern, as software is increasingly written on the assumption that an internet connection is present. Furthermore, there have been concerns that implementing such an approach is non-trivial. This too seems an improbable argument, as implementing a module to download a centralised CV parameter file would appear to be far less complex than the implementation of tools to process the data once it has been parsed.

Has to be part of the MS CV working group process.

> Much of these issues could be resolved by implementing a number of parser libraries, such as have been developed for parsing the systems biology markup language (SBML)1 in the form of libSBML2. libSBML is implemented in a number of programming languages, and as such has been widely taken up by the systems biology software development community, removing the need to implement parsers. This approach allows the software development community to concentrate on tool development, and reduces some of the issues that are found objectionable here.

Never heard of it.

> It is claimed that use of additional ontologies may be useful to supplement the mzML controlled vocabulary. It is very unclear why ChEBI3 - an ontology describing chemical entities - may be considered appropriate in these circumstances. mzML concerns itself with mass spectrometry data. Any subsequent identification of molecules from this data, in which ChEBI terms may be appropriate, is considered separately in other formats such as the forthcoming analysisXML.

Ditto, and also out of scope for mzML.

> *The model XML schema*
>
> The concepts described in the XML schema are familiar to those who have previously used mzData. Most of the schema appears to satisfactorily cover the requirements for describing mass spectrometry data.
>
> There are questions related to the concept of describing samples in the schema as it currently stands.
>
> - The first regards multidimensional LC/MS/MS experiments, in which an individual sample may be separated by a number of LC steps, generating a number of fractions from a single sample. Each of these subsamples is typically analysed in an individual acquisition, thus generating a number of runs and sourceFiles for the original sample. As the mzML schema specifies that there is a 1:1 relationship between mzML and run, it is not clear whether an individual mzML file in this context would contain data from an individual subsample. run has a single, optional sampleRef attribute, but may also contain a sourceFileRefList, in which a number of sourceFiles can be referenced. Furthermore, a CV term exists to describe a 'sample batch' (MS:10000053), but it is unclear how the current schema could be used to relate subsamples to an original, pre-fractionation sample. It is thought, then, that the management of the relationship between the original sample and its subsamples is not unambiguously catered for in the schema as it stands. As such multidimensional LC/MS/MS experiments are becoming increasingly commonplace, it is felt that this is an issue that may need to be addressed.
> - The second question regards a related yet separate issue. In the case of quantitative proteomics (and metabolomics) experiments, a given sample that is analysed with MS is usually a mixture of two or more samples, which are isotopically labelled to allow components to be identified and quantified. Considering the example of a proteomic iTRAQ4 experiment, the sample that is analysed is a mixture of four (latterly, eight) samples, each of which is labelled with an individual isotopic component. The schema allows for multiple samples to be specified in the sampleList, which is appropriate for such experiments. It is however unclear how, or whether, these samples can be annotated in such a way that allows mzML to be used in quantitative analyses. In order to do this, any analysis software would need to know both the type of quantitative experiment that was performed (iTRAQ, SILAC5, ICAT6, etc.) and how the individual samples were labelled (iTRAQ label 114, C-terminal 18O, etc.). It may be that this meta-data is considered to be outside the scope of the mzML format. If this meta-data were absent, then mzML files of this format would be incredibly difficult for third parties to analyse. A similar problem occurs with the digestion protocols used to generate the sample. In the case of typical proteomics studies in which the sample is digested before analysis, if meta-data regarding which digestion enzyme was used were not present, the data would be very difficult to analyse with database search engines.

These two are related and are outside of our scope, IMHO.
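For concreteness, though, here is a rough sketch of what the reviewer seems to be asking for, using the element names cited in the review. The parent-sample userParam and the iTRAQ cvParam below are hypothetical conventions, the accession is a placeholder, and the attribute names are guesses at the draft schema, so none of this should be read as something the 0.9 schema or CV actually defines:

<sampleList count="3">
  <!-- the original, pre-fractionation sample -->
  <sample id="S1" name="pooled lysate"/>
  <!-- one LC fraction of S1; nothing in the draft states this relationship,
       so it is expressed here with a hypothetical userParam convention -->
  <sample id="S1_F07" name="fraction 7 of pooled lysate">
    <userParam name="parent sample" value="S1"/>
  </sample>
  <!-- one iTRAQ channel; the label would need to be a real CV term
       (the accession here is only a placeholder) for software to rely on it -->
  <sample id="S1_114" name="control channel">
    <cvParam cvRef="MS" accession="MS:XXXXXXX" name="iTRAQ label 114" value=""/>
  </sample>
</sampleList>

<run id="R1" sampleRef="S1_F07">
  <!-- one acquisition of one fraction; the run can still reference several
       source files, which is where the 1:1 ambiguity the reviewer raises comes in -->
  <sourceFileRefList count="1">
    <sourceFileRef ref="RAW_F07"/>
  </sourceFileRefList>
</run>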
> *Reviewer 3*
>
> I looked at the Word document briefly, but I wasn't able to answer my question about whether mzML supports the storage of MRM (multiple reaction monitoring) data. This is an alternative scanning strategy that records intensities for a small set of specified m/z transitions.

Was that in there? I thought we had not worked the scheme out for MRM in 0.9?

-angel

--
Angel Pizarro
Director, ITMAT Bioinformatics Facility
806 Biological Research Building
421 Curie Blvd.
Philadelphia, PA 19104-6160
215-573-3736