From: Eric D. <ede...@sy...> - 2008-01-29 08:29:30
Hi everyone,

The mzML reviews are back. Thank you to Norman Paton and the anonymous reviewers. I have pasted the reviews below for your perusal if you are interested. I hope everyone can devote a little time in the next month to making another (and hopefully final) push to get mzML finished.

I would like to propose a telephone conference in a week:

Tuesday February 5:

09:00 San Francisco
12:00 New York
17:00 London (GMT)
18:00 Europe

Let me know if there are concerns about the time.

Later this week I will draft the list of things yet to do (including items from the reviews and other things that have emerged in the last two months) and send out an agenda.

Thanks,
Eric

--------------------------------------------------------------------------------------------------

Reviewer 1

Introduction

This review concerns the draft of Version 1.0.0 of the specification for the mzML data format developed by the HUPO Proteomics Standards Initiative. mzML is intended as a replacement for the existing XML data formats used for markup of mass spectrometry data, mzData and mzXML.

Implementation of the format

The philosophy behind the development of mzML is to combine the flexibility of mzData with the robustness of mzXML. mzData allowed the addition of new controlled vocabulary terms as technologies and concepts develop, but has the disadvantage of allowing inconsistent use of these terms. mzXML is less flexible in terms of allowing the use of controlled vocabulary terms, and has the disadvantage of requiring full schema revisions to keep pace with advancements in mass spectrometry. mzML proposes solving this issue by releasing a semantic validator with the data format, enforcing rules as to which controlled vocabularies may (and must) be used within a given location in the document.
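To make the "may (and must)" idea concrete, here is a minimal sketch of a per-location rule check. It is an illustration only, not the actual mzML semantic validator or its rule format: the `Rule` class, the `validate` function, the location path and the `EX:` accessions are all hypothetical placeholders, and a real validator would probably also need to handle term hierarchies and more requirement levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule for one schema location (not the real rule format)."""
    location: str        # illustrative element path
    allowed: frozenset   # accessions that MAY appear at this location
    required: frozenset  # subset of `allowed` that MUST appear


def validate(rule, accessions):
    """Return a list of problems; an empty list means the location is valid."""
    present = set(accessions)
    problems = []
    # Terms that are present but not permitted at this location.
    for acc in sorted(present - rule.allowed):
        problems.append(f"{rule.location}: term {acc} is not allowed here")
    # Terms that the rule requires but that are absent.
    for acc in sorted(rule.required - present):
        problems.append(f"{rule.location}: required term {acc} is missing")
    return problems


# Placeholder accessions, not real CV entries.
EXAMPLE_RULE = Rule(
    location="/mzML/run",
    allowed=frozenset({"EX:0000001", "EX:0000002", "EX:0000003"}),
    required=frozenset({"EX:0000001"}),
)

if __name__ == "__main__":
    for problem in validate(EXAMPLE_RULE, ["EX:0000002", "EX:0000009"]):
        print(problem)
```

In this sketch the rules are plain data, separate from the checking code, so a rule set could be updated without touching the parser.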
This appears a sensible approach, as the dependence on both an XML validator and the supplied semantic validator, along with a managed centralised repository for controlled vocabulary terms, is likely to prevent mzML from developing into a number of diverging dialects.

An existing open issue is how to support new CV terms. The most robust, and therefore preferable, approach is the "ideal world" scenario, in which a new term is suggested to the CV coordinator, who would verify that the term is indeed novel and not a synonym of an existing term, and if so add the term to the CV by updating the centralised repository that is used by the existing semantic validator. It is claimed that this approach may be objectionable to some, due to the fact that parsing software must be connected to the internet to ensure that the semantic validation is taking place according to the updated CV. This seems a somewhat outdated concern, as software is increasingly developed on the assumption that an internet connection is present. Furthermore, there have been concerns that implementing such an approach is non-trivial. This too seems an improbable argument, as implementing a module to download a centralised CV parameter file would appear to be far less complex than the implementation of tools to process the data once it has been parsed.

Many of these issues could be resolved by implementing a number of parser libraries, such as have been developed for the parsing of systems biology markup language (SBML) [1] in the form of libSBML [2]. libSBML is implemented in a number of programming languages, and as such has been widely adopted by the systems biology software development community, removing the need to implement parsers. This approach allows the software development community to concentrate on tool development, and reduces some of the issues that are found objectionable here.

It is claimed that use of additional ontologies may be useful to supplement the mzML controlled vocabulary.
It is very unclear why ChEBI [3] - an ontology describing chemical entities - may be considered appropriate in these circumstances. mzML concerns itself with mass spectrometry data. Any subsequent identifications of molecules from this data, in which ChEBI terms may be appropriate, are considered separately in other formats such as the forthcoming analysisXML.

The model XML schema

The concepts described in the XML schema are familiar to those who have previously used mzData. Most of the schema appears to satisfactorily cover the requirements for describing mass spectrometry data. There are questions related to the concept of describing samples in the schema as it currently stands.

* The first regards the concept of multidimensional LC/MS/MS experiments, in which an individual sample may be separated by a number of LC steps, generating a number of fractions from a single sample. Each of these subsamples is typically analysed by performing an individual acquisition, thus generating a number of runs and sourceFiles for the original sample. As the mzML schema specifies that there is a 1:1 relationship between mzML and run, it is not clear whether an individual mzML file in this context would contain data from an individual subsample. The run element has a single, optional sampleRef attribute, but may also contain a sourceFileRefList, in which a number of sourceFiles can be referenced. Furthermore, a CV term exists to describe a 'sample batch' (MS:10000053), but it is unclear how the current schema could be used to relate subsamples to an original, pre-fractionation sample. The relationship between the original sample and its subsamples is therefore not unambiguously catered for in the schema as it stands. As such multidimensional LC/MS/MS experiments are becoming increasingly commonplace, it is felt that this is an issue that may need to be addressed.

* The second question regards a related yet separate issue.
In the case of quantitative proteomics (and metabolomics) experiments, a given sample that is analysed with MS is usually a mixture of two or more samples, which are isotopically labelled to allow components to be identified and quantified. Considering the example of a proteomic iTRAQ [4] experiment, the sample that is analysed is a mixture of four (latterly, eight) samples, each of which is labelled with an individual isotopic component. The schema allows for multiple samples to be specified in the sampleList, which is appropriate for such experiments. It is, however, unclear how, or whether, these samples can be annotated in a way that allows mzML to be used in quantitative analyses. In order to do this, any analysis software would need to know both the type of quantitative experiment that was performed (iTRAQ, SILAC [5], ICAT [6], etc.) and how the individual samples were labelled (iTRAQ label 114, C-terminal 18O, etc.). It may be that this meta-data is considered to be outside of the scope of the mzML format. If this meta-data were absent, then such mzML files would be incredibly difficult for third parties to analyse. A similar problem occurs with digestion protocols used to generate the sample. In the case of typical proteomics studies in which the sample is digested before analysis, if meta-data regarding which digestion enzyme was used were not present, this data would be very difficult to analyse with database search engines.

References

[1] The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Hucka M, et al. Bioinformatics 19(4):524-31 (2003).
[2] http://sbml.org/software/libsbml/
[3] ChEBI: a database and ontology for chemical entities of biological interest. Degtyarenko K, et al. Nucleic Acids Res. 36:D344-50 (2008).
[4] Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Ross PL, et al. Mol Cell Proteomics 3(12):1154-69 (2004).
[5] Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics. Ong SE, et al. Mol. Cell. Proteom. 1:376-386 (2002).
[6] Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. Gygi SP, et al. J Proteome Res. 1(1):47-54 (2002).

Reviewer 2

The specification clearly fits the purpose, as there is a need for a broadly adopted standard, both among the MS vendors and tool developers, as well as among users.

The specification is clearly written and at this stage there is no need for any major changes. One potential consideration: The authors have decided to create a comprehensive and expandable standard, which means that the specification itself is quite heavy, especially all the cvParam material. This leads to the usual risk that too restrictive a standard runs into future problems: maintaining CV terms and having very incomplete implementations of parser/writer code in software tools. It will thus be essential to have broad community support for mzML to succeed and grow with the advancing field.

Reviewer 3

I looked at the Word document briefly, but I wasn't able to answer my question about whether mzML supports the storage of MRM (multiple reaction monitoring) data. This is an alternative scanning strategy that records intensities for a small set of specified m/z transitions.