From: Angel P. <an...@ma...> - 2008-01-29 16:08:38
Hi Eric,

Thanks for the forward. I won't be able to attend the call as I will be in transit. My comments below.

On 1/29/08, Eric Deutsch <ede...@sy...> wrote:
>
> *Reviewer 1*
> *Implementation of the format*
>
> The philosophy behind the development of mzML is to combine the flexibility of mzData with the robustness of mzXML. mzData allowed the addition of new controlled vocabulary terms as technologies and concepts develop, but has the disadvantage of allowing inconsistent use of these terms. mzXML is less flexible in terms of allowing the use of controlled vocabulary terms, and has the disadvantage of requiring full schema revisions to keep pace with advancements in mass spectrometry.
>
> mzML proposes solving this issue by releasing a semantic validator with the data format, enforcing rules as to which controlled vocabulary terms may (and must) be used within a given location in the document. This appears a sensible approach, as the dependence on both an XML validator and a supplied semantic validator, along with a managed centralised repository for controlled vocabulary terms, is likely to prevent mzML from developing into a number of diverging dialects.

This is what folks were complaining about in previous threads, and it is a contentious issue. The main criticism, to summarize, is that the current use of cvParams coupled to a special-purpose, custom-built semantic validator is operationally no different from a "hard-coded" and quickly evolving XML schema. Personally I am of the opinion that the cvParam usage has advantages over a quickly evolving schema, but I don't have a good answer for semantic validation of those terms. The only things I can propose (which are answers, but not good ones) are that we: (1) use RDF for the CV instead of OBO, hence dropping the non-standard validator; or (2) move closer to current mzXML practice, putting the very important and slow-changing terms in the schema and leaving terms that are non-essential to capturing mass-spec data as cvParams. The problem with #2 is that "non-essential" means widely different things to different people, so that is a BIG AND LONG conversation. For that reason I am inclined to move to option #1, or at least try it out for size. I am aware of how ridiculous RDF is, BTW. Not a fan, but there seems to be a lot of momentum behind the idea in the W3C and other standards groups. Read that as "we won't be deserted on an island" if we develop around RDF.
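To make the trade-off concrete, here is a rough sketch of the same piece of metadata expressed both ways. The element layout and attribute names are only approximated from the draft, and the accession is illustrative, so read this as a sketch rather than actual 0.9 syntax:

<!-- Option 1: CV-driven, as in the current draft. The schema only sees a
     generic cvParam; the semantic validator has to check that a term of the
     right class appears at this location and that the accession exists in
     the CV. (Layout approximated; accession illustrative.) -->
<scan>
  <cvParam cvRef="MS" accession="MS:1000130" name="positive scan" value=""/>
</scan>

<!-- Option 2: hard-coded, mzXML-style. A plain XML validator is enough
     because the XSD itself enumerates the legal values, but adding any new
     concept means revising and re-releasing the schema. -->
<scan polarity="+"/>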
> An existing open issue is how to support new CV terms. The most robust, and therefore preferable, approach is the "ideal world" scenario, in which a new term is suggested to the CV coordinator, who would verify that the term is indeed novel and not a synonym of an existing term, and if so add the term to the CV by updating the centralised repository that is used by the existing semantic validator. It is claimed that this approach may be objectionable to some, due to the fact that parsing software must be connected to the internet to ensure that the semantic validation is taking place according to the updated CV. This seems a somewhat outdated concern, as software is increasingly written on the assumption that an internet connection is present. Furthermore, there have been concerns that implementing such an approach is non-trivial. This too seems an improbable argument, as implementing a module to download a centralised CV parameter file would appear to be far less complex than the implementation of tools to process the data once it has been parsed.

Has to be part of the MS CV working group process.

> Much of these issues could be resolved by implementing a number of parser libraries, such as have been developed for parsing the systems biology markup language (SBML)1 in the form of libSBML2. libSBML is implemented in a number of programming languages, and as such has been widely taken up by the systems biology software development community, removing the need to implement parsers. This approach allows the software development community to concentrate on tool development, and reduces some of the issues that are found objectionable here.

Never heard of it.

> It is claimed that use of additional ontologies may be useful to supplement the mzML controlled vocabulary. It is very unclear why ChEBI3 - an ontology describing chemical entities - may be considered appropriate in these circumstances. mzML concerns itself with mass spectrometry data. Any subsequent identification of molecules from this data, in which ChEBI terms may be appropriate, is considered separately in other formats such as the forthcoming analysisXML.

Ditto, and also out of scope for mzML.

> *The model XML schema*
>
> The concepts described in the XML schema are familiar to those who have previously used mzData. Most of the schema appears to satisfactorily cover the requirements for describing mass spectrometry data.
>
> There are questions related to the concept of describing samples in the schema as it currently stands.
>
> - The first regards multidimensional LC/MS/MS experiments, in which an individual sample may be separated by a number of LC steps, generating a number of fractions from a single sample. Each of these subsamples is typically analysed in an individual acquisition, thus generating a number of runs and sourceFiles for the original sample. As the mzML schema specifies that there is a 1:1 relationship between mzML and run, it is not clear whether an individual mzML file in this context would contain data from an individual subsample. run has a single, optional sampleRef attribute, but may also contain a sourceFileRefList, in which a number of sourceFiles can be referenced. Furthermore, a CV term exists to describe a 'sample batch' (MS:10000053), but it is unclear how the current schema could be used to relate subsamples to an original, pre-fractionation sample. It is thought, then, that the management of the relationship between the original sample and its subsamples is not unambiguously catered for in the schema as it stands. As such multidimensional LC/MS/MS experiments are becoming increasingly commonplace, it is felt that this is an issue that may need to be addressed.
> - The second question regards a related yet separate issue. In the case of quantitative proteomics (and metabolomics) experiments, a given sample that is analysed with MS is usually a mixture of two or more samples, which are isotopically labelled to allow components to be identified and quantified. Considering the example of a proteomic iTRAQ4 experiment, the sample that is analysed is a mixture of four (latterly, eight) samples, each of which is labelled with an individual isotopic component. The schema allows for multiple samples to be specified in the sampleList, which is appropriate for such experiments. It is however unclear how, or whether, these samples can be annotated in such a way that allows mzML to be used in quantitative analyses. In order to do this, any analysis software would need to know both the type of quantitative experiment that was performed (iTRAQ, SILAC5, ICAT6, etc.) and how the individual samples were labelled (iTRAQ label 114, C-terminal 18O, etc.). It may be that this meta-data is considered to be outside the scope of the mzML format. If this meta-data were absent, then mzML files of this format would be incredibly difficult for third parties to analyse. A similar problem occurs with the digestion protocols used to generate the sample. In the case of typical proteomics studies in which the sample is digested before analysis, if meta-data regarding which digestion enzyme was used were not present, the data would be very difficult to analyse with database search engines.

These two are related and are outside of our scope, IMHO.
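For concreteness, though, here is a rough sketch of what the reviewer seems to be asking for, using the element names cited in the review. The parent-sample userParam and the iTRAQ cvParam below are hypothetical conventions, the accession is a placeholder, and the attribute names are guesses at the draft schema, so none of this should be read as something the 0.9 schema or CV actually defines:

<sampleList count="3">
  <!-- the original, pre-fractionation sample -->
  <sample id="S1" name="pooled lysate"/>
  <!-- one LC fraction of S1; nothing in the draft states this relationship,
       so it is expressed here with a hypothetical userParam convention -->
  <sample id="S1_F07" name="fraction 7 of pooled lysate">
    <userParam name="parent sample" value="S1"/>
  </sample>
  <!-- one iTRAQ channel; the label would need to be a real CV term
       (the accession here is only a placeholder) for software to rely on it -->
  <sample id="S1_114" name="control channel">
    <cvParam cvRef="MS" accession="MS:XXXXXXX" name="iTRAQ label 114" value=""/>
  </sample>
</sampleList>

<run id="R1" sampleRef="S1_F07">
  <!-- one acquisition of one fraction; the run can still reference several
       source files, which is where the 1:1 ambiguity the reviewer raises comes in -->
  <sourceFileRefList count="1">
    <sourceFileRef ref="RAW_F07"/>
  </sourceFileRefList>
</run>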
> *Reviewer 3*
>
> I looked at the Word document briefly, but I wasn't able to answer my question about whether mzML supports the storage of MRM (multiple reaction monitoring) data. This is an alternative scanning strategy that records intensities for a small set of specified m/z transitions.

Was that in there? I thought we had not worked the scheme out for MRM in 0.9?

-angel

--
Angel Pizarro
Director, ITMAT Bioinformatics Facility
806 Biological Research Building
421 Curie Blvd.
Philadelphia, PA 19104-6160
215-573-3736