From: Brian P. <bri...@in...> - 2008-01-29 18:07:35
|
Hello All,

Well, I'm on record as being aghast at the OBO-centric approach, but I can see that battle is all but lost. So, on to pragmatics:

In a world lacking the kind of software development tools available for W3C schema, Reviewer 1's suggestion of the libSBML approach is apt. You'll want to provide a single C/C++ codebase with SWIG-generated bindings to hook it up to other languages like Java, Perl, Python, Ruby, Matlab, etc* so that the various custom parsing and validation methods required don't have to be implemented over and over again. More importantly, in a system as unnecessarily complex as this the risk of errors and ambiguities is sorely elevated, but with a single read/write implementation everybody will at least be responding to those issues the same way.

Darren Kessler's work looks promising as the core for this. Note that I am NOT saying that Darren's work contains errors (it looks quite nice, actually), but he's already finding all the inconsistencies and gaps that a fully specified W3C schema would have largely avoided, and having a single set of responses to these problems will create a more stable world for mzML.

(*Reviewer 1 is incorrect in saying that multiple libSBML implementations exist in multiple languages - there's just one C/C++ implementation with automatically generated language bindings. You still have to compile the C code for the target platform and install the library so the SWIG-generated language bindings can call into it at runtime. Not the dream of drop-in portability that Java and Python promise, but better than nothing. See http://sbml.org/software/libsbml/docs/cpp-api/libsbml-installation.html#other-lang for details.)

Brian

_____
From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
Sent: Tuesday, January 29, 2008 12:29 AM
To: Mass spectrometry standard development
Cc: Eric Deutsch
Subject: [Psidev-ms-dev] mzML reviews are in

Hi everyone, the mzML reviews are back.
Thank you to Norman Paton and the anonymous reviewers. I have pasted the reviews below for your perusal if you are interested. I hope everyone can devote a little time in the next month to making another (and hopefully final) push to get mzML finished. I would like to propose a telephone conference in a week: Tuesday February 5: 09:00 San Francisco 12:00 New York 17:00 London (GMT) 18:00 Europe Let me know if there are concerns about the time. Later this week I will draft up the list of things yet to do (including items from reviews and other things that have emerged in the last two months) and send out an agenda. Thanks, Eric ---------------------------------------------------------------------------- ---------------------- Reviewer 1 Introduction This review concerns Draft of Version 1.0.0 of the specification for the mzML data format developed by the HUPO Proteomics Standards Initiative. mzML is intended as a replacement for the existing XML data formats used for markup of mass spectrometry data, mzData and mzXML. Implementation of the format The philosophy behind the development of mzML is to combine the flexibility of mzData with the robustness of mzXML. mzData allowed the additional of new controlled vocabulary terms as technologies and concepts develop, but has the disadvantage of allowing inconsistent use of these terms. mzXML is less flexible in terms of allowing the use of controlled vocabulary terms, and has the disadvantage of requiring full schema revisions to keep pace with advancements in mass spectrometry. mzML proposes solving this issue by releasing a semantic validator with the data format, enforcing rules as to which controlled vocabularies may (and must) be used within a given location in the document. 
This appears a sensible approach, as the dependence on both an XML and supplied semantic validator, along with a managed centralised repository for controlled vocabulary terms, is likely to prevent mzML from developing into a number of diversifying dialects. An existing open issue is how to support new CV terms. The most robust, and therefore preferable, approach is the "ideal world" scenario, in which a new term is suggested to the CV coordinator, who would verify that the term is indeed novel and not an synonym of an existing term, and if so add the term to the CV by updating the centralised repository that is used by the existing semantic validator. It is claimed that this approach may be objectionable to some, due to the fact that parsing software must be connected to the internet to ensure that the semantic validation is taking place according to the updated CV. This seems a somewhat outdated concern, as software is increasing generated that assumes that an internet connection is present. Furthermore, there have been concerns that implementing such an approach is non-trivial. This too seems an improbable argument, as implementing a module to download a centralised CV parameter file would appear to be far less complex than the implementation of tools to process the data once it has been parsed. Much of these issues could be resolved by implementing a number of parser libraries, such as have been developed for the parsing of systems biology markup language (SBML)1 in the form of libSBML2. libSBML is implemented in a number of programming languages, and as such has been widely uptaken by the systems biology software development community, removing the need to implement parsers. This approach allows the software development community to concentrate on tool development, and reduces some of the issues that are found objectionable here. It is claimed that use of additional ontologies may be useful to supplement the mzML controlled vocabulary. 
It is very unclear why ChEBI3 - an ontology describing chemical entities - may be considered appropriate in these circumstances. mzML concerns itself with mass spectrometry data. Any subsequent identifications of molecules from this data, in which ChEBI terms may be appropriate, is considered separately in other formats such as the forthcoming analysisXML. The model XML schema The concepts described in the XML schema are familiar to those who have previously used mzData. Most of the schema appears to satisfactorily cover the requirements for describing mass spectrometry data. There are questions related to the concept of describing samples in the schema as it currently stands. * The first regards the concept of multidimensional LC/MS/MS experiments, in which an individual sample may be separated by a number of LC steps, generating a number of fractions from a single sample. Each of these subsamples are typically analysed by taking performing an individual acquisition, thus generating a number of runs and sourceFiles for the original sample. As the mzML schema specifies that there is a 1:1 relationship between mzML and run, it is not clear whether an individual mzML file in this context would contain data from an individual subsample. run has a single, optional sampleRef attribute, but also may contain a sourceFileRefList, in which a number of sourceFiles can be referenced. Furthermore, a CV term exists to describe a 'sample batch' (MS:10000053), but it is unclear how the current schema could be used to relate subsamples to an original, pre-fractionation sample. It is thought then that the management of the relationship between original sample and subsamples is not unambiguously catered for in the schema as it stands. As such multidimensional LC/MS/MS experiments are becoming increasingly commonplace, it is felt that this is an issue that may need to be addressed. * The second question regards a related yet separate issue. 
In the case of quantitative proteomics (and metabolomics) experiments, a given sample that is analysed with MS is usually a mixture of two or more samples, which are isotopically labelled to allow components to be identified and quantified. Considering the example of a proteomic iTRAQ4 experiment, the sample that is analysed is a mixture of four (latterly, eight) samples, each of which is labelled with an individual isotopic component. The schema allows for multiple samples to be specified in the sampleList, which is appropriate for such experiments. It is however unclear as to how or if these samples can be annotated in such a way that allows mzML to be used in quantitative analyses. In order to do this, any analysis software would need to know both the type of quantitative experiment that was performed (iTRAQ, SILAC5, ICAT6, etc.) and how the individual samples were labelled (iTRAQ label 114, C-terminal 18O, etc.). It may be that this meta-data is considered to be outside of the scope of the mzML format. If this meta-data were absent, then mzML files of this format would be incredibly difficult to analyse by third parties. A similar problem occurs with digestion protocols used to generate the sample. In the case of typical proteomics studies in which the sample is digested before analysis, if meta-data regarding which digestion enzyme was used were not present, this data would be very difficult to analyse with database search engines. References 1The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Hucka M, et al. Bioinformatics 19(4):524-31 (2003). 2 <http://sbml.org/software/libsbml/> http://sbml.org/software/libsbml/ 3ChEBI: a database and ontology for chemical entities of biological interest. Degtyarenko K, et al. Nucleic Acids Res. 36:D344-50. (2008). 4Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Ross PL, et al. 
Mol Cell Proteomics 3(12):1154-69. (2004). 5Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics. Ong SE, et al. Mol. Cell. Proteom. 1:376-386. (2002). 6Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. Gygi SP, et al. J Proteome Res. 1(1):47-54. (2002). Reviewer 2 The specification clearly fits the purpose, as there is a need for a broadly adopted standard, both among theMS vendors and tool developers, as well as among users. The specification is clearly written and at this stage there is no need for any major changes. One potential consideration: The authors have decided to create a comprehensive and expandable standard, which means that the specification itself is quite heavy, especially all the cvParam material. This leads to the usual problem of having too restrictive standard into potential future problems: maintaining CV terms and having very incomplete implementations of parser/writer code in software tools. It will thus be essential to have a broad community support for mzML to succeed and grow with the advancing field. Reviewer 3 I looked at the Word document briefly, but I wasn't able to answer my question about whether mzML supports the storage of MRM (multiple reaction monitoring) data. This is an alternative scanning strategy that records intensities for a small set of specified m/z transitions. |
From: Angel P. <an...@ma...> - 2008-01-29 16:08:38
|
Hi Eric,

Thanks for the forward. I won't be able to attend the call as I will be in transit. My comments below:

On 1/29/08, Eric Deutsch <ede...@sy...> wrote:
> *Reviewer 1*
> *Implementation of the format*
>
> The philosophy behind the development of mzML is to combine the flexibility of mzData with the robustness of mzXML. mzData allowed the additional of new controlled vocabulary terms as technologies and concepts develop, but has the disadvantage of allowing inconsistent use of these terms. mzXML is less flexible in terms of allowing the use of controlled vocabulary terms, and has the disadvantage of requiring full schema revisions to keep pace with advancements in mass spectrometry.
>
> mzML proposes solving this issue by releasing a semantic validator with the data format, enforcing rules as to which controlled vocabularies may (and must) be used within a given location in the document. This appears a sensible approach, as the dependence on both an XML and supplied semantic validator, along with a managed centralised repository for controlled vocabulary terms, is likely to prevent mzML from developing into a number of diversifying dialects.

This is what folks were complaining about in previous threads and is a contentious issue. The main criticism, to summarize, is that the current use of CVparams coupled to a special purpose and custom built semantic validator is operationally no different than a "hard-coded" and quickly evolving XML schema. Personally I am of the opinion that the CVParam usage has advantages over the quickly evolving schema, but don't have a good answer for semantic validation of said terms.

The only things I can propose (which are answers but not good ones) are that we: (1) use RDF for the CV instead of OBO, hence dropping the non-standard validator; or (2) just move closer to current mzXML practice with some very important and slow-changing terms in the schema, leaving other terms non-essential to capture of mass-spec data as CV param. The problem with #2 is that "non-essential" means widely different things to different people so that is a BIG AND LONG conversation. For that reason I am inclined to move to option #1 or at least try it out for size. I am aware of how ridiculous RDF is BTW. Not a fan, but there seems to be a lot of momentum behind the idea in the W3C and other standards groups. Read that as "we won't be deserted on an island" if we develop around RDF.

> An existing open issue is how to support new CV terms. The most robust, and therefore preferable, approach is the "ideal world" scenario, in which a new term is suggested to the CV coordinator, who would verify that the term is indeed novel and not an synonym of an existing term, and if so add the term to the CV by updating the centralised repository that is used by the existing semantic validator. It is claimed that this approach may be objectionable to some, due to the fact that parsing software must be connected to the internet to ensure that the semantic validation is taking place according to the updated CV. This seems a somewhat outdated concern, as software is increasing generated that assumes that an internet connection is present. Furthermore, there have been concerns that implementing such an approach is non-trivial. This too seems an improbable argument, as implementing a module to download a centralised CV parameter file would appear to be far less complex than the implementation of tools to process the data once it has been parsed.

Has to be part of the MS CV working group process.

> Much of these issues could be resolved by implementing a number of parser libraries, such as have been developed for the parsing of systems biology markup language (SBML)1 in the form of libSBML2. libSBML is implemented in a number of programming languages, and as such has been widely uptaken by the systems biology software development community, removing the need to implement parsers. This approach allows the software development community to concentrate on tool development, and reduces some of the issues that are found objectionable here.

Never heard of it.

> It is claimed that use of additional ontologies may be useful to supplement the mzML controlled vocabulary. It is very unclear why ChEBI3 - an ontology describing chemical entities - may be considered appropriate in these circumstances. mzML concerns itself with mass spectrometry data. Any subsequent identifications of molecules from this data, in which ChEBI terms may be appropriate, is considered separately in other formats such as the forthcoming analysisXML.

ditto and also out of scope for mzML

> *The model XML schema*
>
> The concepts described in the XML schema are familiar to those who have previously used mzData. Most of the schema appears to satisfactorily cover the requirements for describing mass spectrometry data.
>
> There are questions related to the concept of describing samples in the schema as it currently stands.
>
> - The first regards the concept of multidimensional LC/MS/MS experiments, in which an individual sample may be separated by a number of LC steps, generating a number of fractions from a single sample. Each of these subsamples are typically analysed by taking performing an individual acquisition, thus generating a number of runs and sourceFiles for the original sample. As the mzML schema specifies that there is a 1:1 relationship between mzML and run, it is not clear whether an individual mzML file in this context would contain data from an individual subsample. run has a single, optional sampleRef attribute, but also may contain a sourceFileRefList, in which a number of sourceFiles can be referenced. Furthermore, a CV term exists to describe a 'sample batch' (MS:10000053), but it is unclear how the current schema could be used to relate subsamples to an original, pre-fractionation sample. It is thought then that the management of the relationship between original sample and subsamples is not unambiguously catered for in the schema as it stands. As such multidimensional LC/MS/MS experiments are becoming increasingly commonplace, it is felt that this is an issue that may need to be addressed.
> - The second question regards a related yet separate issue. In the case of quantitative proteomics (and metabolomics) experiments, a given sample that is analysed with MS is usually a mixture of two or more samples, which are isotopically labelled to allow components to be identified and quantified. Considering the example of a proteomic iTRAQ4 experiment, the sample that is analysed is a mixture of four (latterly, eight) samples, each of which is labelled with an individual isotopic component. The schema allows for multiple samples to be specified in the sampleList, which is appropriate for such experiments. It is however unclear as to how or if these samples can be annotated in such a way that allows mzML to be used in quantitative analyses. In order to do this, any analysis software would need to know both the type of quantitative experiment that was performed (iTRAQ, SILAC5, ICAT6, etc.) and how the individual samples were labelled (iTRAQ label 114, C-terminal 18O, etc.). It may be that this meta-data is considered to be outside of the scope of the mzML format. If this meta-data were absent, then mzML files of this format would be incredibly difficult to analyse by third parties. A similar problem occurs with digestion protocols used to generate the sample. In the case of typical proteomics studies in which the sample is digested before analysis, if meta-data regarding which digestion enzyme was used were not present, this data would be very difficult to analyse with database search engines.

These two are related and are outside of our scope, IMHO.

> *Reviewer 3*
>
> I looked at the Word document briefly, but I wasn't able to answer my question about whether mzML supports the storage of MRM (multiple reaction monitoring) data. This is an alternative scanning strategy that records intensities for a small set of specified m/z transitions.

Was that in there? I thought we had not worked the scheme out for MRM in 0.9?

-angel

--
Angel Pizarro
Director, ITMAT Bioinformatics Facility
806 Biological Research Building
421 Curie Blvd.
Philadelphia, PA 19104-6160
215-573-3736 |
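To make the semantic validation being debated here a little more concrete, the following is a rough Python sketch of the idea only: a rule table says which CV accessions may appear under which element, and a checker reports cvParam entries that break the rules. It is not the actual PSI validator. The rule table, the `EX:` accessions, and the element names in the sample document are all made up for illustration, and real mzML uses XML namespaces that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> accessions allowed beneath it.
# The "EX:" accessions are placeholders, not real PSI-MS CV terms.
RULES = {
    "sample": {"EX:0001", "EX:0002"},
    "instrument": {"EX:0100"},
}


def check_cv_params(xml_text, rules):
    """Return (element, accession) pairs that violate the rule table."""
    problems = []
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        allowed = rules.get(parent.tag)
        if allowed is None:
            continue  # no rule for this element, so nothing to check
        for param in parent.findall("cvParam"):
            accession = param.get("accession")
            if accession not in allowed:
                problems.append((parent.tag, accession))
    return problems


# Toy document without namespaces; real mzML would need namespace handling.
DOC = """
<doc>
  <sample><cvParam accession="EX:0001" name="ok term"/></sample>
  <instrument><cvParam accession="EX:0001" name="misplaced term"/></instrument>
</doc>
"""

problems = check_cv_params(DOC, RULES)
print(problems)
```

The document above is well-formed XML, so a schema check alone would pass it; only the rule table catches the misplaced term. That is the part of the design the thread is arguing about, since the rules live outside the schema.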
From: Eric D. <ede...@sy...> - 2008-01-29 08:29:30
|
Hi everyone, the mzML reviews are back. Thank you to Norman Paton and the anonymous reviewers. I have pasted the reviews below for your perusal if you are interested. I hope everyone can devote a little time in the next month to making another (and hopefully final) push to get mzML finished. I would like to propose a telephone conference in a week:

Tuesday February 5:

09:00 San Francisco
12:00 New York
17:00 London (GMT)
18:00 Europe

Let me know if there are concerns about the time.

Later this week I will draft up the list of things yet to do (including items from reviews and other things that have emerged in the last two months) and send out an agenda.

Thanks,
Eric

--------------------------------------------------------------------------------------------------

Reviewer 1

Introduction

This review concerns Draft of Version 1.0.0 of the specification for the mzML data format developed by the HUPO Proteomics Standards Initiative. mzML is intended as a replacement for the existing XML data formats used for markup of mass spectrometry data, mzData and mzXML.

Implementation of the format

The philosophy behind the development of mzML is to combine the flexibility of mzData with the robustness of mzXML. mzData allowed the additional of new controlled vocabulary terms as technologies and concepts develop, but has the disadvantage of allowing inconsistent use of these terms. mzXML is less flexible in terms of allowing the use of controlled vocabulary terms, and has the disadvantage of requiring full schema revisions to keep pace with advancements in mass spectrometry.

mzML proposes solving this issue by releasing a semantic validator with the data format, enforcing rules as to which controlled vocabularies may (and must) be used within a given location in the document.
This appears a sensible approach, as the dependence on both an XML and supplied semantic validator, along with a managed centralised repository for controlled vocabulary terms, is likely to prevent mzML from developing into a number of diversifying dialects. An existing open issue is how to support new CV terms. The most robust, and therefore preferable, approach is the "ideal world" scenario, in which a new term is suggested to the CV coordinator, who would verify that the term is indeed novel and not an synonym of an existing term, and if so add the term to the CV by updating the centralised repository that is used by the existing semantic validator. It is claimed that this approach may be objectionable to some, due to the fact that parsing software must be connected to the internet to ensure that the semantic validation is taking place according to the updated CV. This seems a somewhat outdated concern, as software is increasing generated that assumes that an internet connection is present. Furthermore, there have been concerns that implementing such an approach is non-trivial. This too seems an improbable argument, as implementing a module to download a centralised CV parameter file would appear to be far less complex than the implementation of tools to process the data once it has been parsed. Much of these issues could be resolved by implementing a number of parser libraries, such as have been developed for the parsing of systems biology markup language (SBML)1 in the form of libSBML2. libSBML is implemented in a number of programming languages, and as such has been widely uptaken by the systems biology software development community, removing the need to implement parsers. This approach allows the software development community to concentrate on tool development, and reduces some of the issues that are found objectionable here. It is claimed that use of additional ontologies may be useful to supplement the mzML controlled vocabulary. 
It is very unclear why ChEBI3 - an ontology describing chemical entities - may be considered appropriate in these circumstances. mzML concerns itself with mass spectrometry data. Any subsequent identifications of molecules from this data, in which ChEBI terms may be appropriate, is considered separately in other formats such as the forthcoming analysisXML. The model XML schema The concepts described in the XML schema are familiar to those who have previously used mzData. Most of the schema appears to satisfactorily cover the requirements for describing mass spectrometry data. There are questions related to the concept of describing samples in the schema as it currently stands. * The first regards the concept of multidimensional LC/MS/MS experiments, in which an individual sample may be separated by a number of LC steps, generating a number of fractions from a single sample. Each of these subsamples are typically analysed by taking performing an individual acquisition, thus generating a number of runs and sourceFiles for the original sample. As the mzML schema specifies that there is a 1:1 relationship between mzML and run, it is not clear whether an individual mzML file in this context would contain data from an individual subsample. run has a single, optional sampleRef attribute, but also may contain a sourceFileRefList, in which a number of sourceFiles can be referenced. Furthermore, a CV term exists to describe a 'sample batch' (MS:10000053), but it is unclear how the current schema could be used to relate subsamples to an original, pre-fractionation sample. It is thought then that the management of the relationship between original sample and subsamples is not unambiguously catered for in the schema as it stands. As such multidimensional LC/MS/MS experiments are becoming increasingly commonplace, it is felt that this is an issue that may need to be addressed. * The second question regards a related yet separate issue. 
In the case of quantitative proteomics (and metabolomics) experiments, a given sample that is analysed with MS is usually a mixture of two or more samples, which are isotopically labelled to allow components to be identified and quantified. Considering the example of a proteomic iTRAQ4 experiment, the sample that is analysed is a mixture of four (latterly, eight) samples, each of which is labelled with an individual isotopic component. The schema allows for multiple samples to be specified in the sampleList, which is appropriate for such experiments. It is however unclear as to how or if these samples can be annotated in such a way that allows mzML to be used in quantitative analyses. In order to do this, any analysis software would need to know both the type of quantitative experiment that was performed (iTRAQ, SILAC5, ICAT6, etc.) and how the individual samples were labelled (iTRAQ label 114, C-terminal 18O, etc.). It may be that this meta-data is considered to be outside of the scope of the mzML format. If this meta-data were absent, then mzML files of this format would be incredibly difficult to analyse by third parties. A similar problem occurs with digestion protocols used to generate the sample. In the case of typical proteomics studies in which the sample is digested before analysis, if meta-data regarding which digestion enzyme was used were not present, this data would be very difficult to analyse with database search engines.

References

1. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Hucka M, et al. Bioinformatics 19(4):524-31 (2003).
2. http://sbml.org/software/libsbml/
3. ChEBI: a database and ontology for chemical entities of biological interest. Degtyarenko K, et al. Nucleic Acids Res. 36:D344-50 (2008).
4. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Ross PL, et al. Mol Cell Proteomics 3(12):1154-69 (2004).
5. Stable Isotope Labeling by Amino Acids in Cell Culture, SILAC, as a Simple and Accurate Approach to Expression Proteomics. Ong SE, et al. Mol. Cell. Proteom. 1:376-386 (2002).
6. Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. Gygi SP, et al. J Proteome Res. 1(1):47-54 (2002).

Reviewer 2

The specification clearly fits the purpose, as there is a need for a broadly adopted standard, both among the MS vendors and tool developers, as well as among users.

The specification is clearly written and at this stage there is no need for any major changes. One potential consideration: The authors have decided to create a comprehensive and expandable standard, which means that the specification itself is quite heavy, especially all the cvParam material. This leads to the usual problem of having too restrictive standard into potential future problems: maintaining CV terms and having very incomplete implementations of parser/writer code in software tools. It will thus be essential to have a broad community support for mzML to succeed and grow with the advancing field.

Reviewer 3

I looked at the Word document briefly, but I wasn't able to answer my question about whether mzML supports the storage of MRM (multiple reaction monitoring) data. This is an alternative scanning strategy that records intensities for a small set of specified m/z transitions. |
From: Kessner, D. E. <Dar...@cs...> - 2008-01-28 19:46:34
|
Hi all,

Just an FYI, and something to check before the next mzML file kit release.

I ran some tests of our MSData parser on the example file 1min.mzML, and found a couple of things:
1) the spectrum indexing is not quite right (by about 22 bytes on every <offset>)
2) the binary data encoding is big endian (network), while the specification calls for little endian encoding

I'll have a working mzML<->mzXML converter shortly, so it will be easy to create some more example files.

Darren

Darren Kessner
Scientific Programmer
Dar...@cs...
310-423-9538

Spielberg Family Center for Applied Proteomics
Cedars-Sinai Medical Center
http://www.sfcap.cshs.org/

IMPORTANT WARNING: This message is intended for the use of the person or entity to which it is addressed and may contain information that is privileged and confidential, the disclosure of which is governed by applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this information is STRICTLY PROHIBITED. If you have received this message in error, please notify us immediately by calling (310) 423-6428 and destroy the related message. Thank You for your cooperation. |
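The byte-order mismatch Darren reports can be illustrated with a small sketch. It assumes the binary arrays are base64-encoded, packed 32-bit floats; the function names and the "plausible m/z range" heuristic are hypothetical and are not taken from the MSData parser or the mzML specification.

```python
import base64
import math
import struct


def decode_floats(b64_text: str, little_endian: bool = True) -> list[float]:
    """Decode a base64 string of packed 32-bit floats in the given byte order."""
    raw = base64.b64decode(b64_text)
    count = len(raw) // 4
    prefix = "<" if little_endian else ">"
    return list(struct.unpack(f"{prefix}{count}f", raw[: count * 4]))


def looks_byte_swapped(b64_text: str, lo: float = 1.0, hi: float = 1e6) -> bool:
    """Heuristic (an assumption): values are implausible when read as
    little endian but plausible when read as big endian."""

    def plausible(values: list[float]) -> bool:
        return all(math.isfinite(v) and lo <= v <= hi for v in values)

    as_little = decode_floats(b64_text, little_endian=True)
    as_big = decode_floats(b64_text, little_endian=False)
    return not plausible(as_little) and plausible(as_big)


if __name__ == "__main__":
    # A writer that used network (big endian) order by mistake.
    wrong = base64.b64encode(struct.pack(">3f", 100.0, 200.5, 300.25)).decode()
    print(decode_floats(wrong, little_endian=True))   # nonsense values
    print(decode_floats(wrong, little_endian=False))  # the intended values
    print(looks_byte_swapped(wrong))
```

A range check like this only flags suspicious files; it cannot prove which byte order a writer used.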
From: Brian P. <bri...@in...> - 2008-01-24 18:58:04
|
No worries. I'm in complete agreement about preferring simplicity - which is why I don't think much of mzML's convoluted everything's-a-cvParam design. But I'm afraid you're right about it being too late to make significant changes. The train appears to be limping out of the station already.

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Mike Coleman
Sent: Thursday, January 24, 2008 10:42 AM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML indexing

Sorry--didn't mean to scratch a sore point. I just meant to express a general preference, which is, if something can be simplified or left out, please do so. (I realize that the time for changes has largely passed.)

Mike

On Jan 24, 2008 10:53 AM, Brian Pratt <bri...@in...> wrote:
> Well, let's not reopen the index debate again (there have been many skirmishes). It's been left as an optional item so those who detest the idea don't have to mess with it. In practice this means a parser that does choose to exploit the index also has to be ready to possibly derive it (and of course deal with the distinct possibility that it's bogus due to CRLF issues etc).
>
> Brian
>
> -----Original Message-----
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Mike Coleman
> Sent: Thursday, January 24, 2008 8:14 AM
> To: Mass spectrometry standard development
> Subject: Re: [Psidev-ms-dev] mzML indexing
>
> For what it's worth, I'd be in favor of removing *all* redundant information from the format (with the possible exception of a checksum). This would include the index, derivable counts, and anything else that can be determined by inspection.
>
> The general argument for doing this would be that it would eliminate a whole class of design decisions of the form "What do I do if thing A and thing B, which by definition are supposed to be consistent, are not?" It's easy to say "just reject the file", but in reality, we won't be able to do that. That leaves us with writing code to try to correct the inconsistencies, for all of the different ways that they occur across different producers and different versions thereof, and arguably that will be more complex than the code that, for example, does things like build indices in the first place.
>
> Mike
>
> On Jan 24, 2008 9:47 AM, Brian Pratt <bri...@in...> wrote:
> > >> 6) Regarding the length attribute in offset, I am neutral on this.
> > >> This makes it a little harder for the writers. I can see that it
> > >> would be easier for random access readers. Darren says he's not
> > >> interested in it.
> > >> Anyone else out there want to lobby for it?
> >
> > Not me. I never felt the need for it in mzXML - it's derivable anyway as length[n]=offset[n+1]-offset[n], or EOF-offset[n] when n==nmax. Keep it simple.
> >
> > -----Original Message-----
> > From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
> > Sent: Wednesday, January 23, 2008 7:40 PM
> > To: Mass spectrometry standard development
> > Cc: Eric Deutsch
> > Subject: Re: [Psidev-ms-dev] mzML indexing
> >
> > Hi Darren et al. for this discussion. A few points from me:
> >
> > 1) It was decided (although not set in stone) that we would like to have a unique id per spectrum per file for two reasons:
> >
> > a) At some point in the future if we have multiple runs per file (not supported in this first release) it will continue to be true that a spectrum id must be unique within a file. We decided that external references from analysisXML, for example, should be to a unique id rather than a scan number.
> >
> > b) We also left the door open to use LSIDs for the id. It can be any unique string, and thus if someone wants to use LSIDs for this, the door is open.
> >
> > 2) Also, according to the docs, the precursor scan references are by id rather than by scan number (although this appears to be incorrectly represented in tiny1.mzML: FIXME). I think the spectrumRef should be to an id and thus it needs to be in the index for things to work nicely. I see that Matt disagrees that spectrumRef should be to an id.
> >
> > 3) It is true that the current examples only have scanNumber in the index and not id. This should also be fixed before release.
> >
> > 4) The spec document indeed currently says that scanNumbers in the file must be in ascending order, but not necessarily sequential. The comment could perhaps be a little more clearly written.
> >
> > 5) I also do not see the need for the index attribute. I think it should be left out, but if there is still a clear need, we could add.
> >
> > 6) Regarding the length attribute in offset, I am neutral on this. This makes it a little harder for the writers. I can see that it would be easier for random access readers. Darren says he's not interested in it. Anyone else out there want to lobby for it?
> >
> > 7) Regarding going fully attribute: the intent was to preserve the format of the mzXML index as closely as possible to reduce coding work, but a better syntax could be entertained. I wouldn't object to:
> >
> > <offset scanNumber="19" id="S19" byteOffset="3512"/>
> >
> > 8) Regarding enforcing some of these index rules, we should add them to the validator so that validator will do that.
> >
> > Comments on these items?
> >
> > Thanks,
> > Eric
> >
> > > From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Kessner, Darren E.
> > > Sent: Wednesday, January 23, 2008 11:19 AM
> > > To: Mass spectrometry standard development
> > > Subject: Re: [Psidev-ms-dev] mzML indexing
> > >
> > > Right -- that's why I included the alternative, though I could have been less terse:
> > >
> > > "The alternative is to require that the <index> entries are written in the same order as the <spectrumList> entries."
> > >
> > > I don't know if there is a way to enforce this...
> > >
> > > Darren
> > >
> > > ________________________________
> > >
> > > From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Brian Pratt
> > > Sent: Wednesday, January 23, 2008 11:08 AM
> > > To: 'Mass spectrometry standard development'
> > > Subject: Re: [Psidev-ms-dev] mzML indexing
> > >
> > > Hi Darren,
> > >
> > > I wonder about this possibility:
> > >
> > > <index name="spectrum">
> > > <offset index="0" scanNumber="19" id="S19">3512</offset>
> > > <offset index="2" scanNumber="23" id="S23">16217</offset>
> > > <offset index="4" scanNumber="25" id="S25">17258</offset>
> > > ...
> > > </index>
> > >
> > > If the response is "well, that's not legal, the index values must increase in increments of 1 starting from 0" then I don't see why it's needed in the first place - I'd expect that the index would just get snarfed up into an array and you'd access the nth element to get info on the nth scan appearing in the file. And if it is legal then I don't see what it's for...
> > >
> > > Brian
> > >
> > > ________________________________
> > >
> > > From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Kessner, Darren E.
> > > Sent: Wednesday, January 23, 2008 10:33 AM
> > > To: Mass spectrometry standard development
> > > Subject: [Psidev-ms-dev] mzML indexing
> > >
> > > Hi all,
> > >
> > > There are three ways to refer to a <spectrum> element -- by zero-based index into the <spectrumList>, by 'scanNumber', and by 'id'. However, the <index> currently only contains scanNumber. I would like to encode the zero-based index and the id as well in the <index> as follows:
> > >
> > > <index name="spectrum">
> > > <offset index="0" scanNumber="19" id="S19">3512</offset>
> > > <offset index="1" scanNumber="20" id="S20">16217</offset>
> > > ...
> > > </index>
> > >
> > > Including the zero-based index is important to enable random access to the mzML file when you don't know what scan numbers are contained in the file. The alternative is to require that the <index> entries are written in the same order as the <spectrumList> entries.
> > >
> > > Including the 'id' in the <index> entries is necessary for efficiently dereferencing a "spectrumRef" (e.g. in <precursor> element). Without this, a dereference requires reading through the <spectrumList> to find the right 'id'. This info could be read once and cached, but this still defeats the purpose of indexing.
> > >
> > > Darren
> > >
> > > Darren Kessner
> > > Scientific Programmer
> > > Dar...@cs...
> > > 310-423-9538
> > >
> > > Spielberg Family Center for Applied Proteomics
> > > Cedars-Sinai Medical Center
> > > http://www.sfcap.cshs.org/

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Psidev-ms-dev mailing list
Psi...@li...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
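The index ideas in this thread can be sketched in a few lines. The example below assumes the all-attribute `<offset>` form that Eric floats in point 7 (a proposal in the discussion, not a confirmed part of the released schema) and applies Brian's length derivation. `load_index`, `derive_lengths`, and `end_offset` are hypothetical names; `end_offset` stands in for EOF in Brian's formula.

```python
import xml.etree.ElementTree as ET

# Sample index using the attribute-only syntax suggested in the thread.
INDEX_XML = """
<index name="spectrum">
  <offset scanNumber="19" id="S19" byteOffset="3512"/>
  <offset scanNumber="20" id="S20" byteOffset="16217"/>
  <offset scanNumber="23" id="S23" byteOffset="17258"/>
</index>
"""


def load_index(index_xml: str):
    """Return entries in written order, plus lookups by id and by scanNumber.

    The position of an entry in the list serves as the zero-based index,
    provided the writer emits entries in <spectrumList> order.
    """
    root = ET.fromstring(index_xml)
    entries = [
        {
            "id": e.get("id"),
            "scanNumber": int(e.get("scanNumber")),
            "offset": int(e.get("byteOffset")),
        }
        for e in root.findall("offset")
    ]
    by_id = {e["id"]: pos for pos, e in enumerate(entries)}
    by_scan = {e["scanNumber"]: pos for pos, e in enumerate(entries)}
    return entries, by_id, by_scan


def derive_lengths(offsets: list[int], end_offset: int) -> list[int]:
    """length[n] = offset[n+1] - offset[n]; the last entry runs to end_offset."""
    bounds = list(offsets) + [end_offset]
    return [bounds[n + 1] - bounds[n] for n in range(len(offsets))]


if __name__ == "__main__":
    entries, by_id, by_scan = load_index(INDEX_XML)
    offsets = [e["offset"] for e in entries]
    print(entries[by_id["S23"]])              # dereference a spectrumRef by id
    print(derive_lengths(offsets, 20000))     # 20000 is a made-up end offset
```

With entries kept in written order, the three access paths (position, scanNumber, id) reduce to one list and two dictionaries, which is the kind of "array you access the nth element of" that Brian describes.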
From: Mike C. <tu...@gm...> - 2008-01-24 18:42:04
|
Sorry--didn't mean to scratch a sore point. I just meant to express a general preference, which is, if something can be simplified or left out, please do so. (I realize that the time for changes has largely passed.)

Mike

On Jan 24, 2008 10:53 AM, Brian Pratt <bri...@in...> wrote:
> Well, let's not reopen the index debate again (there have been many skirmishes). It's been left as an optional item so those who detest the idea don't have to mess with it. In practice this means a parser that does choose to exploit the index also has to be ready to possibly derive it (and of course deal with the distinct possibility that it's bogus due to CRLF issues etc).
>
> Brian |
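Brian's point that a reader using the optional index must also be ready to derive it can be sketched as follows. This is a simplified illustration, not how any existing mzML reader works: it assumes each `<spectrum>` opening tag carries an `id` attribute and that a regular-expression scan over the raw bytes is acceptable. `build_index` and `index_is_trustworthy` are hypothetical names.

```python
import re

# Matches an opening <spectrum ...> tag (not <spectrumList>) and captures its id.
SPECTRUM_OPEN = re.compile(rb'<spectrum\s[^>]*\bid="([^"]*)"')


def build_index(data: bytes) -> dict[str, int]:
    """Derive {spectrum id: byte offset} by scanning the raw file bytes."""
    return {m.group(1).decode(): m.start() for m in SPECTRUM_OPEN.finditer(data)}


def index_is_trustworthy(data: bytes, stored: dict[str, int]) -> bool:
    """Check that every stored offset points at the opening tag it claims."""
    for spectrum_id, offset in stored.items():
        m = SPECTRUM_OPEN.match(data, offset)
        if m is None or m.group(1).decode() != spectrum_id:
            return False
    return True


if __name__ == "__main__":
    doc_lf = (
        b'<spectrumList count="2">\n'
        b'<spectrum id="S19" scanNumber="19"></spectrum>\n'
        b'<spectrum id="S20" scanNumber="20"></spectrum>\n'
        b"</spectrumList>\n"
    )
    stored = build_index(doc_lf)

    # The same file after LF -> CRLF conversion: the stored offsets go stale.
    doc_crlf = doc_lf.replace(b"\n", b"\r\n")
    index = stored if index_is_trustworthy(doc_crlf, stored) else build_index(doc_crlf)
    print(index)
```

The fallback is only a few lines here, but it is exactly the extra code path Mike is wary of: the reader has to decide what to do whenever the stored index and the file disagree.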
From: Mike C. <tu...@gm...> - 2008-01-24 18:40:11
|
I see the checksum as a special case, kind of in the same way that "one" is a special case when asking whether numbers are prime or not. The checksum is redundant in the sense that it can always be regenerated by examining the file. Unlike other redundant items, though, the regenerated value is of no use as a replacement for the originally generated value, because it does not answer the question "Has this file been corrupted?". So, I could see making an exception for it.

I do think, given the realities, that a checksum is useful--the alternative would be to put it in a separate file that rides alongside, kind of like PGP signatures. I don't have strong feelings either way, though.

The nice thing about leaving out redundant items is that it makes it easier to do a "diff" without having to wade through all of the spurious changes in redundant items.

Mike

On Jan 24, 2008 10:49 AM, Matthew Chambers <mat...@va...> wrote:
> A file-based checksum is not redundant if the file has changed, so I guess that's why you make an exception? The point of having the checksum in there in the first place is to allow for consistency guarantees that guide the design decisions you refer to. I've been confronted with those decisions myself and I think it's easy enough to ignore the redundant index information if the software's design does not account for it. But it's undeniably very useful for designs that do account for it. My current random-access XML readers ignore the fact that mzML and mzXML support built-in indexes and build the index from scratch every time. Eventually, I'll add support to allow for trusting existing indexes. The design jump from sequential XML parsing to random-access XML parsing is far more treacherous than the jump from always building an index to sometimes building one and sometimes not.
> > -Matt > > > Mike Coleman wrote: > > For what it's worth, I'd be in favor of removing *all* redundant > > information from the format (with the possible exception of a > > checksum). This would include the index, derivable counts, and > > anything else that can be determined by inspection. > > > > The general argument for doing this would be that it would eliminate a > > whole class of design decisions of the form "What do I if thing A and > > thing B, which by definition are supposed to be consistent, are not?" > > It's easy to say "just reject the file", but in reality, we won't be > > able to do that. That leaves us with writing code to try to correct > > the inconsistencies, for all of the different ways that they occur > > across different producers and different versions thereof, and > > arguably that will be more complex than the code that, for example, > > does things like build indices in the first place. > > > > Mike > > > > > > > > > On Jan 24, 2008 9:47 AM, Brian Pratt <bri...@in...> wrote: > > > >>>> 6) Regarding the length attribute in offset, I am neutral on this. > >>>> This makes it a little harder for the writers. I can see that it > >>>> would be easier for random access readers. Darren says he's not > >>>> interested in it. > >>>> Anyone else out there want to lobby for it? > >>>> > >> Not me. I never felt the need for it in mzXML - it's derivable anyway as > >> length[n]=offset[n+1]-offset[n], or EOF-offset[n] when n==nmax. Keep it > >> simple. > >> > >> -----Original Message----- > >> From: psi...@li... > >> [mailto:psi...@li...] On Behalf Of Eric > >> Deutsch > >> Sent: Wednesday, January 23, 2008 7:40 PM > >> To: Mass spectrometry standard development > >> > >> Cc: Eric Deutsch > >> Subject: Re: [Psidev-ms-dev] mzML indexing > >> > >> > >> Hi Darren et al. for this discussion. 
A few points from me:
> >>
> >> 1) It was decided (although not set in stone) that we would like to have
> >> a unique id per spectrum per file for two reasons:
> >>
> >> a) At some point in the future if we have multiple runs per file (not
> >> supported in this first release) it will continue to be true that a
> >> spectrum id must be unique within a file. We decided that external
> >> references from analysisXML, for example, should be to a unique id
> >> rather than a scan number.
> >>
> >> b) We also left the door open to use LSIDs for the id. It can be any
> >> unique string, and thus if someone wants to use LSIDs for this, the door
> >> is open.
> >>
> >> 2) Also, according to the docs, the precursor scan references are by id
> >> rather than by scan number (although this appears to be incorrectly
> >> represented in tiny1.mzML: FIXME). I think the spectrumRef should be to
> >> an id and thus it needs to be in the index for things to work nicely. I
> >> see that Matt disagrees that spectrumRef should be to an id.
> >>
> >> 3) It is true that the current examples only have scanNumber in the
> >> index and not id. This should also be fixed before release.
> >>
> >> 4) The spec document indeed currently says that scanNumbers in the file
> >> must be in ascending order, but not necessarily sequential. The comment
> >> could perhaps be a little more clearly written.
> >>
> >> 5) I also do not see the need for the index attribute. I think it should
> >> be left out, but if there is still a clear need, we could add.
> >>
> >> 6) Regarding the length attribute in offset, I am neutral on this. This
> >> makes it a little harder for the writers. I can see that it would be
> >> easier for random access readers. Darren says he's not interested in it.
> >> Anyone else out there want to lobby for it?
> >>
> >> 7) Regarding going fully attribute: the intent was to preserve the
> >> format of the mzXML index as closely as possible to reduce coding work,
> >> but a better syntax could be entertained. I wouldn't object to:
> >>
> >> <offset scanNumber="19" id="S19" byteOffset="3512"/>
> >>
> >> 8) Regarding enforcing some of these index rules, we should add them to
> >> the validator so that validator will do that.
> >>
> >> Comments on these items?
> >>
> >> Thanks,
> >> Eric
> >>
> >>> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Kessner, Darren E.
> >>> Sent: Wednesday, January 23, 2008 11:19 AM
> >>> To: Mass spectrometry standard development
> >>> Subject: Re: [Psidev-ms-dev] mzML indexing
> >>>
> >>> Right -- that's why I included the alternative, though I could have
> >>> been less terse:
> >>>
> >>> "The alternative is to require that the <index> entries are written in
> >>> the same order as the <spectrumList> entries."
> >>>
> >>> I don't know if there is a way to enforce this...
> >>>
> >>> Darren
> >>>
> >>> ________________________________
> >>>
> >>> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Brian Pratt
> >>> Sent: Wednesday, January 23, 2008 11:08 AM
> >>> To: 'Mass spectrometry standard development'
> >>> Subject: Re: [Psidev-ms-dev] mzML indexing
> >>>
> >>> Hi Darren,
> >>>
> >>> I wonder about this possibility:
> >>>
> >>> <index name="spectrum" >
> >>> <offset index="0" scanNumber="19" id="S19">3512</offset>
> >>> <offset index="2" scanNumber="23" id="S23">16217</offset>
> >>> <offset index="4" scanNumber="25" id="S25">17258</offset>
> >>> ...
> >>> </index>
> >>>
> >>> If the response is "well, that's not legal, the index values must
> >>> increase in increments of 1 starting from 0" then I don't see why it's
> >>> needed in the first place - I'd expect that the index would just get
> >>> snarfed up into an array and you'd access the nth element to get info
> >>> on the nth scan appearing in the file. And if it is legal then I don't
> >>> see what it's for...
> >>>
> >>> Brian
> >>>
> >>> ________________________________
> >>>
> >>> From: psi...@li... [mailto:psidev-ms-dev-bo...@li...] On Behalf Of Kessner, Darren E.
> >>> Sent: Wednesday, January 23, 2008 10:33 AM
> >>> To: Mass spectrometry standard development
> >>> Subject: [Psidev-ms-dev] mzML indexing
> >>>
> >>> Hi all,
> >>>
> >>> There are three ways to refer to a <spectrum> element -- by zero-based
> >>> index into the <spectrumList>, by 'scanNumber', and by 'id'. However,
> >>> the <index> currently only contains scanNumber. I would like to encode
> >>> the zero-based index and the id as well in the <index> as follows:
> >>>
> >>> <index name="spectrum" >
> >>> <offset index="0" scanNumber="19" id="S19">3512</offset>
> >>> <offset index="1" scanNumber="20" id="S20">16217</offset>
> >>> ...
> >>> </index>
> >>>
> >>> Including the zero-based index is important to enable random access to
> >>> the mzML file when you don't know what scan numbers are contained in
> >>> the file. The alternative is to require that the <index> entries are
> >>> written in the same order as the <spectrumList> entries.
> >>>
> >>> Including the 'id' in the <index> entries is necessary for efficiently
> >>> dereferencing a "spectrumRef" (e.g. in <precursor> element). Without
> >>> this, a dereference requires reading through the <spectrumList> to
> >>> find the right 'id'. This info could be read once and cached, but this
> >>> still defeats the purpose of indexing.
> >>>
> >>> Darren
> >>>
> >>> Darren Kessner
> >>> Scientific Programmer
> >>> Dar...@cs...
> >>> 310-423-9538
> >>>
> >>> Spielberg Family Center for Applied Proteomics
> >>> Cedars-Sinai Medical Center
> >>> http://www.sfcap.cshs.org/
> >>>
> >>> IMPORTANT WARNING: This message is intended for the use of the person
> >>> or entity to which it is addressed and may contain information that is
> >>> privileged and confidential, the disclosure of which is governed by
> >>> applicable law. If the reader of this message is not the intended
> >>> recipient, or the employee or agent responsible for delivering it to
> >>> the intended recipient, you are hereby notified that any dissemination,
> >>> distribution or copying of this information is STRICTLY PROHIBITED.
> >>>
> >>> If you have received this message in error, please notify us
> >>> immediately by calling (310) 423-6428 and destroy the related message.
> >>> Thank You for your cooperation.
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Psidev-ms-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
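To make the three addressing schemes in this thread concrete (position in the spectrumList, scanNumber, and id), here is a minimal Python sketch. It assumes the attribute-only `<offset>` form floated in point 7, which is a proposal from the discussion and not a released schema. The function name `load_index` and the sample values are illustrative only. Position is taken from entry order, on the assumption that index entries are written in spectrumList order.

```python
import xml.etree.ElementTree as ET

# Hypothetical index fragment in the attribute-only form from point 7.
INDEX_XML = """
<index name="spectrum">
  <offset scanNumber="19" id="S19" byteOffset="3512"/>
  <offset scanNumber="20" id="S20" byteOffset="16217"/>
  <offset scanNumber="25" id="S25" byteOffset="17258"/>
</index>
"""

def load_index(xml_text):
    """Return (offsets, by_id, by_scan) built from an <index> fragment."""
    offsets = []   # position n -> byte offset of the nth spectrum
    by_id = {}     # spectrum id -> position
    by_scan = {}   # scan number -> position
    root = ET.fromstring(xml_text)
    for position, entry in enumerate(root.iter("offset")):
        offsets.append(int(entry.get("byteOffset")))
        by_id[entry.get("id")] = position
        by_scan[int(entry.get("scanNumber"))] = position
    return offsets, by_id, by_scan

offsets, by_id, by_scan = load_index(INDEX_XML)
```

With the entries kept in file order, the explicit index attribute becomes unnecessary: the nth element of `offsets` is the nth spectrum, while `by_id` and `by_scan` cover spectrumRef and scan-number lookups.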
From: Kessner, D. E. <Dar...@cs...> - 2008-01-24 17:49:45
|
Hi Eric,

Thank you for the summary, and for requesting comments.

1, 2, 3) Thank you for the clarifications on the unique id. I definitely support using the unique id for identifying spectra.

5, 8) I'm fine with leaving out the index attribute, as long as it's documented (and enforced) that the index entries follow the same order as the spectrum list entries.

7) Regarding going fully attribute: doesn't change the amount of coding -- any changes to this are easy to make.

Darren |
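The ordering rule in "5, 8)" is the kind of check a validator could enforce. The following Python sketch shows one way it might look, under assumptions: the function name `check_index_order` and the input shapes (a list of spectrum ids in file order, and a list of `(id, byte_offset)` pairs from the index) are hypothetical and not taken from any existing validator.

```python
def check_index_order(spectrum_ids, index_entries):
    """Return a list of problems; an empty list means the index mirrors the spectrum list."""
    problems = []
    index_ids = [entry_id for entry_id, _ in index_entries]
    if index_ids != list(spectrum_ids):
        problems.append("index ids do not match spectrum list order")
    offsets = [offset for _, offset in index_entries]
    # Entries written in file order must have strictly increasing byte offsets.
    if any(later <= earlier for earlier, later in zip(offsets, offsets[1:])):
        problems.append("byte offsets are not strictly increasing")
    return problems

ok = check_index_order(["S19", "S20"], [("S19", 3512), ("S20", 16217)])
bad = check_index_order(["S19", "S20"], [("S20", 16217), ("S19", 3512)])
```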
From: Brian P. <bri...@in...> - 2008-01-24 17:05:31
|
Assuming you're doing SAX (streaming) style parsing, grabbing a little extra text after the stuff you want won't offend the parser as long as you stop parsing at the tag that balances the one opening the fragment, which is what you'd do anyway.

If you try to use a DOM style parser on a fragment, even an exactly balanced one, I suspect there would be trouble since it lacks the anticipated root structure. But nobody uses DOM parsers with mzXML/mzData/mzML anyway; streaming just makes more sense with files this size.

So, no worries! I think. Certainly it hasn't been a problem in mzXML world.

Brian

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Thursday, January 24, 2008 8:04 AM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML indexing

Brian, you're assuming that no junk (comments, user params, whatever) is allowed between the spectrum elements. Otherwise you'd get a fragment of XML that isn't very parser friendly (it would be multi-rooted). Is the no-junk assumption safe?

As for the spectrumRef, unless we plan to support referencing a spectrum outside the current run as a precursor, the scan number makes more sense than the id. If we DO plan to support that, I would support referencing the id instead, but it would definitely need to go into the index at that point.

-Matt

Brian Pratt wrote:
>>> 6) Regarding the length attribute in offset, I am neutral on this.
>>> This makes it a little harder for the writers. I can see that it
>>> would be easier for random access readers. Darren says he's not
>>> interested in it.
>>> Anyone else out there want to lobby for it?
>
> Not me. I never felt the need for it in mzXML - it's derivable anyway as
> length[n]=offset[n+1]-offset[n], or EOF-offset[n] when n==nmax. Keep it
> simple. |
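The length derivation quoted above, length[n]=offset[n+1]-offset[n] with EOF-offset[n] for the last entry, can be sketched in a few lines of Python. The name `entry_lengths` and the sample numbers are illustrative assumptions. Note that if anything follows the last spectrum in the file, the final length runs past the spectrum's closing tag, which is the "little extra text" case a streaming parser is said to tolerate.

```python
def entry_lengths(offsets, end_of_file):
    """Derive each entry's byte length from consecutive offsets.

    offsets      -- byte offsets in file order
    end_of_file  -- total file size in bytes, used for the last entry
    """
    offsets = list(offsets)
    ends = offsets[1:] + [end_of_file]
    return [end - start for start, end in zip(offsets, ends)]

lengths = entry_lengths([3512, 16217, 17258], 20000)
```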
From: Brian P. <bri...@in...> - 2008-01-24 16:54:10
|
Well, let's not reopen the index debate again (there have been many skirmishes). It's been left as an optional item so those who detest the idea don't have to mess with it. In practice this means a parser that does choose to exploit the index also has to be ready to possibly derive it (and of course deal with the distinct possibility that it's bogus due to CRLF issues etc).

Brian

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Mike Coleman
Sent: Thursday, January 24, 2008 8:14 AM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML indexing

For what it's worth, I'd be in favor of removing *all* redundant information from the format (with the possible exception of a checksum). This would include the index, derivable counts, and anything else that can be determined by inspection.

The general argument for doing this would be that it would eliminate a whole class of design decisions of the form "What do I do if thing A and thing B, which by definition are supposed to be consistent, are not?" It's easy to say "just reject the file", but in reality, we won't be able to do that. That leaves us with writing code to try to correct the inconsistencies, for all of the different ways that they occur across different producers and different versions thereof, and arguably that will be more complex than the code that, for example, does things like build indices in the first place.

Mike |
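Since the index is optional, a reader that wants random access has to be able to derive it. Below is a rough Python sketch of one way to do that by scanning raw bytes for `<spectrum>` start tags. The regular expression, the name `build_index`, and the sample document are assumptions for illustration; a production reader would need to handle comments, CDATA, and other encodings. The last lines illustrate the CRLF problem mentioned above: the same content with different line endings yields different byte offsets, so a stored index goes stale.

```python
import re

# Matches the opening tag of a <spectrum> element (not <spectrumList>)
# and captures its id attribute.
SPECTRUM_START = re.compile(rb'<spectrum\s[^>]*?\bid="([^"]*)"')

def build_index(data):
    """Map spectrum id -> byte offset of its opening tag."""
    return {m.group(1).decode("utf-8"): m.start()
            for m in SPECTRUM_START.finditer(data)}

SAMPLE = (
    b'<spectrumList count="2">\n'
    b'  <spectrum id="S19" scanNumber="19">...</spectrum>\n'
    b'  <spectrum id="S20" scanNumber="20">...</spectrum>\n'
    b'</spectrumList>\n'
)

index = build_index(SAMPLE)
# Same text after an LF -> CRLF conversion: every offset shifts.
crlf_index = build_index(SAMPLE.replace(b"\n", b"\r\n"))
```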
From: Matthew C. <mat...@va...> - 2008-01-24 16:49:46
|
A file-based checksum is not redundant if the file has changed, so I guess that's why you make an exception? The point of having the checksum in there in the first place is to allow for consistency guarantees that guide the design decisions you refer to. I've been confronted with those decisions myself and I think it's easy enough to ignore the redundant index information if the software's design does not account for it. But it's undeniably very useful for designs that do account for it.

My current random-access XML readers ignore the fact that mzML and mzXML support built-in indexes and build the index from scratch every time. Eventually, I'll add support to allow for trusting existing indexes. The design jump from sequential XML parsing to random-access XML parsing is far more treacherous than the jump from always building an index to sometimes building one and sometimes not.

-Matt

Mike Coleman wrote:
> For what it's worth, I'd be in favor of removing *all* redundant
> information from the format (with the possible exception of a
> checksum). This would include the index, derivable counts, and
> anything else that can be determined by inspection.
>
> The general argument for doing this would be that it would eliminate a
> whole class of design decisions of the form "What do I do if thing A and
> thing B, which by definition are supposed to be consistent, are not?"
> It's easy to say "just reject the file", but in reality, we won't be
> able to do that. That leaves us with writing code to try to correct
> the inconsistencies, for all of the different ways that they occur
> across different producers and different versions thereof, and
> arguably that will be more complex than the code that, for example,
> does things like build indices in the first place.
>
> Mike |
From: Mike C. <tu...@gm...> - 2008-01-24 16:14:13
|
For what it's worth, I'd be in favor of removing *all* redundant information from the format (with the possible exception of a checksum). This would include the index, derivable counts, and anything else that can be determined by inspection.

The general argument for doing this would be that it would eliminate a whole class of design decisions of the form "What do I do if thing A and thing B, which by definition are supposed to be consistent, are not?"

It's easy to say "just reject the file", but in reality, we won't be able to do that. That leaves us with writing code to try to correct the inconsistencies, for all of the different ways that they occur across different producers and different versions thereof, and arguably that will be more complex than the code that, for example, builds the indices in the first place.

Mike

On Jan 24, 2008 9:47 AM, Brian Pratt <bri...@in...> wrote:
> >> 6) Regarding the length attribute in offset, I am neutral on this.
> >> This makes it a little harder for the writers. I can see that it
> >> would be easier for random access readers. Darren says he's not
> >> interested in it.
> >> Anyone else out there want to lobby for it?
>
> Not me. I never felt the need for it in mzXML - it's derivable anyway as
> length[n]=offset[n+1]-offset[n], or EOF-offset[n] when n==nmax. Keep it
> simple.
|
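To make the "determined by inspection" point concrete, here is a minimal Python sketch of a reader rebuilding a spectrum index by scanning the raw bytes. It is not from the thread or from any mzML library: the function name `build_spectrum_index` is hypothetical, and it assumes every `<spectrum>` start tag carries an `id` attribute and that the text `<spectrum ` does not occur inside comments or CDATA.

```python
import re

def build_spectrum_index(data: bytes) -> list[tuple[str, int]]:
    """Return (id, byte offset) pairs, one per <spectrum ...> start tag.

    Hypothetical helper: scans raw bytes rather than parsing XML, so it
    assumes '<spectrum ' never appears inside comments or CDATA sections.
    """
    # '\s' after 'spectrum' keeps '<spectrumList ...>' from matching.
    pattern = re.compile(rb'<spectrum\s[^>]*?\bid="([^"]*)"')
    return [(m.group(1).decode("utf-8"), m.start())
            for m in pattern.finditer(data)]

data = (b'<spectrumList count="2">'
        b'<spectrum id="S19" scanNumber="19"></spectrum>'
        b'<spectrum id="S20" scanNumber="20"></spectrum>'
        b'</spectrumList>')
index = build_spectrum_index(data)
```

The trade-off is a full pass over the file on open, which is the cost a stored index is meant to avoid.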
From: Matthew C. <mat...@va...> - 2008-01-24 16:04:27
|
Brian, you're assuming that no junk (comments, user params, whatever) is allowed between the spectrum elements. Otherwise you'd get a fragment of XML that isn't very parser friendly (it would be multi-rooted). Is the no-junk assumption safe?

As for the spectrumRef, unless we plan to support referencing a spectrum outside the current run as a precursor, the scan number makes more sense than the id. If we DO plan to support that, I would support referencing the id instead, but it would definitely need to go into the index at that point.

-Matt

Brian Pratt wrote:
>>> 6) Regarding the length attribute in offset, I am neutral on this.
>>> This makes it a little harder for the writers. I can see that it
>>> would be easier for random access readers. Darren says he's not
>>> interested in it.
>>> Anyone else out there want to lobby for it?
>
> Not me. I never felt the need for it in mzXML - it's derivable anyway as
> length[n]=offset[n+1]-offset[n], or EOF-offset[n] when n==nmax. Keep it
> simple.
|
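Matt's multi-rooted concern can be illustrated with a short Python sketch using the standard `xml.etree.ElementTree` module. The fragments and the helper name `parses_as_single_element` are made up for illustration; the point is only that a byte range cut from one offset to the next stops being a well-formed single-element document once another element sits between two spectra.

```python
import xml.etree.ElementTree as ET

def parses_as_single_element(fragment: str) -> bool:
    """True if the fragment is a well-formed document with one root element."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Hypothetical byte ranges, as if sliced from offset[n] up to offset[n+1].
clean = '<spectrum id="S19" scanNumber="19"/>'
with_junk = clean + '<userParam name="note" value="x"/>'

clean_ok = parses_as_single_element(clean)
junk_ok = parses_as_single_element(with_junk)
```

A reader that derives lengths from consecutive offsets would need either the no-junk guarantee or a parser that stops at the end of the first element.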
From: Brian P. <bri...@in...> - 2008-01-24 15:47:28
|
>> 6) Regarding the length attribute in offset, I am neutral on this.
>> This makes it a little harder for the writers. I can see that it
>> would be easier for random access readers. Darren says he's not
>> interested in it.
>> Anyone else out there want to lobby for it?

Not me. I never felt the need for it in mzXML - it's derivable anyway as length[n]=offset[n+1]-offset[n], or EOF-offset[n] when n==nmax. Keep it simple.
|
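The derivation Brian gives can be sketched in a few lines of Python. This is an illustration only: `entry_lengths` and its `end_of_data` parameter are hypothetical names, the offsets are the sample values from the thread, and the sketch assumes the offsets are listed in ascending file order. `end_of_data` stands in for Brian's "EOF"; in a file whose index follows the spectra it would more likely be the byte where the spectrum data ends.

```python
def entry_lengths(offsets, end_of_data):
    """length[n] = offset[n+1] - offset[n]; the last entry runs to end_of_data.

    Assumes offsets are sorted ascending (file order).
    """
    lengths = []
    for n, start in enumerate(offsets):
        stop = offsets[n + 1] if n + 1 < len(offsets) else end_of_data
        lengths.append(stop - start)
    return lengths

# Sample offsets from the thread; 20000 is an arbitrary illustrative end position.
lengths = entry_lengths([3512, 16217, 17258], end_of_data=20000)
```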
From: Philip J. <pj...@eb...> - 2008-01-24 14:45:26
|
Hi,

Just a minor request for the mzML schema - this does not involve any change to the XML format, just the way the schema is constructed.

It would be useful if the schema definition was consistent in the way it defines <complexType/> elements, with regard to giving them all explicit names and referencing them from wherever they are used - there are a handful of instances in the schema that contain 'anonymous' complexType definitions within an element definition.

We have been working with the mzML schema using JAXB (a Java library for converting XML files to a Java object model). The 'anonymous' complexType definitions play havoc with JAXB - however the problem is very easily fixed without making the schema any more complex than it already is. We suspect this may also be useful for libraries other than JAXB.

We have already made these changes locally to schema version 0.99.1. All of the example mzML files at psidev.info validate against the modified version. You can find our version of the schema here:

http://www.ebi.ac.uk/~pjones/mzML0.99.1_mod.xsd

Best regards,

Phil Jones & Richard Cote.

--
Phil Jones
Software Engineer
PRIDE Project Team
PANDA Group, EMBL-EBI
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
UK.
Work phone: +44 1223 492610
Skype: philip-jones
|
From: Rune S. P. <mai...@ph...> - 2008-01-24 10:38:29
|
Eric Deutsch wrote:
> a) At some point in the future if we have multiple runs per file (not
> supported in this first release) it will continue to be true that a
> spectrum id must be unique within a file. We decided that external
> references from analysisXML, for example, should be to a unique id
> rather than a scan number.

I don't like the idea of an id. I find it unnecessarily redundant. If you have several runs in a file and you need to specify a scan number in a specific run, why not just specify both run and scan number? Combined, these keys are unique. I think this will be easier and more intuitive to manage.

> 5) I also do not see the need for the index attribute. I think it should
> be left out, but if there is still a clear need, we could add.

What index attribute are you referring to here?

> 7) Regarding going fully attribute: the intent was to preserve the
> format of the mzXML index as closely as possible to reduce coding work,
> but a better syntax could be entertained. I wouldn't object to:
>
> <offset scanNumber="19" id="S19" byteOffset="3512"/>

I prefer having the offset value like so:

<offset scanNumber="19">offset</offset>

If you decide to record id and length too then include those as attributes.

--
Regards
Rune
|
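For comparison, here is a minimal Python sketch of how a reader might load an index written in the element-text style Rune prefers, with `id` as an optional attribute. The sample index reuses the values from the thread; `load_index` and the two lookup tables are hypothetical names, not part of any mzML library, and the draft schema under discussion may differ.

```python
import xml.etree.ElementTree as ET

# Sample index in the element-text style, using values quoted in the thread.
INDEX_XML = """
<index name="spectrum">
  <offset scanNumber="19" id="S19">3512</offset>
  <offset scanNumber="20" id="S20">16217</offset>
</index>
"""

def load_index(xml_text):
    """Build scanNumber -> offset and id -> offset lookup tables."""
    by_scan, by_id = {}, {}
    for entry in ET.fromstring(xml_text).iter("offset"):
        byte_offset = int(entry.text)
        by_scan[int(entry.get("scanNumber"))] = byte_offset
        if entry.get("id") is not None:  # id is treated as optional here
            by_id[entry.get("id")] = byte_offset
    return by_scan, by_id

by_scan, by_id = load_index(INDEX_XML)
```

With both tables in memory, resolving a spectrumRef by id or a request by scan number is a single dictionary lookup followed by a seek to the stored byte offset.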
From: Eric D. <ede...@sy...> - 2008-01-24 03:40:00
|
Hi Darren et al., thanks for this discussion. A few points from me:

1) It was decided (although not set in stone) that we would like to have a unique id per spectrum per file for two reasons:

a) At some point in the future if we have multiple runs per file (not supported in this first release) it will continue to be true that a spectrum id must be unique within a file. We decided that external references from analysisXML, for example, should be to a unique id rather than a scan number.

b) We also left the door open to use LSIDs for the id. It can be any unique string, and thus if someone wants to use LSIDs for this, the door is open.

2) Also, according to the docs, the precursor scan references are by id rather than by scan number (although this appears to be incorrectly represented in tiny1.mzML: FIXME). I think the spectrumRef should be to an id and thus it needs to be in the index for things to work nicely. I see that Matt disagrees that spectrumRef should be to an id.

3) It is true that the current examples only have scanNumber in the index and not id. This should also be fixed before release.

4) The spec document indeed currently says that scanNumbers in the file must be in ascending order, but not necessarily sequential. The comment could perhaps be a little more clearly written.

5) I also do not see the need for the index attribute. I think it should be left out, but if there is still a clear need, we could add it.

6) Regarding the length attribute in offset, I am neutral on this. It makes things a little harder for the writers. I can see that it would be easier for random access readers. Darren says he's not interested in it. Anyone else out there want to lobby for it?

7) Regarding going fully attribute: the intent was to preserve the format of the mzXML index as closely as possible to reduce coding work, but a better syntax could be entertained. I wouldn't object to:

  <offset scanNumber="19" id="S19" byteOffset="3512"/>

8) Regarding enforcing some of these index rules, we should add them to the validator so that the validator enforces them.

Comments on these items?

Thanks,
Eric |
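The index rules mentioned in points 4 and 8 of the message above (ascending but not necessarily sequential scan numbers, ids unique within a file, index entries in file order) could be checked along the following lines. This is only a minimal sketch: the function name `check_index` and the tuple layout of its input are assumptions made for illustration and are not taken from the mzML validator.

```python
def check_index(entries):
    """Return a list of problems found in a spectrum index.

    `entries` is a list of (scan_number, spectrum_id, byte_offset) tuples
    in the order they appear in the <index>. This layout is hypothetical,
    chosen only for the sketch.
    """
    problems = []
    seen_ids = set()
    prev_scan = None
    prev_offset = None
    for pos, (scan, spectrum_id, offset) in enumerate(entries):
        # ids must be unique within a file
        if spectrum_id in seen_ids:
            problems.append(f"entry {pos}: duplicate id {spectrum_id!r}")
        seen_ids.add(spectrum_id)
        # scan numbers must ascend, but gaps are allowed
        if prev_scan is not None and scan <= prev_scan:
            problems.append(
                f"entry {pos}: scanNumber {scan} does not exceed {prev_scan}")
        # entries written in file order imply increasing byte offsets
        if prev_offset is not None and offset <= prev_offset:
            problems.append(
                f"entry {pos}: byte offset {offset} does not exceed {prev_offset}")
        prev_scan, prev_offset = scan, offset
    return problems
```

An index with gaps in the scan numbers, such as 19, 23, 25, passes these checks; only repeated ids, non-ascending scan numbers, or out-of-order offsets are reported.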
From: Kessner, D. E. <Dar...@cs...> - 2008-01-23 19:32:18
|
Matt, thanks for the clarification on the order guarantee for scan numbers.

If scanNumber is unique, I agree that I don't see the point of 'id'. But if we use 'id', and especially if we reference based on 'id', we should have it in the index.

As for "length": I'm using a stream-based parser that will read in a single element, so I won't be needing it. I could see it being useful for other parsers though, in particular if you want to memory-map that portion of the file before doing the text processing.

Darren |
From: Kessner, D. E. <Dar...@cs...> - 2008-01-23 19:18:57
|
Right -- that's why I included the alternative, though I could have been less terse:

"The alternative is to require that the <index> entries are written in the same order as the <spectrumList> entries."

I don't know if there is a way to enforce this...

Darren |
From: Matthew C. <mat...@va...> - 2008-01-23 19:18:26
|
Hi Darren,

Since scan numbers are guaranteed to be in ascending order in the spectrumList (established elsewhere in this mailing list), it makes sense to extend that guarantee to the index. Also, "spectrumRef" should refer to the scanNumber, not the "id" - the "id" can be any unique string and I don't see why referencing based on that is desirable. I acknowledge the consistency problem with having an "id" attribute that a "Ref" attribute ignores in favor of a non-"id" attribute, but if "id" is not simply the scan number, then it should be somewhat irrelevant (like the "title" attribute in MGF). So the <offset> can still function unambiguously with only a "scanNumber" attribute.

However, one thing I would like to see is not just an offset, but a size of the spectrum element to really make reading via the index easy and as fast as possible (instead of fumbling around with code and CPU cycles to figure out where the indexed spectrum element ends, the entire block can be read with one call). Thus, I would like to see:

  <offset scanNumber="19" byteOffset="3512" length="12705" />

At first glance, you might think that simply reading until the next offset would work, but that might include a bunch of unexpected elements if comments are allowed in the spectrumList, e.g.:

  <spectrum>...</spectrum><comment>foo</comment><spectrum></spectrum>

If such comments aren't allowed in the list and the next element is guaranteed to be the next <spectrum> element, then the length attribute is unnecessary, so I'd like to get that clarified.

-Matt |
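The difference between reading with and without a `length` attribute, as argued in the message above, can be sketched as follows. The function names are hypothetical and the code assumes a seekable binary stream; it illustrates the trade-off and is not how any existing mzML reader is known to work.

```python
import io


def read_spectrum_block(stream, byte_offset, length):
    """Read one indexed <spectrum> element with a single seek and read."""
    stream.seek(byte_offset)
    block = stream.read(length)
    if len(block) != length:
        raise ValueError("index entry points past the end of the stream")
    return block


def read_until_close_tag(stream, byte_offset, chunk_size=4096):
    """Fallback without a length: scan forward for the closing tag."""
    closing = b"</spectrum>"
    stream.seek(byte_offset)
    buffer = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("no closing </spectrum> tag found")
        buffer += chunk
        # search the whole buffer so a tag split across chunks is still found
        end = buffer.find(closing)
        if end != -1:
            return bytes(buffer[:end + len(closing)])
```

With a length, the reader never has to inspect the bytes it reads, and anything sitting between two spectrum elements is skipped automatically. Without it, the reader has to search for the end of the element, which is the extra work the proposal aims to avoid.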
From: Brian P. <bri...@in...> - 2008-01-23 19:08:41
|
Hi Darren,

I wonder about this possibility:

  <index name="spectrum" >
    <offset index="0" scanNumber="19" id="S19">3512</offset>
    <offset index="2" scanNumber="23" id="S23">16217</offset>
    <offset index="4" scanNumber="25" id="S25">17258</offset>
    ...
  </index>

If the response is "well, that's not legal, the index values must increase in increments of 1 starting from 0" then I don't see why it's needed in the first place - I'd expect that the index would just get snarfed up into an array and you'd access the nth element to get info on the nth scan appearing in the file. And if it is legal then I don't see what it's for.

Brian |
From: Kessner, D. E. <Dar...@cs...> - 2008-01-23 18:32:50
|
Hi all,

There are three ways to refer to a <spectrum> element -- by zero-based index into the <spectrumList>, by 'scanNumber', and by 'id'. However, the <index> currently only contains scanNumber. I would like to encode the zero-based index and the id as well in the <index> as follows:

  <index name="spectrum" >
    <offset index="0" scanNumber="19" id="S19">3512</offset>
    <offset index="1" scanNumber="20" id="S20">16217</offset>
    ...
  </index>

Including the zero-based index is important to enable random access to the mzML file when you don't know what scan numbers are contained in the file. The alternative is to require that the <index> entries are written in the same order as the <spectrumList> entries.

Including the 'id' in the <index> entries is necessary for efficiently dereferencing a "spectrumRef" (e.g. in <precursor> element). Without this, a dereference requires reading through the <spectrumList> to find the right 'id'. This info could be read once and cached, but this still defeats the purpose of indexing.

Darren

Darren Kessner
Scientific Programmer
Dar...@cs...
310-423-9538

Spielberg Family Center for Applied Proteomics
Cedars-Sinai Medical Center
http://www.sfcap.cshs.org/

IMPORTANT WARNING: This message is intended for the use of the person or entity to which it is addressed and may contain information that is privileged and confidential, the disclosure of which is governed by applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this information is STRICTLY PROHIBITED. If you have received this message in error, please notify us immediately by calling (310) 423-6428 and destroy the related message. Thank You for your cooperation. |
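To make the three ways of referring to a spectrum concrete, here is a small sketch that loads the proposed <index> layout into three lookups: by position, by scanNumber, and by id. The attribute names follow the proposal in the message above; the function name `load_index` and the absence of XML namespaces are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET

# Example index in the layout proposed in the message above.
INDEX_XML = """
<index name="spectrum">
  <offset index="0" scanNumber="19" id="S19">3512</offset>
  <offset index="1" scanNumber="20" id="S20">16217</offset>
</index>
"""


def load_index(xml_text):
    """Return (by_position, by_scan, by_id) lookups of byte offsets."""
    root = ET.fromstring(xml_text)
    by_position = []   # list position doubles as the zero-based index
    by_scan = {}       # scanNumber -> byte offset
    by_id = {}         # id -> byte offset, for dereferencing spectrumRef
    for entry in root.iter("offset"):
        byte_offset = int(entry.text)
        by_position.append(byte_offset)
        by_scan[int(entry.get("scanNumber"))] = byte_offset
        by_id[entry.get("id")] = byte_offset
    return by_position, by_scan, by_id


by_position, by_scan, by_id = load_index(INDEX_XML)
```

If the entries are guaranteed to be written in the same order as the <spectrumList>, the position in the list already serves as the zero-based index, so an explicit `index` attribute adds no information. The `by_id` lookup is what lets a "spectrumRef" be resolved without scanning the <spectrumList>.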
From: Eric D. <ede...@sy...> - 2008-01-22 22:00:46
|
Hi Darren,

Many thanks for writing this and making it available. As Angel recently pointed out, we really need to demonstrate that mzML works in an existing pipeline before we release. I would very much like to get a working version of the RAMP library for mzML as this would allow us to test our tools with mzML and close the loop. I hope your very nice library here can help us implement this quickly.

Thanks!
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Kessner, Darren E.
Sent: Monday, January 21, 2008 3:48 PM
To: psi...@li...
Subject: [Psidev-ms-dev] MSData: C++ library for mzML reading/writing

Hi all,

I just wanted to introduce a C++ library I've been working on for handling mzML (among other things).

Some background: Our group (Spielberg Family Center for Applied Proteomics, Cedars-Sinai Medical Center) has a number of software tools for analyzing RAW and mzXML data files. This library (MSData) is the next version of our data file access layer. The MSData library implements a C++ representation of mzML.

Current implemented functionality:

- compile-time parsing of the psi-ms.obo controlled vocabulary file to generate C++ code for typesafe use of the controlled vocabulary terms

- mzML <-> MSData data structure mapping, including reading/writing mzML XML fragments from/to C++ iostreams

- diff functionality for each MSData data structure

- binary data array encoding with 64-32 bit conversions and endianization

- abstract interface to spectrum binary data, to allow lazy evaluation of binary data backed by data files

Next steps:

- mzXML reading/writing, RAW reading

- simple analysis interface for accessing scan data and common meta-data fields

I put a source package here:

http://www.sfcap.cshs.org/private

login: psidev
pass: pwiz

We will be making this library available open source, as soon as we finish wading through our legal department's bureaucracy.

Please have a look, and feel free to email me with any questions, comments, or suggestions!

Darren |
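As an illustration of what "binary data array encoding with 64-32 bit conversions and endianization" involves, here is a minimal Python sketch that packs a list of floating-point values at a chosen precision and byte order and base64-encodes the result, which is how mzML carries binary arrays as text. It is not the MSData implementation, which is written in C++ and is not shown in this thread; the function names and parameters are assumptions.

```python
import base64
import struct


def _format(count, bits, little_endian):
    """Build a struct format string such as '<3d' or '>3f'."""
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    order = "<" if little_endian else ">"
    code = "d" if bits == 64 else "f"
    return f"{order}{count}{code}"


def encode_array(values, bits=64, little_endian=True):
    """Pack floats at the given precision and byte order, then base64."""
    raw = struct.pack(_format(len(values), bits, little_endian), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, bits=64, little_endian=True):
    """Inverse of encode_array; the caller must know bits and byte order."""
    raw = base64.b64decode(text)
    count = len(raw) // (bits // 8)
    return list(struct.unpack(_format(count, bits, little_endian), raw))
```

Encoding at 32 bits halves the size of the array but rounds each value to single precision, so a round trip is exact only for values that a 32-bit float can represent. The reader also has to be told the precision and byte order, since neither can be recovered from the encoded text alone.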
From: Eric D. <ede...@sy...> - 2008-01-22 21:55:58
|
Hi Darren,

Many thanks for these suggestions; these are very good points that we will address. The official review period for mzML is ending soon and Norman Paton is collecting the reviews and will hopefully forward them to us soon. At that point we will address all of the comments we have received.

Thanks!
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Kessner, Darren E.
Sent: Tuesday, January 22, 2008 8:14 AM
To: psi...@li...
Subject: [Psidev-ms-dev] mzML consistency miscellanea

Hi all,

I've collected some notes regarding the mzML spec:

1) There are references in the specification document to InstrumentType, SampleType, etc. that I assume mean <instrument> element, <sample> element, etc., though this is not explicitly stated anywhere.

2) The <precursor> element has a spectrumRef attribute that is supposed to refer to the id attribute of a <spectrum>. However, the <precursor> element in tiny1.mzML0.99.1.mzML appears to refer to a scanNumber, not id. Which is the intended attribute to reference (I assume 'id')?

3) The <cv> element has the attribute fullName="Proteomics Standards Initiative Mass Spectrometry Ontology". This text does not appear in psi-ms.obo - perhaps it should? Basically, I think it would be useful to have some identifier that appears in both psi-ms.obo and in mzML files generated with that psi-ms.obo. Or even better, an id and a version, just like the <softwareParam> elements, but in the psi-ms.obo it could appear in the header.

4) Regarding <softwareParam> elements, is there a reason not to use two of the more general <cvParam> elements, one to specify the software, and one to specify the version?

5) Element reference naming consistency -- in many cases, there is an element name and a corresponding (either attribute or element) name for a reference to it:

  <instrument> <-- instrumentRef
  <sourceFile> <-- sourceFileRef
  <spectrum> <-- spectrumRef

But there are a few exceptions:

  <referenceableParamGroup> <-- paramGroupRef
  <software> <-- softwareRef AND instrumentSoftwareRef

Suggestions:

  Replace <referenceableParamGroup> with <paramGroup>
  Remove <instrumentSoftwareRef> and use <softwareRef>

Since the id attribute is usually used for references, we could also have:

  <cv id="MS" ... >
  ...
  <cvParam cvRef="MS" ...>

There is also some redundancy in the naming of <sourceFile> attributes:

  <sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01" >

could be shortened to:

  <sourceFile id="1" name="tiny1.RAW" location="file://F:/data/Exp01" >

Darren |
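The mismatch described in point 2 of the quoted notes (a spectrumRef that holds a scan number where an id is expected) is the kind of inconsistency a small check can catch. The sketch below assumes a simplified, namespace-free document in which each <spectrum> carries `id` and `scanNumber` attributes; the function name and the wording of the findings are hypothetical.

```python
import xml.etree.ElementTree as ET


def unresolved_spectrum_refs(xml_text):
    """List spectrumRef values that do not match any <spectrum> id."""
    root = ET.fromstring(xml_text)
    ids = {s.get("id") for s in root.iter("spectrum")}
    scans = {s.get("scanNumber") for s in root.iter("spectrum")}
    findings = []
    for precursor in root.iter("precursor"):
        ref = precursor.get("spectrumRef")
        if ref in ids:
            continue  # resolves to an id as intended
        # distinguish the "scan number used as ref" case from a dangling ref
        hint = "matches a scanNumber" if ref in scans else "matches nothing"
        findings.append((ref, hint))
    return findings
```

For a file where one precursor has spectrumRef="19" and the spectra have ids "S19" and "S20", this reports ("19", "matches a scanNumber"), which points directly at the id versus scanNumber question raised in the notes.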