From: Brian P. <bri...@in...> - 2007-08-03 00:03:49
|
Hi Matt,

OK, I see the disconnect - you aren't using an API for reading mass spec data, you're using an API for reading XML (expat - an excellent choice). You're speaking in terms of "the parser", but the APIs we're concerned with (RAMP, JRAP) are front ends to multiple parsers. They abstract the mass spec file format choice away from the logic that deals with mass spec data, which keeps us from needing to change a couple dozen programs (along with others we don't even know about, since RAMP and JRAP are open source) when a new format pops up.

So yes, you'll certainly need to make extensive parser changes to deal with mzML, as will RAMP and JRAP. And if you want to retain the ability to read the mass spec files you're already reading, you'll need to somehow deal with using multiple parsers inside your code. In short, you'll need to create your own mass spec file reader API.

So why all the excitement about API stability? Consider this: originally, RAMP read mzXML only. Then we added the ability to read mzData. Suddenly all of the many programs that employ RAMP could read both mzData and mzXML with nothing more than a recompilation (OK, that first time actually required a small RAMP API tweak - using a RAMPFILE handle instead of a FILE handle). Later we added mzXML 3.0 with its compressed peak lists, and RAMP users only needed to recompile to get this additional capability - no "downstream" changes needed. There have even been unreleased versions of RAMP that read intermediate proposed forms of mzML. Such ease of adoption is very powerful when trying to establish a new data standard.

But guess what? RAMP can't be made to transparently handle the current proposed mzML format, because it breaks the one file / one run mapping.

Truly new mass spec behaviors will eventually make it necessary to expand or even break the current mass spec data reader APIs. Multiple precursors are actually a good example of this (as an expansion, hopefully). But breaking the one run / one file relationship isn't driven by any new mass spec behavior that I know of. What is the use case for this feature, anyway? What's so compelling about having multiple runs in a single mzML file that everyone will want to massively rejigger their code to implement this? Seems like we're just creating an orphan feature that will only serve to trip up unwary mzML output writers ("nice multi-run output ya got there - too bad nobody can read it"), which I think is exactly the kind of thing the committee said they wanted to avoid.

- Brian
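The "front end to multiple parsers" idea in Brian's message can be sketched roughly as below. This is only an illustration under stated assumptions, not RAMP's or JRAP's actual API: `Scan`, `read_scans`, `PARSERS`, and the two toy parsers are hypothetical names, and the parsers read whitespace- or comma-separated MS levels rather than real XML.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scan:
    """Format-independent view of one spectrum (hypothetical structure)."""
    number: int
    ms_level: int


# Stand-in parsers: a real reader would parse mzXML or mzData XML here.
def parse_mzxml(text: str) -> List[Scan]:
    return [Scan(i + 1, int(tok)) for i, tok in enumerate(text.split())]


def parse_mzdata(text: str) -> List[Scan]:
    return [Scan(i + 1, int(tok)) for i, tok in enumerate(text.split(","))]


# Supporting a new format means registering one more parser here;
# code that calls read_scans() does not change.
PARSERS: Dict[str, Callable[[str], List[Scan]]] = {
    ".mzxml": parse_mzxml,
    ".mzdata": parse_mzdata,
}


def read_scans(filename: str, text: str) -> List[Scan]:
    """Single entry point: callers never see which parser ran."""
    for ext, parser in PARSERS.items():
        if filename.lower().endswith(ext):
            return parser(text)
    raise ValueError(f"unsupported format: {filename}")
```

Note that the return type is a single flat list of scans. That is where the one file / one run assumption lives in a sketch like this, and it is why a multi-run file cannot simply be slid in behind the same call.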
From: Matthew C. <mat...@va...> - 2007-08-02 22:11:10
|
> -----Original Message-----
> From: Joshua Tasman [mailto:jt...@sy...]
> Sent: Thursday, August 02, 2007 5:03 PM
> To: Matthew Chambers
> Cc: 'Brian Pratt'; psi...@li...
> Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review
>
> Hi Matt,
>
> As the person writing both the writers and readers (at least for now)--
> a brief comment:
>
> > Parameter groups, multiple runs, multiple precursors, and
> > compressed binary data are all major "completely predictable trouble
> > spots."
>
> Brian is correct-- neither parameter groups or compressed binary data
> change the expected relationship of scans-to-file. (Nor do multiple
> precursor, which will require downstream code changes, but only apply to
> MS level > 2 scans and it's important info to get in there anyhow, so a
> good downstream change to make.) RAMP already reads compressed mzXML,
> for example.
>
> -Josh

I didn't mean to imply that parameter groups or multiple precursors changed the relationship of scans-to-file, only that they change the parser and underlying data structure quite significantly. I don't use RAMP in my C++ code, I wrote my own simple expat-based parser, but I know changes will have to be made with the new format.

When I said "readers develop faster than the file writers" I meant all the way downstream to the UI (i.e. supporting multiple sources in a single input file). In any case, even before those downstream changes are made, a RunList::count == 1 dependent parser can simply only read the first run from a multi-run file, just like some parsers may only read MS>=2 spectra from the file.

-Matt
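Matt's fallback, where a reader that assumes a single run just takes the first run from a multi-run file, might look something like the sketch below. The element and attribute names (`runList`, `run`, `spectrum`, `msLevel`) and the sample document are assumptions made for illustration; they are not taken from the mzML 0.93 schema.

```python
import xml.etree.ElementTree as ET

# Toy document. Element and attribute names are illustrative assumptions,
# not the actual mzML 0.93 schema.
SAMPLE = """
<mzML>
  <runList count="2">
    <run id="run1">
      <spectrum id="s1" msLevel="1"/>
      <spectrum id="s2" msLevel="2"/>
    </run>
    <run id="run2">
      <spectrum id="s3" msLevel="2"/>
    </run>
  </runList>
</mzML>
"""


def read_first_run(xml_text, min_ms_level=1):
    """Return (run id, spectrum ids) for the first run only.

    Any later runs are silently ignored, in the same way that a reader
    may skip spectra below a chosen MS level.
    """
    root = ET.fromstring(xml_text)
    runs = root.findall("./runList/run")
    if not runs:
        raise ValueError("no run element found")
    first = runs[0]
    ids = [
        s.get("id")
        for s in first.findall("spectrum")
        if int(s.get("msLevel")) >= min_ms_level
    ]
    return first.get("id"), ids
```

A reader written this way stays usable on multi-run input, at the cost of dropping data without telling the user, which is the trade-off the thread is arguing about.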
From: Joshua T. <jt...@sy...> - 2007-08-02 22:03:16
|
Hi Matt,

As the person writing both the writers and readers (at least for now)-- a brief comment:

> Parameter groups, multiple runs, multiple precursors, and
> compressed binary data are all major “completely predictable trouble
> spots.”

Brian is correct-- neither parameter groups or compressed binary data change the expected relationship of scans-to-file. (Nor do multiple precursor, which will require downstream code changes, but only apply to MS level > 2 scans and it's important info to get in there anyhow, so a good downstream change to make.) RAMP already reads compressed mzXML, for example.

-Josh

Matthew Chambers wrote:
> What’s wrong with the schema supporting multiple runs per file and
> letting implementers gradually add support for it? There are many
> features of mzML that will require substantial rewrites of the existing
> parser APIs. Parameter groups, multiple runs, multiple precursors, and
> compressed binary data are all major “completely predictable trouble
> spots.” As long as the file readers develop faster than the file
> writers, there won’t be a problem. ;) I very much doubt that writers
> (e.g. ReAdW) will be writing multiple instrument files into one mzML
> file any time soon (unless somebody is itching to do this without saying
> so?). The parameter groups and multiple precursors are more
> problematic, IMO, but still good improvements.
>
> I have a few comments:
>
> - There seems to be a timestamp on the run element now (maybe I just
> missed it before), of type xs:dateTime. It’s an optional attribute and
> it has an ambiguous meaning. Why isn’t this expanded into a start and
> stop timestamp for the run? Also, why is it optional?
>
> - Most every cvParam has a “cvLabel” attribute that is “MS” but the
> accession attribute of each cvParam seems to include the cvLabel in it
> (“MS:xxxxxxxx”). If that is just a coincidence, I think it should be
> changed so that it is required and the cvLabel can be eliminated. If
> it’s not a coincidence, why is like that? If the parser needs to know
> which vocabulary an accession number is from, it can parse until the
> colon delimiter. Alternatively, keeping cvLabel and getting rid of the
> “MS:” in the accession attribute would allow somewhat more efficient
> parsing. In the alternative case, I suggest a required default cvLabel
> somewhere in the header, similar to setting the default XML namespace.
>
> - I see a TODO item is giving the binaryDataArray’s “dataType” attribute
> a CV entry. I agree with this. But I think the values should be more
> machine-oriented, like “float32”, “float64”, “int32”, “uint64”, etc.
>
> - Parameter groups are good, especially since the spectrum headers seem
> to have ballooned to be more flexible. Anything that makes the
> file-dominating spectrum elements smaller and faster to parse is nice -
> indexing the shared parameters is a good way to do this.
>
> - I’d still like to see a clear definition of “run” relative to “sample”
> and “source file.” Seems like these three are all tightly coupled.
>
> -Matt Chambers
> Vanderbilt MSRC
>
> ------------------------------------------------------------------------
>
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt
> Sent: Thursday, August 02, 2007 2:47 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review
>
> (Note: I know I'm late to the party with this comment, but I think it's
> important)
>
> I noticed this in the todo file:
> " - Now that we’re allowing multiple runs in a file, how will the index
> look to handle this?"
>
> Better question: what will software that uses such an index look like?
>
> Answer: it won't look much like anything that currently reads mzXML and
> mzData - including X!Tandem or anything using RAMP (TPP and others)
> or JRAP (CPAS and others). These programs easily deal with both mzData
> and mzXML in their various versions by using APIs which, as it
> happens, assume one file per run and one run per file. Breaking this
> one to one correspondence in mzML means you can't just slide mzML
> support in behind the API, and of course also violates a fundamental
> assumption which flows through the code that calls these APIs, right out
> to the user interface in most cases. This means extensive surgery to
> any program that wants to read mzML properly, and my guess is that means
> mzML is DOA. At a minimum it becomes a completely predictable trouble
> spot since you can now write legal mzML files that the majority of mzML
> readers will simply not know how to handle. They'll be OK with
> RunList::count == 1, but no more - so, why set ourselves up for trouble?
>
> Multiple runs per file are probably useful in some cases, but if the
> stated goal of mzML is to replace mzXML and mzData then I think this
> feature is actually scope creep which threatens the mission and should
> be dropped. Let those who really want this feature come up with a
> wrapper schema, but don't call it mzML lest you force the vast majority
> of mzML consuming software to be broken from the start.
>
> - Brian
>
> ------------------------------------------------------------------------
>
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
> Sent: Thursday, August 02, 2007 1:02 AM
> To: len...@eb...; Jimmy Eng; lu...@eb...; Puneet
> Souda; Joshua Tasman; Pierre-Alain Binz; Henning Hermjakob; Randy
> Julian; Andy Jones; David Creasy; Sean L Seymour; Angel Pizarro; David
> Fenyo; Jam...@wa...; Mike Coleman; Matthew Chambers; Helen
> Jenkins; Philip Jones; Shofstahl, Jim; Brian Pratt; Andreas Römpp; Kent
> Laursen; Martin Eisenacher; Fredrik Levander; Jayson Falkner; Pedrioli
> Patrick Gino Angelo; Hans Vissers; Eric Deutsch; cl...@br...;
> dav...@ag...; rb...@be...;
> psi...@li...
> Cc: Rolf Apweiler; Ruedi Aebersold
> Subject: [Psidev-ms-dev] mzML 0.93 ready for first review
>
> Hi everyone, after considerable hard work from many people, we have a
> prerelease of mzML (the union of mzData and mzXML) available for comment
> by you, a major stakeholder in mzML.
>
> You may download a kit of material to examine at:
>
> http://db.systemsbiology.net/projects/PSI/mzML/mzML_beta1R1.zip
>
> The general mzML development page is at:
>
> http://psidev.info/index.php?q=node/257
>
> Please send feedback to:
>
> psi...@li...
>
> We ask that you respond by August 20.
>
> Additional releases with more information may be provided during the
> coming month.
>
> The current format has been guided by these principles:
>
> - Keep the format simple
> - Minimize alternate ways of encoding the same information
> - Allow some flexibility for encoding new important information
> - Support the features of mzData and mzXML but not a lot more
> - But do provide clear support for SRM data
> - Finish the format soon with the resources available
>
> There are many enhancements that have been suggested, but the small
> group of volunteers that have actively developed this format have opted
> to focus on the primary goal set before us: develop a single format that
> the vendors and current software can easily support and thereby obsolete
> mzData and mzXML. The enhancements not considered compatible with this
> goal will be entertained for mzML 2.0
>
> We are committed to providing not just the format, but also a set of
> working implementations, converters and readers, as well as a format
> validator, all to ensure that mzML is a format that will be adopted
> quickly and implemented uniformly. Prior to submission to the PSI
> document process, the following software will implement mzML:
>
> - 2 or more converters from vendor formats to mzML
> - the popular reader library RAMP that currently supports mzData and mzXML
> - an mzML semantic validator that checks for correct implementation
>
> We hope to follow this schedule:
>
> 2007-08-02 Release of mzML beta1R1 to major stakeholders for comment
> 2007-08-20 Comments from major stakeholders received
> 2007-09-01 Revised mzML 1.0 submitted to PSI document process, beginning
> 30 days internal review
> 2007-10-01 Revised mzML 1.01 begins 60 days community review
> 2007-10-06 Formal announcement that feedback is sought at HUPO world
> congress
> 2007-12-01 Formal 60 days community review closes
> 2008-01-01 Revised mzML 1.02 officially released
>
> Thank you for your help! Feel free to forward this message to someone
> whom you think should review the format at this stage.
>
> Regards,
>
> Eric
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Psidev-ms-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev
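Josh's remark that RAMP already reads compressed mzXML refers to peak lists that are compressed before being embedded in the XML. A minimal round-trip sketch of that kind of encoding follows. It assumes base64 text wrapping zlib-compressed, big-endian 32-bit float (m/z, intensity) pairs; the function names are hypothetical and this is not RAMP's code.

```python
import base64
import struct
import zlib


def encode_peaks(pairs, compress=True):
    """Pack (m/z, intensity) pairs as big-endian float32, then base64."""
    raw = b"".join(struct.pack(">ff", mz, inten) for mz, inten in pairs)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text, compressed=True):
    """Reverse encode_peaks: base64 -> optional zlib -> float32 pairs."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per (m/z, intensity) pair
    values = struct.unpack(f">{2 * count}f", raw)
    return list(zip(values[0::2], values[1::2]))
```

Because decompression happens entirely inside `decode_peaks`, callers receive the same list of pairs either way, which is consistent with Josh's point that compression does not change the scans-to-file relationship.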
From: Matthew C. <mat...@va...> - 2007-08-02 21:37:00
|
What’s wrong with the schema supporting multiple runs per file and letting implementers gradually add support for it? There are many features of mzML that will require substantial rewrites of the existing parser APIs. Parameter groups, multiple runs, multiple precursors, and compressed binary data are all major “completely predictable trouble spots.” As long as the file readers develop faster than the file writers, there won’t be a problem. ;) I very much doubt that writers (e.g. ReAdW) will be writing multiple instrument files into one mzML file any time soon (unless somebody is itching to do this without saying so?). The parameter groups and multiple precursors are more problematic, IMO, but still good improvements.

I have a few comments:

- There seems to be a timestamp on the run element now (maybe I just missed it before), of type xs:dateTime. It’s an optional attribute and it has an ambiguous meaning. Why isn’t this expanded into a start and stop timestamp for the run? Also, why is it optional?

- Most every cvParam has a “cvLabel” attribute that is “MS” but the accession attribute of each cvParam seems to include the cvLabel in it (“MS:xxxxxxxx”). If that is just a coincidence, I think it should be changed so that it is required and the cvLabel can be eliminated. If it’s not a coincidence, why is it like that? If the parser needs to know which vocabulary an accession number is from, it can parse until the colon delimiter. Alternatively, keeping cvLabel and getting rid of the “MS:” in the accession attribute would allow somewhat more efficient parsing. In the alternative case, I suggest a required default cvLabel somewhere in the header, similar to setting the default XML namespace.

- I see a TODO item is giving the binaryDataArray’s “dataType” attribute a CV entry. I agree with this. But I think the values should be more machine-oriented, like “float32”, “float64”, “int32”, “uint64”, etc.

- Parameter groups are good, especially since the spectrum headers seem to have ballooned to be more flexible. Anything that makes the file-dominating spectrum elements smaller and faster to parse is nice - indexing the shared parameters is a good way to do this.

- I’d still like to see a clear definition of “run” relative to “sample” and “source file.” Seems like these three are all tightly coupled.

-Matt Chambers
Vanderbilt MSRC
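Two of Matt's suggestions lend themselves to a short sketch: recovering the vocabulary label by parsing the accession up to the colon (with a file-level default label as the alternative), and using machine-oriented data type names that map directly onto binary layouts. The function names, the `DATA_TYPES` table, and the accession value `MS:1234567` are all made up for illustration; none of them comes from the mzML schema or the PSI-MS vocabulary.

```python
import struct


def split_accession(accession, default_cv=None):
    """Split 'MS:1234567' into ('MS', '1234567').

    If the accession carries no prefix, fall back to a default label,
    which models the "required default cvLabel in the header" alternative.
    """
    prefix, sep, number = accession.partition(":")
    if sep:
        return prefix, number
    if default_cv is None:
        raise ValueError(f"no CV prefix and no default for {accession!r}")
    return default_cv, accession


# Machine-oriented type names mapped to struct format codes (assumed mapping).
DATA_TYPES = {"float32": "f", "float64": "d", "int32": "i", "uint64": "Q"}


def item_size(data_type):
    """Bytes per array element for a machine-oriented type name."""
    # "<" selects standard sizes, independent of the host platform.
    return struct.calcsize("<" + DATA_TYPES[data_type])
```

With names like these a reader can compute array lengths from the type name alone, which is presumably the appeal of "float32" over a more descriptive vocabulary term.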
From: Jimmy E. <jk...@gm...> - 2007-08-02 20:33:50
|
I completely agree; that line in the todo file piqued my interest too and I was going to follow up with Eric offline to get clarification. In the interest of getting something out and keeping it simple, I suggest that encapsulating multiple runs in a single instance document be dropped and put on the discussion list for the next rev of mzML.
From: Brian P. <bri...@in...> - 2007-08-02 19:48:03
|
(Note: I know I'm late to the party with this comment, but I think it's important)

I noticed this in the todo file:
" - Now that we're allowing multiple runs in a file, how will the index look to handle this?"

Better question: what will software that uses such an index look like?

Answer: it won't look much like anything that currently reads mzXML and mzData - including X!Tandem or anything using RAMP (TPP and others) or JRAP (CPAS and others). These programs easily deal with both mzData and mzXML in their various versions by using APIs which, as it happens, assume one file per run and one run per file. Breaking this one to one correspondence in mzML means you can't just slide mzML support in behind the API, and of course also violates a fundamental assumption which flows through the code that calls these APIs, right out to the user interface in most cases. This means extensive surgery to any program that wants to read mzML properly, and my guess is that means mzML is DOA. At a minimum it becomes a completely predictable trouble spot since you can now write legal mzML files that the majority of mzML readers will simply not know how to handle. They'll be OK with RunList::count == 1, but no more - so, why set ourselves up for trouble?

Multiple runs per file are probably useful in some cases, but if the stated goal of mzML is to replace mzXML and mzData then I think this feature is actually scope creep which threatens the mission and should be dropped. Let those who really want this feature come up with a wrapper schema, but don't call it mzML lest you force the vast majority of mzML consuming software to be broken from the start.

- Brian

_____

From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
Sent: Thursday, August 02, 2007 1:02 AM
To: len...@eb...; Jimmy Eng; lu...@eb...; Puneet Souda; Joshua Tasman; Pierre-Alain Binz; Henning Hermjakob; Randy Julian; Andy Jones; David Creasy; Sean L Seymour; Angel Pizarro; David Fenyo; Jam...@wa...; Mike Coleman; Matthew Chambers; Helen Jenkins; Philip Jones; Shofstahl, Jim; Brian Pratt; Andreas Römpp; Kent Laursen; Martin Eisenacher; Fredrik Levander; Jayson Falkner; Pedrioli Patrick Gino Angelo; Hans Vissers; Eric Deutsch; cl...@br...; dav...@ag...; rb...@be...; psi...@li...
Cc: Rolf Apweiler; Ruedi Aebersold
Subject: [Psidev-ms-dev] mzML 0.93 ready for first review

Hi everyone, after considerable hard work from many people, we have a prerelease of mzML (the union of mzData and mzXML) available for comment by you, a major stakeholder in mzML. You may download a kit of material to examine at:

http://db.systemsbiology.net/projects/PSI/mzML/mzML_beta1R1.zip

The general mzML development page is at:

http://psidev.info/index.php?q=node/257

Please send feedback to: psi...@li...

We ask that you respond by August 20.

Additional releases with more information may be provided during the coming month.

The current format has been guided by these principles:
- Keep the format simple
- Minimize alternate ways of encoding the same information
- Allow some flexibility for encoding new important information
- Support the features of mzData and mzXML but not a lot more
- But do provide clear support for SRM data
- Finish the format soon with the resources available

There are many enhancements that have been suggested, but the small group of volunteers that have actively developed this format have opted to focus on the primary goal set before us: develop a single format that the vendors and current software can easily support and thereby obsolete mzData and mzXML. The enhancements not considered compatible with this goal will be entertained for mzML 2.0

We are committed to providing not just the format, but also a set of working implementations, converters and readers, as well as a format validator, all to ensure that mzML is a format that will be adopted quickly and implemented uniformly. Prior to submission to the PSI document process, the following software will implement mzML:
- 2 or more converters from vendor formats to mzML
- the popular reader library RAMP that currently supports mzData and mzXML
- an mzML semantic validator that checks for correct implementation

We hope to follow this schedule:
2007-08-02 Release of mzML beta1R1 to major stakeholders for comment
2007-08-20 Comments from major stakeholders received
2007-09-01 Revised mzML 1.0 submitted to PSI document process, beginning 30 days internal review
2007-10-01 Revised mzML 1.01 begins 60 days community review
2007-10-06 Formal announcement that feedback is sought at HUPO world congress
2007-12-01 Formal 60 days community review closes
2008-01-01 Revised mzML 1.02 officially released

Thank you for your help! Feel free to forward this message to someone whom you think should review the format at this stage.

Regards,
Eric
|
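The one-run-per-file assumption described in the message above can be sketched as a guard in a reader. This is only an illustration: the element names (`runList`, `run`) and the two sample documents are simplified assumptions, not the actual mzML schema, and `read_single_run` is a hypothetical helper rather than part of RAMP or JRAP.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified mzML-like documents. Element and attribute names
# are illustrative assumptions, not the real schema.
SINGLE_RUN = "<mzML><runList count='1'><run id='R1'/></runList></mzML>"
MULTI_RUN = (
    "<mzML><runList count='2'><run id='R1'/><run id='R2'/></runList></mzML>"
)


def read_single_run(xml_text):
    """Return the id of the only run, or raise if the file holds several.

    Mirrors a reader API that assumes one run per file: anything other
    than exactly one run is rejected up front instead of being misread.
    """
    root = ET.fromstring(xml_text)
    runs = root.findall("./runList/run")
    if len(runs) != 1:
        raise ValueError(f"expected exactly one run, found {len(runs)}")
    return runs[0].get("id")


if __name__ == "__main__":
    print(read_single_run(SINGLE_RUN))
    try:
        read_single_run(MULTI_RUN)
    except ValueError as err:
        print("rejected:", err)
```

A reader written this way handles the count == 1 case and fails loudly on anything else, which is the behaviour the message predicts for most existing software.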
From: Eric D. <ede...@sy...> - 2007-08-02 08:02:22
|
Hi everyone, after considerable hard work from many people, we have a prerelease of mzML (the union of mzData and mzXML) available for comment by you, a major stakeholder in mzML. You may download a kit of material to examine at:

http://db.systemsbiology.net/projects/PSI/mzML/mzML_beta1R1.zip

The general mzML development page is at:

http://psidev.info/index.php?q=node/257

Please send feedback to: psi...@li...

We ask that you respond by August 20.

Additional releases with more information may be provided during the coming month.

The current format has been guided by these principles:
- Keep the format simple
- Minimize alternate ways of encoding the same information
- Allow some flexibility for encoding new important information
- Support the features of mzData and mzXML but not a lot more
- But do provide clear support for SRM data
- Finish the format soon with the resources available

There are many enhancements that have been suggested, but the small group of volunteers that have actively developed this format have opted to focus on the primary goal set before us: develop a single format that the vendors and current software can easily support and thereby obsolete mzData and mzXML. The enhancements not considered compatible with this goal will be entertained for mzML 2.0

We are committed to providing not just the format, but also a set of working implementations, converters and readers, as well as a format validator, all to ensure that mzML is a format that will be adopted quickly and implemented uniformly. Prior to submission to the PSI document process, the following software will implement mzML:
- 2 or more converters from vendor formats to mzML
- the popular reader library RAMP that currently supports mzData and mzXML
- an mzML semantic validator that checks for correct implementation

We hope to follow this schedule:
2007-08-02 Release of mzML beta1R1 to major stakeholders for comment
2007-08-20 Comments from major stakeholders received
2007-09-01 Revised mzML 1.0 submitted to PSI document process, beginning 30 days internal review
2007-10-01 Revised mzML 1.01 begins 60 days community review
2007-10-06 Formal announcement that feedback is sought at HUPO world congress
2007-12-01 Formal 60 days community review closes
2008-01-01 Revised mzML 1.02 officially released

Thank you for your help! Feel free to forward this message to someone whom you think should review the format at this stage.

Regards,
Eric
|
From: Alison M. <al...@d-...> - 2007-07-29 13:17:36
|
Dear All, PSI, HUPO and other work such as yours is of fundamental importance, hence my email to you and your list. We covered PSI in an earlier study, as an example of best practice and the huge challenges you face. We are conducting a policy study for the European Commission into scientific digital repositories, needs, opportunities, and a supporting e-infrastructure. This study will inform EC policy, including for FP7 funding. We would particularly value your input! There is a short consultation, quick to do, sadly available only for a short time at: http://ec.europa.eu/yourvoice/ipm/forms/dispatch?form=eSciDR If you have any questions or would like to send comments, please email me any time. With kind regards Alison ----------------------------------------- Alison Macdonald, Director The Digital Archiving Consultancy Limited 2 Wayside Court, Twickenham, TW1 2BQ, UK Tel: +44-208 607 9102 Registered in England, number 4900634 www.d-archiving.com www.e-scidr.eu |
From: Eric D. <ede...@sy...> - 2007-06-26 15:02:27
|
Hi everyone, sorry to cancel at the last minute, but there will not be a conference call today. Good progress is being made at EBI. Results from the workshop will be forthcoming.

Regards,
Eric

________________________________
From: Eric Deutsch
Sent: Tuesday, June 12, 2007 9:39 AM
To: 'psi...@li...'
Cc: Eric Deutsch; 'Trish Whetzel'; 'Puneet Souda'; 'Juan Antonio Vizcaíno González'
Subject: RE: PSI-MS WG ccall Tue 9am PDT, 12n EDT, 4pm GMT

Hi everyone, here are the notes from today's telecom. I would like to have another call in a week to discuss the possible MRM implementation and other items as well. I think it would also be useful to hold another call the following week while several of us are at EBI. So the schedule for upcoming calls is:

June 19: 12n EDT
June 26: 12n EDT

Please mark your calendars. In the meantime, email discussion on the MRM schema would be helpful!

Notes from today's con:

Present: Juan Antonio, Eric, Pete

- Web site
Eric: The web site has been updated with the very latest files. Please examine. Feedback is welcome.

- schema
Eric: I added an example for encoding MRM data. The only change to the schema was adding a spectrumType attribute which would distinguish MS1, MSn, MRMScan, and possibly Chromatogram. Otherwise no change needed for this implementation.
There is still a list of minor mods to address, hopefully by next week.
There is a meeting at EBI in two weeks. The hope is to finish the schema then and disseminate to the community as a beta test version.

- MRM support
Eric: Comments on the MRM implementation? Anyone?

- CV
Eric: Latest OBO file is 1.7.4. Plans to work on this?
Pete is ready to work on this. He will add terms on the list and also write better definitions for some.
Eric will add link to term tracker on web site. DONE.
Pete: Individual vendors should submit their new terms to the term tracker. Invite them after next update.

- Validator
Lennart and Luisa will work on validator

Action Items:
- Pete has a list of things to update in the CV and will make the updates.
- Eric will send the new proposed terms for this latest rev
- When Pete has finished doing any specific instrument editing, invite all vendors to check the list and suggest changes/additions via term tracker.

Thanks,
Eric

________________________________
From: Eric Deutsch
Sent: Tuesday, June 12, 2007 12:05 AM
To: 'psi...@li...'
Cc: 'Trish Whetzel'; 'Puneet Souda'; 'Juan Antonio Vizcaíno González'; Eric Deutsch
Subject: RE: PSI-MS WG ccall Tue 9am PDT, 12n EDT, 4pm GMT

Hi everyone, it would be nice if everyone interested in MRM data would have a peek at the proposed MRM implementation before the conference call in 9 hours.

The high-level summary is that data are encoded something like this (less relevant things removed and data arrays truncated to simplify the view) [also attached as a txt file in case mailers wrap the xml too badly]:

    <spectrum id="S19" scanNumber="19" spectrumType="MRMscan">
      <spectrumHeader>
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="Time" value="5.890500">
          <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="Minutes" value=""/>
        </cvParam>
      </spectrumHeader>
      <binaryData precision="64" compressionType="none" length="3" encodedLength="50">
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999920" name="DataArrayContentType" value="Q1MassToChargeRatioArray"/>
        <binary>AAAAwDsGeUAAAEEnEAAAADAOg6cQA==</binary>
      </binaryData>
      <binaryData precision="64" compressionType="none" length="3" encodedLength="50">
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999921" name="DataArrayContentType" value="Q3MassToChargeRatioArray"/>
        <binary>AAAAAAGJdeUAAAABA/nAAADAOg6cQA==</binary>
      </binaryData>
      <binaryData precision="64" compressionType="none" length="3" encodedLength="50">
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999921" name="DataArrayContentType" value="IntensityArray"/>
        <binary>AAAAAIBJxkAAAAAAAIrXQAAAAAMysQA==</binary>
      </binaryData>
    </spectrum>

    <spectrum id="S20" scanNumber="20" spectrumType="MSn">
      <spectrumHeader>
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="TimeInMinutes" value="5.990230"/>
        <precursorList count="1">
          <precursor spectrumRef="19">
            <ionSelection>
              <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999920" name="TransitionArrayElementNumber" value="2"/>
              <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000041" name="ChargeState" value="2"/>
            </ionSelection>
          </precursor>
        </precursorList>
      </spectrumHeader>
      <binaryData precision="64" compressionType="none" length="43" encodedLength="5000">
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999920" name="DataArrayContentType" value="MassToChargeRatioArray"/>
        <binary>AAAAAgKCYgEA=</binary>
      </binaryData>
      <binaryData precision="64" compressionType="none" length="43" encodedLength="5000">
        <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999921" name="DataArrayContentType" value="IntensityArray"/>
        <binary>AAAAAAD+AAyeQAAAAAAAWIVAAAAAAABgnUA=</binary>
      </binaryData>
    </spectrum>

(excerpted from the complete: http://db.systemsbiology.net/projects/PSI/mzML/tiny2_MRM.mzML0.91.xml )

There are clearly some known and obvious little issues that need to be fixed like with the format of the CV terms, but does this capture what we need to write out? This just handles the transition measurements and not chromatograms. Although I don't see why we couldn't allow chromatograms with DataArrayContentTypes of "TimeArray" and "IntensityArray" with some CV param annotations of the appropriate q1 and q3 values as well.

Opinions and discussion on this would be good.

Thanks,
Eric

_____________________________________________
From: Eric Deutsch
Sent: Sunday, June 10, 2007 11:40 PM
To: 'psi...@li...'
Cc: 'Trish Whetzel'; 'Puneet Souda'; 'Juan Antonio Vizcaíno González'; Eric Deutsch
Subject: PSI-MS WG ccall Tue 9am PDT, 12n EDT, 4pm GMT

Hi everyone, this is a reminder of the conference call for the PSI MS working group Tuesday, 9am PDT, 12n EDT, 5pm London, 4pm GMT. Call information is:

http://www.timeanddate.com/worldclock/fixedtime.html?day=12&month=6&year=2007&hour=17&min=0&sec=0&p1=136

phone numbers are:
+ Germany: 08001012079
+ Switzerland: 0800000860
+ UK: 08081095644
+ USA: 1-866-314-3683
+ Generic international: +44 2083222500 (UK number)
access code: 297427

Agenda items are:

- Status and plans for:
  - Web site
  - schema
  - MRM support
  - CV
  - validator

The mzML development web page has been updated with the latest files. You can view them at:

http://psidev.info/index.php?q=node/257

The ABI/Sciex folks have described in a document how they would encode MRM data, which seems to be consistent with previous thoughts. I have updated the file tiny2_MRM.mzML0.91.xml to reflect a possible simple way in which MRM data might be encoded (a toy example, not containing real data).

If you are unable to attend but wish to pass on a comment, please email it to me or the list.

Regards,
Eric
|
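For readers unfamiliar with how a `<binary>` element like the ones in the example above relates to numbers, here is a minimal sketch of encoding and decoding a data array as base64 text over 64-bit IEEE-754 values. The little-endian byte order and the helper names `encode_array` and `decode_array` are assumptions made for illustration; the toy strings in the message itself are not real data and are not meant to decode.

```python
import base64
import struct


def encode_array(values):
    """Pack floats as little-endian 64-bit IEEE-754 and base64-encode them.

    Little-endian is an assumption here; a real file would have to state
    its byte order somewhere.
    """
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, length):
    """Decode base64 text back into `length` 64-bit floats."""
    raw = base64.b64decode(text)
    if len(raw) != 8 * length:
        raise ValueError(
            f"expected {8 * length} bytes for {length} values, got {len(raw)}"
        )
    return list(struct.unpack(f"<{length}d", raw))


if __name__ == "__main__":
    q1 = [400.25, 500.5, 600.75]
    encoded = encode_array(q1)
    print(encoded, len(encoded))
    print(decode_array(encoded, len(q1)))
```

With precision 64, three values occupy 24 bytes, which base64 turns into 32 characters, so the `length` and encoded-length attributes can be cross-checked against each other when reading.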
From: Randy J. <rkj...@in...> - 2007-06-23 22:09:49
|
The goal of suggesting a ZIP file was to keep multiple files together - your suggestion of using XOP would both keep 'files' together, and allow seeking directly. I think compression will be a non-issue with a binary representation of the raw data, and XOP is a standard which is likely to be supported with multiple language bindings.

Does anyone have experience using XOP? Could we get an example file to begin thinking about what this would look like?

Randy

-----Original Message-----
From: Christopher Mason [mailto:Mas...@ma...]
Sent: Thursday, June 21, 2007 3:00 PM
To: psi...@li...; Randy Julian
Subject: Re: Indexes, binary files, multiple files and compression in, mzML

Hello.

From: "Randy Julian" <rkj...@in...>
> Many of us have now had experience with very large data sets
> represented in XML and I think it would be useful to
> seriously consider an alternative representation to the
> embedded base64 data vectors.

I agree that it would be really nice to address the binary/XML issue. Note that combining XML and binary is not a new problem [1].

It's really painful to parse mzXML efficiently (in other words to retrieve a single scan from a ~gigabyte sized document with ~10,000 scans). You can't use a standard XML parser, because most don't support arbitrarily seeking in the file to take advantage of the index (the exception being Xerces, which has its own issues), and you have to write additional code to deal with the base-64/compressed data anyway. Anyone who regularly deals with Orbi/FT data knows how impossible it is to just use plain XML; it's a nice idea, but it just isn't feasible/efficient with today's instruments.

We spent a bunch of time trying to write a simple parser for mzXML and ended by giving up and using RAMP. This is fine, but makes it difficult for others to use languages/tools that can't use RAMP/JRAMP. There also may be licensing issues.

> My recommendation is that we consider creating a purely binary
> representation of the data vectors which can be referenced by a purely
> XML document describing the experiment.

Zip files would work but you would have to write the individual scans as separate entries in the zip file as there's no way to seek in a file inside a zipfile. I'm not sure what the efficiency of this would be, but it can't be worse than the existing gzipped base64 scans in mzXML. However, you couldn't use standard XML parsing utils to read the metadata without first extracting the XML file from the zip file.

Why not use an existing standard like XOP [2] or something similar which combines, in a single file, an initial, text-only XML metadata chunk, followed by a binary blob for the actual data. The metadata contains references or addresses that index into the binary chunk, relative to the start of the binary chunk. Normal XML tools can parse the XML bits and ignore the binary bits. This would remove the need for a separate index because the metadata would be small enough to parse completely into memory, and would contain references to the much larger binary data.

You could easily create such a file by writing two files and then combining them, or by padding. One key point would be to use UTF-8 or similar encoding with a BOM [3] so that legacy FTP clients would correctly transfer them.

As another idea: you could formulate the references in such a way that it was possible to support both a two file/side-by-side scheme and a single file, one after the other scheme.

I'd be happy to help with this; maybe doing some proof of concept work...

Thanks for bringing this up,

-c

PS- I've done some benchmarking work to study storing compressed spectra in sqlite databases that might be of interest to others.

[1] http://www.xml.com/pub/a/2003/02/26/binaryxml.html
[2] http://www.w3.org/TR/xop10/
[3] http://unicode.org/faq/utf_bom.html#22
|
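The "XML metadata chunk followed by a binary chunk, with offsets relative to the start of the binary chunk" layout suggested above can be sketched roughly as follows. This is not XOP itself (XOP packages its parts differently); the 4-byte length prefix, the `scan` element with `offset` and `count` attributes, and the function names are all assumptions invented for this illustration.

```python
import io
import struct
import xml.etree.ElementTree as ET


def write_container(scans):
    """Build one byte string: metadata length, XML metadata, binary chunk.

    `scans` maps a scan id to a list of floats. Each scan's offset is
    recorded relative to the start of the binary chunk. The 4-byte
    big-endian length prefix is an assumption made to keep the sketch short.
    """
    blob = bytearray()
    root = ET.Element("scans")
    for scan_id, values in scans.items():
        ET.SubElement(
            root, "scan",
            id=scan_id, offset=str(len(blob)), count=str(len(values)),
        )
        blob += struct.pack(f"<{len(values)}d", *values)
    meta = ET.tostring(root, encoding="utf-8")
    return struct.pack(">I", len(meta)) + meta + bytes(blob)


def read_scan(stream, scan_id):
    """Parse only the small metadata chunk, then seek straight to one scan."""
    (meta_len,) = struct.unpack(">I", stream.read(4))
    root = ET.fromstring(stream.read(meta_len))
    binary_start = 4 + meta_len
    for scan in root.iter("scan"):
        if scan.get("id") == scan_id:
            count = int(scan.get("count"))
            stream.seek(binary_start + int(scan.get("offset")))
            return list(struct.unpack(f"<{count}d", stream.read(8 * count)))
    raise KeyError(scan_id)


if __name__ == "__main__":
    data = write_container({"S19": [1.0, 2.0, 3.0], "S20": [4.5]})
    print(read_scan(io.BytesIO(data), "S20"))
```

Because the references are relative to the start of the binary chunk, the same metadata would also work if the binary chunk were kept in a separate side-by-side file, which is the two-scheme flexibility mentioned in the message.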
From: Randy J. <rkj...@in...> - 2007-06-23 16:47:46
|
Mike and Angel,

I agree with Angel - the goal is balance between interchange use and operational use. Both have performance concerns and both have development concerns. As mzData moved from a specification to a standard in the minds of the vendors supporting the format, the investment has risen and there are some other basic needs that have to be met with mzML if the commercial support is to continue. The specification has to be stable and easy for vendors to support. The use of cvParam/userParam to allow vendors to do what they need to in describing the instrument is one area where we have made things better than previous specifications.

An area where we need to improve is in the reduction of redundancy or complexity via too many choices. The basic agreement in the community is to find one way to represent information rather than leaving large numbers of choices which increase development and support cost and reduce interoperability. On the X!Tandem site there is already a comment expressing frustration with the 'multiple ways' mzData could represent parent ion charge state.

Early in the development we eliminated all alternatives to base64 binary, partly to improve size, partly to eliminate variations. The raw data representation is key to this specification and I agree that there should be one way to represent raw data in the specification.

Randy

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Angel Pizarro
Sent: Thursday, June 21, 2007 4:55 PM
To: Mike Coleman
Cc: psi...@li...
Subject: Re: [Psidev-ms-dev] Indexes, binary files,multiple files and compression in mzML

Hi Mike,

To answer your question in the shortest answer I can give is that we have been at this for a very long time and the current proposal is looking to strike a balance between opposing goals.

In short, The files would be waaaayyy too big if we took the plain text approach. This has been proposed in the past and decided against. SPC/ISB and other open source/academic shops needs the format to be somewhat usable as their native format, since developers in the field are also using the format to rapid development of tools. More complex format = longer development cycle.

I think a suitable reference implementation can provide the best of both worlds. Currently for most folks this role is being fulfilled by mzXML/RAMP.

-angel

On 6/21/07, Mike Coleman <tu...@gm...> wrote:

Interesting!

So, one could say that one point of discord regarding mzML is whether its purpose is primarily to provide a simple/clear/parseable representation of the data, suitable for communicating information and as a foundation for other tasks, or whether its purpose is to provide concrete, disk-based data structure that programs can use to (efficiently) access the data.

If one thinks that the purpose is the former, one may wonder why indices would be included, or for that matter, why anything is being rendered in a binary form at all; or, if the latter, why a more programmatically accessible format is not being chosen.

Just as a thought experiment, what would happen if we had two specs? The former spec would be absolutely pure, untortured XML, stripped of all redundant information. (Think "<peak mz="1234.5" intensity="123456">".) Parsing would be completely trivial with existing XML libraries or tools. Compression, if needed, could be done orthogonally and trivially with existing standard tools.

The latter spec would be a rendering of the data onto an existing on-disk format--say sqlite, for example. This would mostly amount to writing the schema. It would be optimized for performance, so that, for example, accessing a random spectrum would be quick. Ideally there would be no BLOBs *within* the database--how much bigger would the database grow if peak data were stored as a list or table of literal values?

The idea would be that each spec would be free to cover its purpose, rather than having a single spec trying to cover both at once (perhaps poorly).

Mike

On 6/21/07, Angel Pizarro <an...@ma...> wrote:
> I would not be opposed to such a move, but I have to say that really this is
> an operational issue in my opinion.
[...]

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004
|
From: Angel P. <an...@ma...> - 2007-06-21 20:54:45
|
Hi Mike, To answer your question in the shortest answer I can give is that we have been at this for a very long time and the current proposal is looking to strike a balance between opposing goals. In short, The files would be waaaayyy too big if we took the plain text approach. This has been proposed in the past and decided against. SPC/ISB and other open source/academic shops needs the format to be somewhat usable as their native format, since developers in the field are also using the format to rapid development of tools. More complex format = longer development cycle. I think a suitable reference implementation can provide the best of both worlds. Currently for most folks this role is being fulfilled by mzXML/RAMP. -angel On 6/21/07, Mike Coleman <tu...@gm...> wrote: > > Interesting! > > So, one could say that one point of discord regarding mzML is whether > its purpose is primarily to provide a simple/clear/parseable > representation of the data, suitable for communicating information and > as a foundation for other tasks, or whether its purpose is to provide > concrete, disk-based data structure that programs can use to > (efficiently) access the data. > > If one thinks that the purpose is the former, one may wonder why > indices would be included, or for that matter, why anything is being > rendered in a binary form at all; or, if the latter, why a more > programmatically accessible format is not being chosen. > > Just as a thought experiment, what would happen if we had two specs? > The former spec would be absolutely pure, untortured XML, stripped of > all redundant information. (Think "<peak mz="1234.5" > intensity="123456">".) Parsing would be completely trivial with > existing XML libraries or tools. Compression, if needed, could be > done orthogonally and trivially with existing standard tools. > > The latter spec would be a rendering of the data onto an existing > on-disk format--say sqlite, for example. 
This would mostly amount to > writing the schema. It would be optimized for performance, so that, > for example, accessing a random spectrum would be quick. Ideally > there would be no BLOBs *within* the database--how much bigger would > the database grow if peak data were stored as a list or table of > literal values? > > The idea would be that each spec would be free to cover its purpose, > rather than having a single spec trying to cover both at once (perhaps > poorly). > > Mike > > > On 6/21/07, Angel Pizarro <an...@ma...> wrote: > > I would not be opposed to such a move, but I have to say that really > this is > > an operational issue in my opinion. > [...] > -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 |
From: Mike C. <tu...@gm...> - 2007-06-21 20:35:53
|
Interesting! So, one could say that one point of discord regarding mzML is whether its purpose is primarily to provide a simple/clear/parseable representation of the data, suitable for communicating information and as a foundation for other tasks, or whether its purpose is to provide concrete, disk-based data structure that programs can use to (efficiently) access the data. If one thinks that the purpose is the former, one may wonder why indices would be included, or for that matter, why anything is being rendered in a binary form at all; or, if the latter, why a more programmatically accessible format is not being chosen. Just as a thought experiment, what would happen if we had two specs? The former spec would be absolutely pure, untortured XML, stripped of all redundant information. (Think "<peak mz="1234.5" intensity="123456">".) Parsing would be completely trivial with existing XML libraries or tools. Compression, if needed, could be done orthogonally and trivially with existing standard tools. The latter spec would be a rendering of the data onto an existing on-disk format--say sqlite, for example. This would mostly amount to writing the schema. It would be optimized for performance, so that, for example, accessing a random spectrum would be quick. Ideally there would be no BLOBs *within* the database--how much bigger would the database grow if peak data were stored as a list or table of literal values? The idea would be that each spec would be free to cover its purpose, rather than having a single spec trying to cover both at once (perhaps poorly). Mike On 6/21/07, Angel Pizarro <an...@ma...> wrote: > I would not be opposed to such a move, but I have to say that really this is > an operational issue in my opinion. [...] |
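The second half of the thought experiment above, peak data stored as a table of literal values with no BLOBs, might look something like the following sketch. The table and column names (`spectrum`, `peak`, `scan_number`) and the helper functions are hypothetical and chosen only for illustration; nothing here is a proposed schema.

```python
import sqlite3


def build_db(spectra):
    """Create an in-memory database with one row per peak, no BLOBs.

    `spectra` maps a scan number to a list of (mz, intensity) pairs.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spectrum ("
        "id INTEGER PRIMARY KEY, scan_number INTEGER UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE peak ("
        "spectrum_id INTEGER REFERENCES spectrum(id), mz REAL, intensity REAL)"
    )
    # The index is what makes fetching a random spectrum quick.
    conn.execute("CREATE INDEX peak_by_spectrum ON peak (spectrum_id)")
    for scan_number, peaks in spectra.items():
        cur = conn.execute(
            "INSERT INTO spectrum (scan_number) VALUES (?)", (scan_number,)
        )
        conn.executemany(
            "INSERT INTO peak VALUES (?, ?, ?)",
            [(cur.lastrowid, mz, intensity) for mz, intensity in peaks],
        )
    conn.commit()
    return conn


def peaks_for_scan(conn, scan_number):
    """Return the (mz, intensity) pairs of one scan, ordered by m/z."""
    rows = conn.execute(
        "SELECT mz, intensity FROM peak "
        "JOIN spectrum ON spectrum.id = peak.spectrum_id "
        "WHERE scan_number = ? ORDER BY mz",
        (scan_number,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_db({19: [(1234.5, 123456.0)], 20: [(500.0, 10.0), (400.0, 5.0)]})
    print(peaks_for_scan(db, 20))
```

How much larger such a database would be than one storing encoded BLOBs is exactly the open question raised in the message; the sketch does not answer it, but building both variants from the same input file would be a straightforward way to measure.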
From: Angel P. <an...@ma...> - 2007-06-21 19:02:46
|
I would not be opposed to such a move, but I have to say that really this is an operational issue in my opinion. Huge amounts of microarray data get shipped around without any sort of binary encoding or compression. The result of this, tho, has been that no one uses MAGE-ML for anything other than long-term storage and almost all algorithms use the native formats (usually text based), Excel, or some other text formats of just the data points. If we want mzML to fulfill an operational role, then this is something to consider greatly, along with a reference implementation.

Also I did a bit of research a while ago into binary formats and just spending a few minutes looking today, what I found still holds true: cross platform binary formats are basically limited to two flavors and a couple of specific options:
1) embedded/small databases -- Berkeley DB, HSQL, sqlite
2) binary data formats -- basically HDF5 or netCDF

HDF5, netCDF, HSQL and sqlite all do not have any restrictions other than including the license when re-distributing, but bdb does have a copy-left GPL I think.

As a side note, as an experiment I once took a mzData file and parsed it into a sqlite3 file for random access (spectra were put in as the encoded BLOB) and the file size only increased by less than a meg. A smart implementation would probably reduce the size and increase performance by a lot.

-angel

On 6/21/07, Randy Julian <rkj...@in...> wrote:
> >
> > Another topic on this list and discussed informally at ASMS was the problem of efficiency related to storing binary data directly in mzData and mzXML and therefore presumed to trouble mzML.
> >
> > From the beginning of the mzData effort, the PSI-MS group attempted to remain consistent in its representation of data vectors with the efforts of the ASTM and IUPAC. The result was two separate data vectors stored as base64 encoded IEEE-754 floating point numbers.
> > The use of base64 was derived from the desire to keep to a strict use of XML schema and therefore even the very useful binary index used in mzXML was avoided. Further, the group valued a single file to represent information, so the scheme developed by the X-Ray community which involved two files (one XML, one binary) was not considered.
> >
> > Many of us have now had experience with very large data sets represented in XML and I think it would be useful to seriously consider an alternative representation to the embedded base64 data vectors.
> >
> > Anyone who has attempted to open a medium-to-large size mzData file in XMLSpy can tell you that even on a 2GB machine, the program will not open some of the files we can generate, especially from the TOF and FT instruments. To address the speed of using the file operationally, the current proposal includes binary indexes and in-line compression. These ideas are difficult to manage within XML and people pick sides on what is more important - our main use case remains interchange of data, so those positions have historically won out.
> >
> > We like XML because it is cross-platform and can be validated and somehow we think binary is not as portable because it cannot be viewed in notepad or vi. In fact we are entirely dependent on binary already through the use of the base64 encoded IEEE floating point representations. Because we want to embed these binary BLOBs in the XML, we accept the additional cost of encoding, storing and decoding the BLOB as printable characters because that is what XML Schema suggests.
> >
> > The size of the data vector, once encoded, can be reduced through compression, and because we accept the cross-platform compatibility of gzip, bzip and other compression methods, we can compress the binary and not lose interoperability.
> >
> > I think we should strongly consider using these components, binary, compression and XML in a different way to meet the practical expectations of the community while maintaining strict adherence to W3C specifications.
> >
> > My recommendation is that we consider creating a purely binary representation of the data vectors which can be referenced by a purely XML document describing the experiment. The binary indexes, checksums and other binary ideas can be incorporated without a need to stretch the XML tools beyond where they are effective. If we are clever about the references, we can also address a question which comes up in the repository (ProDac and others) discussion regarding what data items might be interchanged between repositories - if the binary component can be referenced in a distributed environment (via something like an LSID, or some other URI), then the XML component could be interchanged and the binary left for later consumption.
> >
> > The problem of multiple files could be addressed through the use of the ZIP format - it is the standard for package distribution in the Java platform and the 'native' file format for many programs (like MagicDraw, and others). The use of a zip archive to organize the XML and binary components would allow users to still deal with 'only one file', but it would also allow situations where multiple experiments need to be kept together (I am a fan of putting multiple injections in the SciEx WIFF format, for example). The example from the Java world suggests that these archives can also ensure data integrity and could supply a practical encryption mechanism if desired.
> >
> > The format of the mzML schema would remain basically unchanged with the main difference being a reference in place of the BLOB, and the format of the binary file could be designed quite quickly.
> > > > This represents a departure from the goal of remaining consistent with > > the basic designs of other standard efforts, but the improvement in > > practicality could have a very positive impact on acceptance and use > > within the -omics community. > > > > Thoughts? > > > > > > > > Randall K Julian, Jr. Ph.D. > > CEO Indigo BioSystems > > (317) 536-2736 x101 > > (317) 306-5447 mobile > > > > www.indigobio.com > > > > NOTICE: This message may contain confidential or privileged information > > that is for the sole use of the intended recipient. Any unauthorized > > review, use, disclosure, copying or distribution is strictly prohibited. > > > > If you are not the intended recipient, please contact the sender by > > reply e-mail and destroy all copies of the original message > > > > > > ------------------------------------------------------------------------- > > This SF.net email is sponsored by DB2 Express > > Download DB2 Express C - the FREE version of DB2 express and take > > control of your XML. No limits. Just data. Click to get it now. > > http://sourceforge.net/powerbar/db2/ > > _______________________________________________ > > Psidev-ms-dev mailing list > > Psi...@li... > > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > > > > > > -- > Angel Pizarro > Director, Bioinformatics Facility > Institute for Translational Medicine and Therapeutics > University of Pennsylvania > 806 BRB II/III > 421 Curie Blvd. > Philadelphia, PA 19104-6160 > > P: 215-573-3736 > F: 215-573-9004 -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 |
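A minimal sketch of the sqlite experiment described in the message above, assuming one row per scan with the m/z and intensity vectors stored as base64-encoded little-endian 32-bit floats. The table name, column names and helper functions are hypothetical; the original message does not say how the file was actually laid out.

```python
import base64
import sqlite3
import struct


def encode_vector(values):
    """Pack floats as little-endian 32-bit IEEE-754, then base64-encode the bytes."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw)


def decode_vector(blob):
    """Reverse encode_vector: base64-decode, then unpack 32-bit floats."""
    raw = base64.b64decode(blob)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def build_store(conn, spectra):
    """spectra: iterable of (scan number, m/z list, intensity list)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spectrum ("
        "scan INTEGER PRIMARY KEY, mz BLOB NOT NULL, intensity BLOB NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO spectrum (scan, mz, intensity) VALUES (?, ?, ?)",
        [(scan, encode_vector(mz), encode_vector(inten)) for scan, mz, inten in spectra],
    )
    conn.commit()


def fetch_spectrum(conn, scan):
    """Random access to one scan through the primary-key index."""
    row = conn.execute(
        "SELECT mz, intensity FROM spectrum WHERE scan = ?", (scan,)
    ).fetchone()
    if row is None:
        raise KeyError(scan)
    return decode_vector(row[0]), decode_vector(row[1])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_store(conn, [(1, [100.0, 200.5], [10.0, 20.0]), (2, [150.25], [5.0])])
    print(fetch_spectrum(conn, 2))
```

Storing the raw packed bytes instead of their base64 form would be one way to get the size reduction hinted at above, since a BLOB column has no need for printable characters.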
From: Christopher M. <Mas...@ma...> - 2007-06-21 19:00:27
|
Hello. From: "Randy Julian" <rkj...@in...> > Many of us have now had experience with very large data sets > represented in XML and I think it would be useful to > seriously consider an alternative representation to the > embedded base64 data vectors. I agree that it would be really nice to address the binary/XML issue. Note that combining XML and binary is not a new problem [1]. It's really painful to parse mzXML efficiently (in other words, to retrieve a single scan from a ~gigabyte sized document with ~10,000 scans). You can't use a standard XML parser, because most don't support arbitrarily seeking in the file to take advantage of the index (the exception being Xerces, which has its own issues), and you have to write additional code to deal with the base-64/compressed data anyway. Anyone who regularly deals with Orbi/FT data knows how impossible it is to just use plain XML; it's a nice idea, but it just isn't feasible/efficient with today's instruments. We spent a bunch of time trying to write a simple parser for mzXML and ended up giving up and using RAMP. This is fine, but makes it difficult for others to use languages/tools that can't use RAMP/JRAMP. There may also be licensing issues. > My recommendation is that we consider creating a purely binary > representation of the data vectors which can be referenced by a purely > XML document describing the experiment. Zip files would work but you would have to write the individual scans as separate entries in the zip file as there's no way to seek in a file inside a zipfile. I'm not sure what the efficiency of this would be, but it can't be worse than the existing gzipped base64 scans in mzXML. However, you couldn't use standard XML parsing utils to read the metadata without first extracting the XML file from the zip file. Why not use an existing standard like XOP [2] or something similar which combines, in a single file, an initial, text-only XML metadata chunk, followed by a binary blob for the actual data? 
The metadata contains references or addresses that index into the binary chunk, relative to the start of the binary chunk. Normal XML tools can parse the XML bits and ignore the binary bits. This would remove the need for a separate index because the metadata would be small enough to parse completely into memory, and would contain references to the much larger binary data. You could easily create such a file by writing two files and then combining them, or by padding. One key point would be to use UTF-8 or similar encoding with a BOM [3] so that legacy FTP clients would correctly transfer them. As another idea: you could formulate the references in such a way that it was possible to support both a two file/side-by-side scheme and a single file, one after the other scheme. I'd be happy to help with this; maybe doing some proof of concept work... Thanks for bringing this up, -c PS- I've done some benchmarking work to study storing compressed spectra in sqlite databases that might be of interest to others. [1] http://www.xml.com/pub/a/2003/02/26/binaryxml.html [2] http://www.w3.org/TR/xop10/ [3] http://unicode.org/faq/utf_bom.html#22 |
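A rough sketch of the single-file layout suggested above: a text-only XML metadata chunk followed by a binary chunk, with offsets stored relative to the start of the binary chunk. This is not XOP itself (which packages parts as MIME multipart); the `run` and `spectrum` element names, the `offset`/`length` attributes and the end-tag marker are all assumptions made for illustration.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Hypothetical convention: the binary chunk starts right after the root end tag.
END_TAG = b"</run>\n"


def write_combined(stream, scans):
    """scans: dict of scan number -> list of float intensities (at least one scan)."""
    root = ET.Element("run")
    chunk = bytearray()
    for scan, values in sorted(scans.items()):
        raw = struct.pack(f"<{len(values)}f", *values)
        # Offsets are relative to the start of the binary chunk, not the file.
        ET.SubElement(root, "spectrum", scan=str(scan),
                      offset=str(len(chunk)), length=str(len(raw)))
        chunk += raw
    stream.write(ET.tostring(root, encoding="utf-8") + b"\n")
    stream.write(chunk)


def read_scan(stream, scan, probe=4096):
    """Parse only the metadata, then seek straight to the requested vector."""
    head = b""
    while END_TAG not in head:
        block = stream.read(probe)
        if not block:
            raise ValueError("no metadata end tag found")
        head += block
    split = head.index(END_TAG) + len(END_TAG)  # absolute start of binary chunk
    root = ET.fromstring(head[:split])
    for spectrum in root.iter("spectrum"):
        if int(spectrum.get("scan")) == scan:
            stream.seek(split + int(spectrum.get("offset")))
            raw = stream.read(int(spectrum.get("length")))
            return list(struct.unpack(f"<{len(raw) // 4}f", raw))
    raise KeyError(scan)


if __name__ == "__main__":
    buf = io.BytesIO()
    write_combined(buf, {1: [1.0, 2.0], 2: [3.5]})
    buf.seek(0)
    print(read_scan(buf, 2))
```

Because the offsets are relative, the same metadata could point into a side-by-side binary file instead, which is the two-scheme idea raised at the end of the message.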
From: Angel P. <an...@ma...> - 2007-06-21 18:42:36
|
don't know enough about the experiment type to give randy a straight answer, so I'll leave that to the-folks-in-the-know. One far out idea for "saving space" would be to define the transitions as a set of binary arrays, then reference that as the parent spectra for any scans that use those transitions.... But boy is that an absolute bastardization of the schema. -angel On 6/21/07, Eric Deutsch <ede...@sy...> wrote: > > Hi Randy, one can have an array for each. In fact, based on the phone > con earlier in the week, I was intending to add a DwellTimeArray. > Essentially, one can have as many arrays as needed for parameters for the > list of transitions, as long as the definition of the array is contained in > the CV. > > > > Another space saving feature that was discussed was not having to repeat > those arrays over and over when they are the same. Unfortunately, the > problem with space-saving tricks is that they can make writing and reading > more complicated. > > > > I have yet to study the detailed emails you sent out in the last two days. > I will get to them and they will be considered carefully next week if not > before. > > > > Thanks, > > Eric > > > > > ------------------------------ > > *From:* Randy Julian [mailto:rkj...@in...] > *Sent:* Thursday, June 21, 2007 11:08 AM > *To:* Angel Pizarro > *Cc:* psi...@li...; Pierre-Alain Binz; Eric > Deutsch; Sean L Seymour; Duchoslav, Eva; Shofstahl, Jim > *Subject:* RE: [Psidev-ms-dev] Representing MRMs in mzML > > > > Angel, > > > > How would three binary arrays get to all the details (dwell time, > collision energies, polarity, etc.)? > > > > Randy > > > > Randall K Julian, Jr. Ph.D. > CEO Indigo BioSystems > (317) 536-2736 x101 > (317) 306-5447 mobile > > www.indigobio.com > > NOTICE: This message may contain confidential or privileged information > that is for the sole use of the intended recipient. Any unauthorized > review, use, disclosure, copying or distribution is strictly prohibited. 
If > you are not the intended recipient, please contact the sender by reply > e-mail and destroy all copies of the original message > > > > > ------------------------------ > > *From:* del...@gm... [mailto:del...@gm...] *On Behalf Of *Angel > Pizarro > *Sent:* Thursday, June 21, 2007 1:02 PM > *To:* Randy Julian > *Cc:* psi...@li...; Pierre-Alain Binz; Eric > Deutsch; Sean L Seymour; Duchoslav, Eva; Shofstahl, Jim > *Subject:* Re: [Psidev-ms-dev] Representing MRMs in mzML > > Interesting proposition. What's the space trade-off for this approach vs. > repeating the transitions as three binary arrays (Precurs. Ion, Product > Ion, Intensity) and assuming a positional relation between arrays for each > scan (as we currently do with mzData data arrays)? > > I would think that the 3 array approach would use less space (than what is > currently in the example) and be easier to produce and interpret. The only > way I see to reduce the space would be to further group the transitions into > sets and only reference that one annotation from the spectrum, but this adds > yet another level of cv traversal for parsing and complexity in instance > production, which folks seem to frown upon. > > example of two spectra with 100 randomly generated 32-bit double floats > representing the transitions attached. Just the spectra though, not the full > annotations. > > -angel > |
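A small sketch of the three-array idea discussed in this exchange, assuming the arrays are related purely by position and each is stored as base64-encoded little-endian 32-bit floats. The array names are hypothetical and would need CV definitions, as noted in the reply above.

```python
import base64
import struct

# Hypothetical array names; extra arrays (e.g. dwell time) would follow the same pattern.
ARRAY_NAMES = ("PrecursorIonArray", "ProductIonArray", "IntensityArray")


def encode(values):
    """Base64 text for a vector of little-endian 32-bit floats."""
    return base64.b64encode(struct.pack(f"<{len(values)}f", *values)).decode("ascii")


def decode(text):
    raw = base64.b64decode(text)
    return struct.unpack(f"<{len(raw) // 4}f", raw)


def pack_scan(transitions):
    """transitions: list of (precursor m/z, product m/z, intensity) tuples."""
    columns = zip(*transitions)
    return {name: encode(column) for name, column in zip(ARRAY_NAMES, columns)}


def unpack_scan(arrays):
    """Rebuild per-transition tuples; element i of every array belongs together."""
    columns = [decode(arrays[name]) for name in ARRAY_NAMES]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("positional arrays must all have the same length")
    return list(zip(*columns))


if __name__ == "__main__":
    scan = pack_scan([(289.5, 97.25, 1200.0), (195.5, 138.125, 640.0)])
    print(unpack_scan(scan))
    # Encoded size of one 100-element array: 400 bytes become 536 base64 characters.
    print(len(encode([0.0] * 100)))
```

The positional scheme keeps each scan compact, but anything that is not a per-transition number (polarity, activation method) still needs a home elsewhere, which is the concern raised in the reply.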
From: Eric D. <ede...@sy...> - 2007-06-21 18:25:44
|
Hi Randy, one can have an array for each. In fact, based on the phone con earlier in the week, I was intending to add a DwellTimeArray. Essentially, one can have as many arrays as needed for parameters for the list of transitions, as long as the definition of the array is contained in the CV. Another space saving feature that was discussed was not having to repeat those arrays over and over when they are the same. Unfortunately, the problem with space-saving tricks is that they can make writing and reading more complicated. I have yet to study the detailed emails you sent out in the last two days. I will get to them and they will be considered carefully next week if not before. Thanks, Eric ________________________________ From: Randy Julian [mailto:rkj...@in...] Sent: Thursday, June 21, 2007 11:08 AM To: Angel Pizarro Cc: psi...@li...; Pierre-Alain Binz; Eric Deutsch; Sean L Seymour; Duchoslav, Eva; Shofstahl, Jim Subject: RE: [Psidev-ms-dev] Representing MRMs in mzML Angel, How would three binary arrays get to all the details (dwell time, collision energies, polarity, etc.)? Randy Randall K Julian, Jr. Ph.D. CEO Indigo BioSystems (317) 536-2736 x101 (317) 306-5447 mobile www.indigobio.com <http://www.indigobio.com/> NOTICE: This message may contain confidential or privileged information that is for the sole use of the intended recipient. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message ________________________________ From: del...@gm... [mailto:del...@gm...] On Behalf Of Angel Pizarro Sent: Thursday, June 21, 2007 1:02 PM To: Randy Julian Cc: psi...@li...; Pierre-Alain Binz; Eric Deutsch; Sean L Seymour; Duchoslav, Eva; Shofstahl, Jim Subject: Re: [Psidev-ms-dev] Representing MRMs in mzML Interesting proposition. 
What's the space trade-off for this approach vs. repeating the transitions as three binary arrays (Precurs. Ion, Product Ion, Intensity) and assuming a positional relation between arrays for each scan (as we currently do with mzData data arrays)? I would think that the 3 array approach would use less space (than what is currently in the example) and be easier to produce and interpret. The only way I see to reduce the space would be to further group the transitions into sets and only reference that one annotation from the spectrum, but this adds yet another level of cv traversal for parsing and complexity in instance production, which folks seem to frown upon. example of two spectra with 100 randomly generated 32-bit double floats representing the transitions attached. Just the spectra though, not the full annotations. -angel |
From: Randy J. <rkj...@in...> - 2007-06-21 18:00:50
|
Angel, How would three binary arrays get to all the details (dwell time, collision energies, polarity, etc.)? Randy Randall K Julian, Jr. Ph.D. CEO Indigo BioSystems (317) 536-2736 x101 (317) 306-5447 mobile www.indigobio.com <http://www.indigobio.com/> NOTICE: This message may contain confidential or privileged information that is for the sole use of the intended recipient. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message ________________________________ From: del...@gm... [mailto:del...@gm...] On Behalf Of Angel Pizarro Sent: Thursday, June 21, 2007 1:02 PM To: Randy Julian Cc: psi...@li...; Pierre-Alain Binz; Eric Deutsch; Sean L Seymour; Duchoslav, Eva; Shofstahl, Jim Subject: Re: [Psidev-ms-dev] Representing MRMs in mzML Interesting proposition. What's the space trade-off for this approach vs. repeating the transitions as three binary arrays (Precurs. Ion, Product Ion, Intensity) and assuming a positional relation between arrays for each scan (as we currently do with mzData data arrays)? I would think that the 3 array approach would use less space (than what is currently in the example) and be easier to produce and interpret. The only way I see to reduce the space would be to further group the transitions into sets and only reference that one annotation from the spectrum, but this adds yet another level of cv traversal for parsing and complexity in instance production, which folks seem to frown upon. example of two spectra with 100 randomly generated 32-bit double floats representing the transitions attached. Just the spectra though, not the full annotations. -angel |
From: Randy J. <rkj...@in...> - 2007-06-21 16:37:02
|
Another topic on this list and discussed informally at ASMS was the problem of efficiency related to storing binary data directly in mzData and mzXML and therefore presumed to trouble mzML. >From the beginning of the mzData effort, the PSI-MS group attempted to remain consistent in it's representation of data vectors with the efforts of the ASTM and IUPAC. The result was two separate data vectors stored as base64 encoded IEEE-754 floating point numbers. The use of base64 was derived from the desired to keep to a strict use of XML schema and therefore even the very useful binary index used in mzXML was avoided. Further, the group valued a single file to represent information, so the scheme developed by the X-Ray community which involved two files (one XML, one binary) was not considered. Many of us have now had experience with very large data sets represented in XML and I think it would be useful to seriously consider an alternative representation to the embedded base64 data vectors. Anyone who has attempted to open a medium-to-large size mzData file in XMLSpy can tell you that even on a 2GB machine, the program will not open some of the files we can generate, especially from the TOF and FT instruments. To address the speed of using the file operationally, the current proposal includes binary indexes and in-line compression. These ideas are difficult to manage within XML and people pick sides on what is more important - our main use case remains interchange of data, so those positions have historically won out. We like XML because it is cross-platform and can be validated and somehow we think binary is not as portable because it cannot be viewed in notepad or vi. In fact we are entirely dependent on binary already through the use of the base64 encoded IEEE floating point representations. 
Because we want to embed these binary BLOBs in the XML, we accept the additional cost of encoding, storing and decoding the BLOB as printable characters because that is what XML Schema suggests. The size the data vector, once encoded, can be reduced through compression, and because we accept the cross-platform compatibility of gzip, bzip and other compression methods, we can compress the binary and not lose interoperability. I think we should strongly consider using these components, binary, compression and XML in a different way to meet the practical expectations of the community while maintaining strict adherence to W3C specifications. My recommendation is that we consider creating a purely binary representation of the data vectors which can be referenced by a purely XML document describing the experiment. The binary indexes, checksums and other binary ideas can be incorporated without a need to stretch the XML tools beyond where they are effective. If we are clever about the references, we can also address a question which comes up in the repository (ProDac and others) discussion regarding what data items might be interchanged between repositories - if the binary component can be referenced in a distributed environment (via something like an LSID, or some other URI), then the XML component could be interchanged and the binary left for later consumption. The problem of multiple files could be addressed through the use of the ZIP format - it is the standard for package distribution in the Java platform and the 'native' file format for many programs (like MagicDraw, and others). The use of a zip archive to organize the XML and binary components would allow users to still deal with 'only one file', but it would also allow situations where multiple experiments need to be kept together (I am a fan of putting multiple injections in the SciEx WIFF format, for example). 
The example from the Java world suggests that these archives can also ensure data integrity and could supply a practical encryption mechanism if desired. The format of the mzML schema would remain basically unchanged with the main difference being a reference in place of the BLOB, and the format of the binary file could be designed quite quickly. This represents a departure from the goal of remaining consistent with the basic designs of other standard efforts, but the improvement in practicality could have a very positive impact on acceptance and use within the -omics community. Thoughts? Randall K Julian, Jr. Ph.D. CEO Indigo BioSystems (317) 536-2736 x101 (317) 306-5447 mobile www.indigobio.com NOTICE: This message may contain confidential or privileged information that is for the sole use of the intended recipient. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message |
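A minimal sketch of the ZIP packaging proposed above, assuming one XML entry for the experiment description and one binary entry per scan so that a single scan can be pulled out without reading the rest. The entry names (`experiment.xml`, `scans/NNNNNN.bin`) and the raw little-endian float32 payload are hypothetical choices, not part of any proposal in this thread.

```python
import io
import struct
import zipfile


def scan_entry(scan):
    """Hypothetical naming scheme for the per-scan binary entries."""
    return f"scans/{scan:06d}.bin"


def write_archive(target, metadata_xml, scans):
    """Store the XML description plus one compressed binary entry per scan."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("experiment.xml", metadata_xml)
        for scan, values in scans.items():
            archive.writestr(scan_entry(scan), struct.pack(f"<{len(values)}f", *values))


def read_scan(source, scan):
    """The central directory locates the entry; only that entry is decompressed."""
    with zipfile.ZipFile(source) as archive:
        raw = archive.read(scan_entry(scan))
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def archive_is_intact(source):
    """testzip() re-reads every entry and checks its stored CRC-32."""
    with zipfile.ZipFile(source) as archive:
        return archive.testzip() is None


if __name__ == "__main__":
    buf = io.BytesIO()
    write_archive(buf, "<mzML/>", {1: [1.0, 2.0], 17500: [3.5]})
    print(read_scan(buf, 17500), archive_is_intact(buf))
```

The per-entry CRC-32 gives a basic integrity check for free; whether thousands of small entries perform acceptably is an open question that would need measuring.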
From: Randy J. <rkj...@in...> - 2007-06-21 15:36:49
|
At ASMS several people talked to me about MRM representation in mzML. Looking at the schema, it appears that there is a way to do this using the current elements - but maybe not the way we originally thought of using them. For some time now I have been encoding MRM experiments in mzData 1.05 by using the supplemental data vector combined with the intensity element to store transitions as chromatograms. This is not documented in the specification, but if you leave the MZ vector empty, fill the intensity array with the Y-axis and then put the time axis in a supplemental data vector, it is pretty easy to parse. In the proposed mzData 1.1, I replaced this hack with an actual chromatogram element, but this too is problematic and did not make the cut for mzML. The problem stems from the fact that each MS/MS transition in an MRM-type measurement is a unique experiment and needs to be described as fully as possible. Even though we usually view (and even store in the native file) a set of transitions as a 'spectrum' they are really histograms with complex annotation on the 'x-axis'. I would like to suggest that we use the parameterGroup to store the details of each transition and then reference these within the binary vector as allowed by the 0.91 schema. It means that there is no x-axis for the 'spectrum', so we will probably want to define a way of recognizing this representation. 
For a quantitation experiment of 5 analytes the paramGroupList might look like this:

<paramGroupList>
  <paramGroup id="MRMSettings">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000036" name="ScanMode" value="Selected Reaction Monitoring"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000044" name="Activation Method" value="CID"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Collision Energy" value="25"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Collision Energy Units" value="eV"/>
  </paramGroup>
  <paramGroup id="Transition_1">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="positive"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="289.5"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="97.2"/>
  </paramGroup>
  <paramGroup id="Transition_2">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="positive"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="287.2"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="97.0"/>
  </paramGroup>
  <paramGroup id="Transition_3">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="positive"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="195.5"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="138.1"/>
  </paramGroup>
  <paramGroup id="Transition_4">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="negative"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="205.0"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="161.0"/>
  </paramGroup>
  <paramGroup id="Transition_5">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000037" name="Polarity" value="negative"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000340" name="Precursor Ion" value="269.0"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000342" name="Product Ion" value="145.2"/>
  </paramGroup>
</paramGroupList>

Different instrument types could store different representations of the MRM settings (including MS^n descriptions using cvParams). For high MRM count experiments (100's or thousands of analytes) you could group parameters to further reduce replication. An MRM acquisition could then look like this:

<spectrum id="1" scanNumber="1">
  <spectrumHeader>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="Time" value="0.0"/>
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:1000038" name="Minutes" value=""/>
    <acquisitionList spectrumType="histogram" methodOfCombination="none" count="1">
      <acquisition acqNumber="1" spectrumRef="" sourceFileRef=""/>
    </acquisitionList>
    <instrumentSettings instrumentRef="TSQ Quantum Ultra">
      <paramGroupRef ref="MRMSettings"/>
    </instrumentSettings>
  </spectrumHeader>
  <binaryData precision="32" compressionType="none" length="3" encodedLength="16">
    <cvParam cvLabel="PSI-MS" accession="PSI-MS:9999920" name="DataArrayContentType" value="histogram"/>
    <paramGroupRef ref="Transition_1"/>
    <paramGroupRef ref="Transition_2"/>
    <paramGroupRef ref="Transition_3"/>
    <binary>mMmXQybwI0QuW01E</binary>
  </binaryData>
</spectrum>

The DataArrayContentType suggested in the 'tiny' examples the group developed could be used to indicate the meaning of the paramGroupRef's and the only change to the schema would be to order the sequence in binaryData so that the cvParam could be before the paramGroupRef (or that the order is not checked - sequences are ordered...). I've attached the small change in the element ordering needed to validate the example.
I've attached the full example file (edited from the examples on the PSI site and validated with the attached schema) which shows time-dependent changes in which transitions are being monitored. To obtain the MRM chromatogram on which to perform peak picking, etc., you would plot the intensity against time for the specific transition (which is what we do in the fixed MRM drug quantitation experiments). Any thoughts on the use of the paramGroupRef in this fashion, or the idea of creating a new data type which is essentially an annotated histogram? Thanks, Randy Randall K Julian, Jr. Ph.D. CEO Indigo BioSystems (317) 536-2736 x101 (317) 306-5447 mobile www.indigobio.com NOTICE: This message may contain confidential or privileged information that is for the sole use of the intended recipient. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message |
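A minimal sketch of how a reader might turn spectra of this shape into per-transition chromatograms, assuming the binary vector holds little-endian IEEE-754 floats and that the n-th value belongs to the n-th paramGroupRef. Byte order is not stated in the example, so little-endian is an assumption here; under it, the example vector decodes to three values of roughly 304, 656 and 821. The function names and the input tuple layout are hypothetical.

```python
import base64
import struct
from collections import defaultdict

FORMATS = {32: "f", 64: "d"}  # precision attribute -> struct format character


def decode_intensities(encoded, precision=32, byte_order="<"):
    """Decode a base64 <binary> payload into a tuple of floats."""
    raw = base64.b64decode(encoded)
    count = len(raw) // (precision // 8)
    return struct.unpack(f"{byte_order}{count}{FORMATS[precision]}", raw)


def build_chromatograms(spectra):
    """spectra: iterable of (time, [paramGroupRef ids], base64 text).

    Returns {transition id: [(time, intensity), ...]} in acquisition order.
    """
    traces = defaultdict(list)
    for time, refs, encoded in spectra:
        values = decode_intensities(encoded)
        if len(values) != len(refs):
            raise ValueError("one intensity is expected per paramGroupRef")
        for ref, value in zip(refs, values):
            traces[ref].append((time, value))
    return dict(traces)


if __name__ == "__main__":
    refs = ["Transition_1", "Transition_2", "Transition_3"]
    spectra = [(0.0, refs, "mMmXQybwI0QuW01E"), (0.5, refs, "mMmXQybwI0QuW01E")]
    for transition, points in build_chromatograms(spectra).items():
        print(transition, points)
```

Because each spectrum carries its own list of references, the set of monitored transitions can change over time, and a trace simply has no points for the scans where its transition was not acquired.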
From: Matthew C. <mat...@va...> - 2007-06-20 21:08:27
|
mzXML has this, but it seems to be relative to itself (at least according to ReAdW, the start time of a run is 0 and the end time is the time between start and stop). Is this a problem because not all runs are continuous? If there does need to be support for non-continuous runs, perhaps the type of run could be specified and a continuous run would have a start time and a duration (or end time), whereas a non-continuous run would have either a start and duration for each spectrum, or just an acquisition time (assumed to be instantaneous or perhaps read from the instrument settings). Without this feature, it's impossible to say further down the analysis pipeline "When was this data acquired?" without looking outside the mzML/mzXML/mzData. Also, I notice that the <dataProcessingList> element does not seem to guarantee an order of processing (i.e. a history), but the <dataProcessing> element does have explicitly ordered <processMethod> elements. -Matt Chambers |
From: Mike C. <tu...@gm...> - 2007-06-20 19:40:21
|
On 6/20/07, Brian Pratt <bri...@in...> wrote: > Funny how this topic flares up every few months. "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'." - Isaac Asimov :-) |
From: Matthew C. <mat...@va...> - 2007-06-20 19:34:24
|
Is there a reason that a "run" could consist of spectra from more than one sourceFile? Is there a precise definition for "run" in the context of mzML? The full spec doc doesn't give a very precise definition. I've always thought of a run as coming from one source, so the run would define which (single) source it was from and then all child spectrum elements would be inherently from that source. -Matt Chambers |
From: Brian P. <bri...@in...> - 2007-06-20 15:28:04
|
Note also that the RAMP library (which will presumably be extended to deal with the new format) automagically deals with missing or broken indexes. You can use its index API and it will just generate an index if needed. Funny how this topic flares up every few months. Patrick's original move back in mzXML to make it an optional wrapper schema was a brilliant compromise - just don't use the index if it offends you. Brian Pratt Insilicos -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch Sent: Tuesday, June 19, 2007 11:07 PM To: Matthew Chambers; Mike Coleman; psi...@li... Cc: Eric Deutsch Subject: Re: [Psidev-ms-dev] Indexing in mzML > From: psi...@li... [mailto:psidev-ms-dev- > > > Hi everyone, thank you for the good discussion, here is what I take away > > from this discussion (colored with my understanding of the prevailing > > opinion on the various topics): > > > > - Separate index/metadata files will be avoided > > - The mzML index will be *optional* as a wrapper schema (with actual > > index at the end of the file) as currently in mzXML > > This works either way for me. It makes it tricky to get to the index, but > it's not any trickier than managing two files, especially if one is > optional. However, unless I am missing something, is it not much harder > to > use a hash to check if the index is valid (i.e. that the main file has not > been altered) if the index is included in the main file? Even if the hash > is written as the last thing, that change alone would cause the next hash > to > be different, would it not? I believe that the file/index integrity checker for mzXML computes the checksum only until the start of the index. So the index is not part of the checksum and the checksum is part of the index, so there's no conflict. It does mean that a checksum computed for the entire file will not match the checksum of the non-index part. 
So this is slightly harder in that one can't use an OS binary to compute the checksum, but rather one needs a custom program that is aware of this arrangement. This already exists for mzXML and will be easy to port for mzML. > > - The validator will enforce that scan numbers are in ascending order, > > but not necessarily without gaps > > - The validator will enforce that scan numbers and identifiers must be > > unique within a run (but there could be multiple runs in a file) > > I'm confused about the difference between identifiers and scan numbers. > Since a mzML file can have more than one spectra source (e.g. multiple RAW > files), scan numbers could only be unique within a run, as you say, but I > would expect that the "SpectrumID" identifier, if it is different from the > scan number, should be unique to the whole file. What is the reasoning You are correct, my error. > behind the SpectrumID identifier being unique only to a run, or am I > misunderstanding? What is the purpose of having a separate SpectrumID > identifier anyway? To allow LSIDs for individual spectra or some other non-integer IDs if desired. > > - Regarding *always* correct indexes, users of mzXML have been using > > indexes for years with no reports of problems that I'm aware. Obviously > > if the file is altered in any way, the index should be regenerated. > > There are (for mzXML) / will be (for mzML) index checkers to make sure > > all is well along with reindexing functionality if the index is bad. > > - It should be a requirement for any reading software that uses the > > index (all readers are required to be tolerant of the presence of the > > wrapper schema index, but are not required to use it) to do some basic > > checking that the result is correct. E.g. if scan number 17500 is > > desired and the index is used to jump to that location, it is > > straightforward and necessary to ensure that the first tag read is > > indeed <spectrum scan_number="17500">. 
If it is not, the software is > > free to do anything except continue as if it didn't know better (e.g., > > stop with error, revert to sequential read, or try to regenerate the > > index and retry). > > This is complicated by multiple sources being in a single mzML file. Will > the index follow the same structure as the main section so that when you > are > looking for a scan number in some source, you first traverse into the > source, and then look for the indexed spectrum to get its offset? > > <source name="someSourceName"> > <spectrum scan="15"> > ... > </spectrum> > </source> > <index> > <indexedSource name="someSourceName" offset="0"> > <indexedSpectrum scan="15" offset="33"> > ... > </indexedSpectrum> > </indexedSource> > </index> This is a very good point that has slipped notice, I think. Thanks for pointing it out, we should think about this more carefully. > > > - While index/data mismatch is a potential source of problem, it has > > been our experience that problems are rare and the benefits huge. > > Agreed. > > > Regards, > Matt Chambers > > > ------------------------------------------------------------------------ - > This SF.net email is sponsored by DB2 Express Download DB2 Express C - > the FREE version of DB2 express and take control of your XML. No > limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ _______________________________________________ Psidev-ms-dev mailing list Psi...@li... https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
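A short sketch of the two index safeguards discussed in this message: checking that an indexed offset really lands on the expected spectrum tag, and hashing only the bytes that precede the index. The `scan_number` attribute follows the example in the text; the function names, the probe size and the default hash algorithm are assumptions.

```python
import hashlib
import io
import re

# Matches a spectrum start tag and captures its scan_number attribute.
SPECTRUM_TAG = re.compile(rb'<spectrum\b[^>]*\bscan_number="(\d+)"')


def read_spectrum_at(stream, offset, scan_number, probe=256):
    """Seek to an indexed offset and refuse to continue if the tag there is wrong."""
    stream.seek(offset)
    head = stream.read(probe)
    match = SPECTRUM_TAG.match(head)
    if match is None or int(match.group(1)) != scan_number:
        # The caller may stop, fall back to a sequential read, or rebuild the index.
        raise ValueError(f"index entry for scan {scan_number} is stale or corrupt")
    return head


def checksum_before_index(stream, index_offset, algorithm="sha1", block=1 << 16):
    """Hash the file from byte 0 up to (not including) the start of the index."""
    digest = hashlib.new(algorithm)
    stream.seek(0)
    remaining = index_offset
    while remaining > 0:
        chunk = stream.read(min(block, remaining))
        if not chunk:
            break
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    body = b'<run><spectrum scan_number="17500">...</spectrum></run>'
    document = body + b"<index>...</index>"
    stream = io.BytesIO(document)
    print(read_spectrum_at(stream, document.index(b"<spectrum"), 17500))
    print(checksum_before_index(stream, len(body)))
```

Since the stored checksum covers only the non-index part, a whole-file hash from a generic OS tool will not match it, which is the trade-off noted above.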
From: Eric D. <ede...@sy...> - 2007-06-20 06:07:13
|
> From: psi...@li... [mailto:psidev-ms-dev-
>
> > Hi everyone, thank you for the good discussion, here is what I take away
> > from this discussion (colored with my understanding of the prevailing
> > opinion on the various topics):
> >
> > - Separate index/metadata files will be avoided
> > - The mzML index will be *optional* as a wrapper schema (with actual
> > index at the end of the file) as currently in mzXML
>
> This works either way for me. It makes it tricky to get to the index, but
> it's not any trickier than managing two files, especially if one is
> optional. However, unless I am missing something, is it not much harder
> to use a hash to check if the index is valid (i.e. that the main file has not
> been altered) if the index is included in the main file? Even if the hash
> is written as the last thing, that change alone would cause the next hash
> to be different, would it not?

I believe that the file/index integrity checker for mzXML computes the checksum only until the start of the index. So the index is not part of the checksum and the checksum is part of the index, so there's no conflict. It does mean that a checksum computed for the entire file will not match the checksum of the non-index part. So this is slightly harder in that one can't use an OS binary to compute the checksum, but rather one needs a custom program that is aware of this arrangement. This already exists for mzXML and will be easy to port for mzML.

> > - The validator will enforce that scan numbers are in ascending order,
> > but not necessarily without gaps
> > - The validator will enforce that scan numbers and identifiers must be
> > unique within a run (but there could be multiple runs in a file)
>
> I'm confused about the difference between identifiers and scan numbers.
> Since a mzML file can have more than one spectra source (e.g. multiple RAW
> files), scan numbers could only be unique within a run, as you say, but I
> would expect that the "SpectrumID" identifier, if it is different from the
> scan number, should be unique to the whole file. What is the reasoning

You are correct, my error.

> behind the SpectrumID identifier being unique only to a run, or am I
> misunderstanding? What is the purpose of having a separate SpectrumID
> identifier anyway?

To allow LSIDs for individual spectra or some other non-integer IDs if desired.

> > - Regarding *always* correct indexes, users of mzXML have been using
> > indexes for years with no reports of problems that I'm aware. Obviously
> > if the file is altered in any way, the index should be regenerated.
> > There are (for mzXML) / will be (for mzML) index checkers to make sure
> > all is well along with reindexing functionality if the index is bad.
> > - It should be a requirement for any reading software that uses the
> > index (all readers are required to be tolerant of the presence of the
> > wrapper schema index, but are not required to use it) to do some basic
> > checking that the result is correct. E.g. if scan number 17500 is
> > desired and the index is used to jump to that location, it is
> > straightforward and necessary to ensure that the first tag read is
> > indeed <spectrum scan_number="17500">. If it is not, the software is
> > free to do anything except continue as if it didn't know better (e.g.,
> > stop with error, revert to sequential read, or try to regenerate the
> > index and retry).
>
> This is complicated by multiple sources being in a single mzML file. Will
> the index follow the same structure as the main section so that when you
> are looking for a scan number in some source, you first traverse into the
> source, and then look for the indexed spectrum to get its offset?
>
> <source name="someSourceName">
>   <spectrum scan="15">
>     ...
>   </spectrum>
> </source>
> <index>
>   <indexedSource name="someSourceName" offset="0">
>     <indexedSpectrum scan="15" offset="33">
>       ...
>     </indexedSpectrum>
>   </indexedSource>
> </index>

This is a very good point that has slipped notice, I think. Thanks for pointing it out, we should think about this more carefully.

> > - While index/data mismatch is a potential source of problem, it has
> > been our experience that problems are rare and the benefits huge.
>
> Agreed.
>
> Regards,
> Matt Chambers
|
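For the checksum arrangement described in the reply above (hash everything up to the start of the index, so the index can carry the checksum without invalidating it), a minimal sketch might look like the following. It is not the existing mzXML checker. SHA-1 is picked here purely for illustration, the caller is assumed to already know the byte offset where the index starts, and `checksum_before_index`, `body` and `trailer` are hypothetical names with placeholder content.

```python
import hashlib
import io


def checksum_before_index(stream, index_offset, chunk_size=65536):
    """Hash only the first index_offset bytes of a binary stream."""
    digest = hashlib.sha1()  # algorithm chosen for illustration only
    remaining = index_offset
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # stream ended before the index offset was reached
        digest.update(chunk)
        remaining -= len(chunk)
    if remaining > 0:
        raise ValueError("stream is shorter than the index offset")
    return digest.hexdigest()


# Placeholder document: data section followed by an index section.
body = b'<mzML><run><spectrum scan_number="1"/></run></mzML>\n'
trailer = b"<index>placeholder index with embedded checksum</index>\n"

partial = checksum_before_index(io.BytesIO(body + trailer), len(body))
whole = hashlib.sha1(body + trailer).hexdigest()

# The partial digest ignores the index, so rewriting the index (including
# the checksum stored in it) does not change it, while a whole-file digest
# computed by a generic tool will differ.
print(partial == hashlib.sha1(body).hexdigest(), partial == whole)
```

This is also why a stock OS checksum utility cannot be pointed at the file directly: it has no notion of where the index begins.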