From: Angel P. <an...@ma...> - 2007-08-08 13:00:35
On 8/7/07, Brian Pratt <bri...@in...> wrote:
> Hi Angel,
>
> If I understand your question to be about identifying current mismatches between terminology in the schema and the ontology, I'm not sure there are any - but probably only because the schema has so little actual terminology in it.

My question was more of a pragmatic one, about where you would add specificity to the mzML schema. Your selectionWindow example below is a good one, in that the specification of a selectionWindow is probably a range value, and we should have two sub-elements corresponding to the typed cvParam values that define the window (or just a well-defined range sub-element, skipping cvParam altogether).

I don't think your second example is a good one, though, since there are so many permutations of an ionSelection protocol, and more are certainly on the way; that is better handled by an ontology specification. Yes, this does make parsers slightly harder, since now you must pay attention to the incoming ontology, but it is the same amount of work as if everything were in the schema.

mzXML could get away with a tight specification of these complex and changing annotations, since its sole purpose was support of the ISB pipeline. Its open-source status only served to increase the user base, but the schema changes were driven solely by the needs of that pipeline and solely by the community that used it. Trying to build consensus across many different groups has led to the current version of mzML, and the major structure of mzML will not change at this point, so please let's just get to the specifics of going through the schema and identifying where you think an annotation should be promoted to the level of a schema element, and we'll discuss as a group.

-angel

> Consider this example:
>
>   <xs:element name="selectionWindow" maxOccurs="unbounded">
>     <xs:complexType>
>       <xs:sequence>
>         <xs:element name="cvParam" type="dx:CVParamType" minOccurs="2" maxOccurs="unbounded"/>
>       </xs:sequence>
>     </xs:complexType>
>   </xs:element>
>
> which says absolutely nothing at all about what a selectionWindow element can be expected to contain when you encounter it. It just says it will contain at least two "parameters". Not much of an aid to software development.
>
> The schema, if we can call it that, doesn't even specify what some of the most fundamental information about a scan looks like. For example, it specifies that a scan may have a list of precursors, each of which will contain an ionSelection, but stops short of telling you what an ionSelection looks like:
>
>   <xs:element name="ionSelection" type="dx:ParamGroupType">
>     <xs:annotation>
>       <xs:documentation>This captures the type of ion selection being performed, and trigger m/z (or m/z's), neutral loss criteria etc. for tandem-MS or data dependent scans.</xs:documentation>
>     </xs:annotation>
>   </xs:element>
>
> Nearly all the details of nearly all the elements are just unspecified blobs. Normally with an XML format you can expect to at least start your work by running it through something like XMLSpy that will autogenerate a reader and a writer that you can then polish up (to handle, for example, the necessary weirdness of base64+zlib in the peaklists). But with this, you get no kind of a head start at all, since the vast majority of the syntax is hidden behind blobs like dx:CVParamType and dx:ParamGroupType. It's just not a specification.
>
> The statement that led to your question, I think, was just me saying that if we *did* create an actual schema, we'd want its terminology to agree with the ontology wherever possible. But it has to actually contain some terminology, unlike the current schema.
>
> Brian
>
> ------------------------------
> From: del...@gm... [mailto:del...@gm...] On Behalf Of Angel Pizarro
> Sent: Tuesday, August 07, 2007 1:10 PM
> To: Brian Pratt
> Cc: psi...@li...
> Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value
>
> On 8/7/07, Brian Pratt <bri...@in...> wrote:
>> Hey, the horse just twitched: by placing CVparam information in attributes of the elements of a conventionally structured XML schema (ala mzXML) we can make use of the OBO work without adding a lot of unwanted complexity to software systems that aren't really interested in it. An mzML that integrates well with OBO-aware systems is an excellent idea, but an mzML that demands you BE an OBO-aware system seems less likely to achieve widespread adoption.
>
> Can you name specific attributes that you want to have cv terms be the value for that are currently not in the schema?
> -angel

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III, 421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736  F: 215-573-9004
From: Eric D. <ede...@sy...> - 2007-08-08 06:35:22
Thank you all for the lively discussion.

One proposal I once made in Lyon (which was roundly dismissed, I believe) was something like this: instead of

  <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/>

have

  <cvParam cvLabel="MS" parentAccession="MS:1000031" accession="MS:1000554" name="LCQ Deca" value=""/>

Thus the parser can easily be coded to know that any cvParam with a parentAccession="MS:1000031" is going to be an instrument model whether or not it's in the CV. The mzML semantic validator tool would, of course, check all this. The main argument against this was the potential for inconsistency, I seem to recall.

The decision was made to make individual models CV terms to avoid problems like:

  <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ Deca"/>
  <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ DECA"/>
  <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ FT"/>
  <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ-FT"/>
  <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQFT"/>

I would argue that your code snippet below would better look like:

  #define MS_CV_POLARITY_TYPE "MS:1000037"
  if( element.parent == "spectrumDescription" ) {
    for each child {
      if (child.name=="cvParam") then {
        if( cv.isChildOf(child.attrs['accession'], MS_CV_POLARITY_TYPE) )  // if a polarity type
          spectrum.polarity = cv.getName(child.attrs['accession']);
      }
  }

Note that the cvParam name (should that be "positive" or "Positive" or "positive polarity" or "Polarity" or "polarity"?) is not in the code, just MS:1000037, which can be considered final.

This does require a CV class and some methods:

  cv.loadFromFile()
  cv.isChildOf()
  cv.getName()

but this is not really complicated.

Take cover!
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Tuesday, August 07, 2007 1:43 PM
To: psi...@li...
Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value

As long as the name/value paradigm is used, the loop doesn't get much more complicated than:

  if( element.parent == "spectrumDescription" ) {
    for each child {
      if (child.name=="cvParam") then {
        if( child.attrs['name'] == "Polarity" )
          spectrum.polarity = child.attrs['value'];
      }
  }

But if you have to do:

  if( element.parent == "spectrumDescription" ) {
    for each child {
      if (child.name=="cvParam") then {
        if( child.attrs['name'] == "Positive" )
          spectrum.polarity = "positive";
        else if( child.attrs['name'] == "Negative" )
          spectrum.polarity = "negative";
      }
  }

...parsers will be painful to write and adoption will suffer because of it, I think. Not to mention the fact that the idea of adding these things that should really be values as "terms" in the vocabulary is indeed not future-proof. In the future, there might be another IS_A relationship for "LCQ Deca" so that merely by seeing LCQ Deca you won't know that you're looking at an instrument model parameter. Of course, the accession number would tell you uniquely, but then you'll have two accession numbers in the vocabulary with the name "LCQ Deca." Yuck!

I think values for terms should be given a special relationship in the CV; they shouldn't be given an "IS_A" relationship and expect the parser to look up the implication of that relationship every time a value-as-term is encountered.

-Matt

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt
Sent: Tuesday, August 07, 2007 3:00 PM
To: psi...@li...
Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value

Upon reflection, I realize that this is, for me, actually a new objection to mzML. My original problem with the reliance on CV/OBO is that an XML parser for it looks something like this:

  for each element {
    if (element.name=="cvParam") then {
      a whole bunch of handrolled logic to pick this apart
    } else {
      there isn't much else
    }
  }

That's not really an XML parser, therefore I conclude that mzML isn't really XML. But I have previously beaten that horse to death.

Now we have something new not to like: it's impossible to write a parser that's even remotely future-proof. Or maybe it's not new, and I just missed it before. Either way, this all looks increasingly ill conceived to me. Sorry to be such a downer.

Hey, the horse just twitched: by placing CVparam information in attributes of the elements of a conventionally structured XML schema (ala mzXML) we can make use of the OBO work without adding a lot of unwanted complexity to software systems that aren't really interested in it. An mzML that integrates well with OBO-aware systems is an excellent idea, but an mzML that demands you BE an OBO-aware system seems less likely to achieve widespread adoption.

I do understand the desire to maintain an ontology instead of an ontology and an XML schema, but I'm not sure we can really get away with it. By having a schema that offloads most of its work to an external ontology, we're just pushing the work that having a proper schema saves onto the folks creating the readers and writers, making their job much more complicated than it ought to be - you can't autogenerate a parser or serializer without a fully realized schema. I think we risk them deciding that mzXML and mzData aren't really all that broken after all.

Brian
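For readers following along, the CV helper Eric sketches above (loadFromFile / isChildOf / getName) can be quite small. The snippet below is only an illustration in Python, not part of any mzML tooling; the OBO file name and term IDs are assumptions, and only plain id:/name:/is_a: lines of [Term] stanzas are handled:

```python
class CV:
    def __init__(self):
        self.names = {}      # accession -> term name, e.g. "MS:1000554" -> "LCQ Deca"
        self.parents = {}    # accession -> set of is_a parent accessions

    def load_from_file(self, path):
        current = None
        for line in open(path):
            line = line.strip()
            if line == "[Term]":
                current = None
            elif line.startswith("id: "):
                current = line[4:]
                self.parents.setdefault(current, set())
            elif current and line.startswith("name: "):
                self.names[current] = line[6:]
            elif current and line.startswith("is_a: "):
                # "is_a: MS:1000125 ! thermo finnigan" -> "MS:1000125"
                self.parents[current].add(line[6:].split("!")[0].strip())

    def get_name(self, accession):
        return self.names.get(accession, accession)

    def is_child_of(self, accession, ancestor):
        # walk the is_a ancestry; also true when accession == ancestor
        seen, stack = set(), [accession]
        while stack:
            term = stack.pop()
            if term == ancestor:
                return True
            if term not in seen:
                seen.add(term)
                stack.extend(self.parents.get(term, ()))
        return False

# usage sketch, assuming a local copy of the PSI-MS OBO file:
# cv = CV(); cv.load_from_file("psi-ms.obo")
# if cv.is_child_of("MS:1000554", "MS:1000031"):        # LCQ Deca under instrument model
#     print("instrument model:", cv.get_name("MS:1000554"))
```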
From: Brian P. <bri...@in...> - 2007-08-07 21:20:28
Hi Angel, If I understand your question to be about identifying current mismatches between terminology in the schema and the ontology, I'm not sure there are any - but probably only because the schema has so little actual terminology in it. Consider this example: <xs:element name="selectionWindow" maxOccurs="unbounded"> <xs:complexType> <xs:sequence> <xs:element name="cvParam" type="dx:CVParamType" minOccurs="2" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> </xs:element> which says absolutely nothing at all about what a selectionWindow element can be expected to contain when you encounter it. It just says it will contain at least two "parameters". Not much of an aid to software development. The schema, if we can call it that, doesn't even specify what some of the most fundamental information about a scan looks like. For example, it specifies that a scan may have a list of precursors, each of which will contain an ionSelection, but stops short of telling you what an ionSelection looks like: <xs:element name="ionSelection" type="dx:ParamGroupType"> <xs:annotation> <xs:documentation>This captures the type of ion selection being performed, and trigger m/z (or m/z's), neutral loss criteria etc. for tandem-MS or data dependent scans.</xs:documentation> </xs:annotation> </xs:element> Nearly all the details of nearly all the elements are just unspecified blobs. Normally with an XML format you can expect to at least start your work by running it through something like XMLSpy that will autogenerate a reader and a writer that you can then polish up (to handle, for example, the necessary weirdness of base64+zlib in the peaklists). But with this, you get no kind of a head start at all, since the vast majority of the syntax is hidden behind blobs like dx:CVParamType and dx:ParamGroupType. It's just not a specification. The statement that led to your question, I think, was just me saying that if we *did* create an actual schema, we'd want its terminology to agree with the ontology where ever possible. But it has to actually contain some terminology, unlike the current schema. Brian _____ From: del...@gm... [mailto:del...@gm...] On Behalf Of Angel Pizarro Sent: Tuesday, August 07, 2007 1:10 PM To: Brian Pratt Cc: psi...@li... Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value On 8/7/07, Brian Pratt <bri...@in...> wrote: Hey, the horse just twitched: by placing CVparam information in attributes of the elements of a conventionally structured XML schema (ala mzXML) we can make use of the OBO work without adding a lot of unwanted complexity to software systems that aren't really interested in it. An mzML that integrates well with OBO-aware systems is an excellent idea, but an mzML that demands you BE an OBO-aware system seems less likely to achieve widespread adoption. Can you name specific attributes that you want to have cv terms be the value for that are currently not in the schema? -angel |
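As an aside, the "necessary weirdness of base64+zlib in the peaklists" that Brian mentions boils down to a short round trip. The sketch below assumes little-endian 64-bit floats and zlib compression; a real reader would take precision and compression from the spectrum's cvParams rather than hard-coding them:

```python
import base64, struct, zlib

def decode_peaks(b64_text, compressed=True, precision=64):
    """Return a tuple of floats from an encoded binary data array."""
    raw = base64.b64decode(b64_text)
    if compressed:
        raw = zlib.decompress(raw)
    code = "d" if precision == 64 else "f"               # 64- or 32-bit floats
    return struct.unpack("<%d%s" % (len(raw) // (precision // 8), code), raw)

def encode_peaks(values, precision=64):
    code = "d" if precision == 64 else "f"
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

# round trip check:
# peaks = (445.12, 445.35, 446.01)
# assert decode_peaks(encode_peaks(peaks)) == peaks
```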
From: Matthew C. <mat...@va...> - 2007-08-07 20:44:18
As long as the name/value paradigm is used, the loop doesn't get much more complicated than: if( element.parent == "spectrumDescription" ) { for each child { if (child.name=="cvParam") then { if( child.attrs['name'] == "Polarity" ) spectrum.polarity = child.attrs['value']; } } But if you have to do: if( element.parent == "spectrumDescription" ) { for each child { if (child.name=="cvParam") then { if( child.attrs['name'] == "Positive" ) spectrum.polarity = "positive"; else if( child.attrs['name'] == "Negative" ) spectrum.polarity = "negative"; } } ...parsers will be painful to write and adoption will suffer because of it I think. Not to mention the fact that the idea of adding these things that should really be values as "terms" in the vocabulary is indeed not future-proof. In the future, there might be another IS_A relationship for "LCQ Deca" so that merely by seeing LCQ Deca you won't know that you're looking at an instrument model parameter. Of course, the accession number would tell you uniquely, but then you'll have two accession numbers in the vocabulary with the name "LCQ Deca." Yuck! I think values for terms should be given a special relationship in the CV, they shouldn't be given an "IS_A" relationship and expect the parser to look up the implication of that relationship every time a value-as-term is encountered. -Matt _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt Sent: Tuesday, August 07, 2007 3:00 PM To: psi...@li... Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value Upon reflection, I realize that this is, for me, actually a new objection to mzML. My original problem with the reliance on CV/OBO is that an XML parser for it looks something like this: for each element { if (element.name=="cvParam") then { a whole bunch of handrolled logic to pick this apart } else { there isn't much else } } That's not really an XML parser, therefore I conclude that mzML isn't really XML. But I have previously beaten that horse to death. Now we have something new not to like: it's impossible to write a parser that's even remotely future-proof. Or maybe it's not new, and I just missed it before. Either way, this all looks increasingly ill conceived to me. Sorry to be such a downer. Hey, the horse just twitched: by placing CVparam information in attributes of the elements of a conventionally structured XML schema (ala mzXML) we can make use of the OBO work without adding a lot of unwanted complexity to software systems that aren't really interested in it. An mzML that integrates well with OBO-aware systems is an excellent idea, but an mzML that demands you BE an OBO-aware system seems less likely to achieve widespread adoption. I do understand the desire to maintain an ontology instead of an ontology and an XML schema, but I'm not sure we can really get away with it. By having a schema that offloads most of its work to an external ontology, we're just pushing the work that having a proper schema saves onto the folks creating the readers and writers, making their job much more complicated that it ought to be - you can't autogenerate a parser or serializer without a fully realized schema. I think we risk them deciding that mzXML and mzData aren't really all that broken after all. Brian |
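A runnable rendering of the two loops Matthew contrasts, for anyone who wants to experiment; the element and attribute names simply follow his pseudocode and the mzData/mzML snippets quoted later in the thread, not any final schema:

```python
import xml.etree.ElementTree as ET

MZDATA_STYLE = """<spectrumDescription>
  <cvParam cvLabel="psi" accession="PSI:1000037" name="Polarity" value="positive"/>
</spectrumDescription>"""

MZML_STYLE = """<spectrumDescription>
  <cvParam cvLabel="MS" accession="MS:1000130" name="Positive Scan" value=""/>
</spectrumDescription>"""

def polarity_from_name_value(xml_text):
    # the name identifies the variable, the value carries the data
    for cv in ET.fromstring(xml_text).iter("cvParam"):
        if cv.get("name") == "Polarity":
            return cv.get("value")

def polarity_from_term_as_value(xml_text):
    # every possible polarity term must be known to the parser up front
    known = {"Positive Scan": "positive", "Negative Scan": "negative"}
    for cv in ET.fromstring(xml_text).iter("cvParam"):
        if cv.get("name") in known:
            return known[cv.get("name")]

print(polarity_from_name_value(MZDATA_STYLE))    # -> positive
print(polarity_from_term_as_value(MZML_STYLE))   # -> positive
```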
From: Angel P. <an...@ma...> - 2007-08-07 20:10:15
On 8/7/07, Brian Pratt <bri...@in...> wrote: > > > Hey, the horse just twitched: by placing CVparam information in > attributes of the elements of a conventionally structured XML schema (ala > mzXML) we can make use of the OBO work without adding a lot of unwanted > complexity to software systems that aren't really interested in it. An > mzML that integrates well with OBO-aware systems is an excellent idea, but > an mzML that demands you BE an OBO-aware system seems less likely to achieve > widespread adoption. > Can you name specific attributes that you want to have cv terms be the value for that are currently not in the schema? -angel |
From: Brian P. <bri...@in...> - 2007-08-07 20:00:51
Upon reflection, I realize that this is, for me, actually a new objection to mzML. My original problem with the reliance on CV/OBO is that an XML parser for it looks something like this: for each element { if (element.name=="cvParam") then { a whole bunch of handrolled logic to pick this apart } else { there isn't much else } } That's not really an XML parser, therefore I conclude that mzML isn't really XML. But I have previously beaten that horse to death. Now we have something new not to like: it's impossible to write a parser that's even remotely future-proof. Or maybe it's not new, and I just missed it before. Either way, this all looks increasingly ill conceived to me. Sorry to be such a downer. Hey, the horse just twitched: by placing CVparam information in attributes of the elements of a conventionally structured XML schema (ala mzXML) we can make use of the OBO work without adding a lot of unwanted complexity to software systems that aren't really interested in it. An mzML that integrates well with OBO-aware systems is an excellent idea, but an mzML that demands you BE an OBO-aware system seems less likely to achieve widespread adoption. I do understand the desire to maintain an ontology instead of an ontology and an XML schema, but I'm not sure we can really get away with it. By having a schema that offloads most of its work to an external ontology, we're just pushing the work that having a proper schema saves onto the folks creating the readers and writers, making their job much more complicated that it ought to be - you can't autogenerate a parser or serializer without a fully realized schema. I think we risk them deciding that mzXML and mzData aren't really all that broken after all. Brian _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers Sent: Tuesday, August 07, 2007 11:57 AM To: psi...@li... Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value In addition to Mike's and Brian's concerns, I am wondering how "LCQ Deca" is called a "term/concept?" "Instrument model" is the closest relevant term/concept as I understand those words. Is the cvParam not capable of controlling both the name and possible values of its definitions? Also, why are the different instrument models part of the CV anyway? It seems that the CV should support controlling both terms and the values (or instances) of those terms: "LCQ Deca" IS A VALID INSTANCE OF "thermo finnigan" IS A "thermo fisher scientific" IS A "instrument model" I don't really understand the middle two jumps either, i.e. why are they redundant? _____ From: Eric Deutsch [mailto:ede...@sy...] Sent: Tuesday, August 07, 2007 12:13 PM To: Matthew Chambers; psi...@li... Subject: RE: [Psidev-ms-dev] cvParams using name attribute as value Hi Matt, the agree-upon rule here is that the cvParams should always refer to the most detailed concept, and the value attribute should *only* be filled if there is a scalar value associated with the concept that cannot be in the CV itself. So: <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/> <cvParam cvLabel="MS" accession="MS:1000529" name="Instrument Serial Number" value="23433"/> So for the first, the term/concept is "LCQ Deca". For the CV, one can learn that an "LCQ Deca" IS A "instrument model", and so there's no need (and is perhaps a little dangerous) to put "LCQ Deca" as a value of "instrument model". However, "instrument serial number" is the most specific concept in the CV, and thus the actual SN is the value. 
This was discussed at some length and this is the new way of doing things, that will be uniform across all PSI and FuGE implementations. At least, that is my understanding. This does mean that parsers need to be a little smarter and be "CV-aware". The parser/interpreter can no longer assume that there will be a term "instrument model" and look for its value. But rather, the parser/interpreter must now look to see if any of the terms provided are a child of "instrument model" in the CV. Regards, Eric _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers Sent: Tuesday, August 07, 2007 9:40 AM To: psi...@li... Subject: [Psidev-ms-dev] cvParams using name attribute as value I'm a little confused about the parameters which use the accession number as a kind of value instead of the accession number identifying a variable and then using the value attribute to assign the value. I don't understand why: <cvParam cvLabel="MS" accession="MS:1000130" name="Positive Scan" value=""/> (from mzML) Is preferable to: <cvParam cvLabel="psi" accession="PSI:1000037" name="Polarity" value="positive"/> (from mzData) There are other examples of this as well. What's the logic here? -Matt Chambers |
From: Matthew C. <mat...@va...> - 2007-08-07 18:57:21
In addition to Mike's and Brian's concerns, I am wondering how "LCQ Deca" is called a "term/concept?" "Instrument model" is the closest relevant term/concept as I understand those words. Is the cvParam not capable of controlling both the name and possible values of its definitions? Also, why are the different instrument models part of the CV anyway? It seems that the CV should support controlling both terms and the values (or instances) of those terms: "LCQ Deca" IS A VALID INSTANCE OF "thermo finnigan" IS A "thermo fisher scientific" IS A "instrument model" I don't really understand the middle two jumps either, i.e. why are they redundant? _____ From: Eric Deutsch [mailto:ede...@sy...] Sent: Tuesday, August 07, 2007 12:13 PM To: Matthew Chambers; psi...@li... Subject: RE: [Psidev-ms-dev] cvParams using name attribute as value Hi Matt, the agree-upon rule here is that the cvParams should always refer to the most detailed concept, and the value attribute should *only* be filled if there is a scalar value associated with the concept that cannot be in the CV itself. So: <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/> <cvParam cvLabel="MS" accession="MS:1000529" name="Instrument Serial Number" value="23433"/> So for the first, the term/concept is "LCQ Deca". For the CV, one can learn that an "LCQ Deca" IS A "instrument model", and so there's no need (and is perhaps a little dangerous) to put "LCQ Deca" as a value of "instrument model". However, "instrument serial number" is the most specific concept in the CV, and thus the actual SN is the value. This was discussed at some length and this is the new way of doing things, that will be uniform across all PSI and FuGE implementations. At least, that is my understanding. This does mean that parsers need to be a little smarter and be "CV-aware". The parser/interpreter can no longer assume that there will be a term "instrument model" and look for its value. But rather, the parser/interpreter must now look to see if any of the terms provided are a child of "instrument model" in the CV. Regards, Eric _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers Sent: Tuesday, August 07, 2007 9:40 AM To: psi...@li... Subject: [Psidev-ms-dev] cvParams using name attribute as value I'm a little confused about the parameters which use the accession number as a kind of value instead of the accession number identifying a variable and then using the value attribute to assign the value. I don't understand why: <cvParam cvLabel="MS" accession="MS:1000130" name="Positive Scan" value=""/> (from mzML) Is preferable to: <cvParam cvLabel="psi" accession="PSI:1000037" name="Polarity" value="positive"/> (from mzData) There are other examples of this as well. What's the logic here? -Matt Chambers |
From: Brian P. <bri...@in...> - 2007-08-07 18:45:49
Piling on with Mike, here: So the first thing any parser must do is load up the OBO file. In practice, such a software system will need to bundle an OBO in some fashion, in the extremely likely event that the OBO used by the mzML file in question is not present. Don't forget to update your distro each time the OBO gets updated, and make sure that in the event the OBO used by the mzML file IS present, you use that intead. Then, read: <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/> then ask yourself, "whazzat?", and look up: id: MS:1000554 name: LCQ Deca def: "ThermoFinnigan LCQ Deca." [PSI:MS] is_a: MS:1000125 ! thermo finnigan which leads you to: id: MS:1000125 name: thermo finnigan def: "ThermoFinnigan from Thermo Electron Corporation" [PSI:MS] is_a: MS:1000483 ! thermo fisher scientific which leads you to: id: MS:1000483 name: thermo fisher scientific def: "Thermo Fisher Scientific. Also known as Thermo Finnigan corporation." [PSI:MS] related_synonym: "Thermo Scientific" [] is_a: MS:1000031 ! model by vendor which leads you to: id: MS:1000031 name: model by vendor def: "Instrument's model name (everything but the vendor's name) ---Free text ?" [PSI:MS] relationship: part_of MS:1000463 ! instrument description which leads you to: id: MS:1000463 name: instrument description def: "Device which performs a measurement." [PSI:MS] relationship: part_of MS:0000000 ! mzOntology aha! now populate the "instrument description" element in your database. Which is all fine, in its way, until a new instrument "LCQ Spiff-o" comes out and the OBO isn't immediately updated to match, in which case the parser can't even tell that it's an instrument declaration. This is a curiously upside down way to write XML. If I were designing it I'd make the CV stuff an attribute of the instrument info, for anyone that cares to dive into the OBO, but allow the XML to stand alone in the absence of a suitable OBO. I'd make an effort to use the same terminology in the XML element and attribute names as in the OBO just to reduce confusion. I guess what I'm describing is something like mzXML with the addition of CV info as attributes of the existing element types to aid those interested in using OBO to unify data from different sources, without annoying those uninterested in unifying data from different systems. But, some of you will recall that the use of the CV stuff in lieu of proper XML (in the sense that you have no real hope of making full sense of mzML without access to an external file) is a longstanding crank of mine, and I don't really expect to change it this late in the game. - Brian _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch Sent: Tuesday, August 07, 2007 10:13 AM To: Matthew Chambers; psi...@li... Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value Hi Matt, the agree-upon rule here is that the cvParams should always refer to the most detailed concept, and the value attribute should *only* be filled if there is a scalar value associated with the concept that cannot be in the CV itself. So: <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/> <cvParam cvLabel="MS" accession="MS:1000529" name="Instrument Serial Number" value="23433"/> So for the first, the term/concept is "LCQ Deca". For the CV, one can learn that an "LCQ Deca" IS A "instrument model", and so there's no need (and is perhaps a little dangerous) to put "LCQ Deca" as a value of "instrument model". 
However, "instrument serial number" is the most specific concept in the CV, and thus the actual SN is the value. This was discussed at some length and this is the new way of doing things, that will be uniform across all PSI and FuGE implementations. At least, that is my understanding. This does mean that parsers need to be a little smarter and be "CV-aware". The parser/interpreter can no longer assume that there will be a term "instrument model" and look for its value. But rather, the parser/interpreter must now look to see if any of the terms provided are a child of "instrument model" in the CV. Regards, Eric _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers Sent: Tuesday, August 07, 2007 9:40 AM To: psi...@li... Subject: [Psidev-ms-dev] cvParams using name attribute as value I'm a little confused about the parameters which use the accession number as a kind of value instead of the accession number identifying a variable and then using the value attribute to assign the value. I don't understand why: <cvParam cvLabel="MS" accession="MS:1000130" name="Positive Scan" value=""/> (from mzML) Is preferable to: <cvParam cvLabel="psi" accession="PSI:1000037" name="Polarity" value="positive"/> (from mzData) There are other examples of this as well. What's the logic here? -Matt Chambers |
From: Mike C. <tu...@gm...> - 2007-08-07 18:06:11
On 8/7/07, Eric Deutsch <ede...@sy...> wrote: > <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/> > > <cvParam cvLabel="MS" accession="MS:1000529" name="Instrument Serial Number" > value="23433"/> > > > So for the first, the term/concept is "LCQ Deca". For the CV, one can learn > that an "LCQ Deca" IS A "instrument model", and so there's no need (and is > perhaps a little dangerous) to put "LCQ Deca" as a value of "instrument > model". > > > However, "instrument serial number" is the most specific concept in the CV, > and thus the actual SN is the value. > > > This was discussed at some length and this is the new way of doing things, > that will be uniform across all PSI and FuGE implementations. At least, that > is my understanding. This does mean that parsers need to be a little smarter > and be "CV-aware". The parser/interpreter can no longer assume that there > will be a term "instrument model" and look for its value. But rather, the > parser/interpreter must now look to see if any of the terms provided are a > child of "instrument model" in the CV. Actually, the parser really should not only check whether the term provided *is* a child in the current CV, but also whether it ever *will be* in a future version of the CV. Unfortunately, the technology required to make such a check is not yet available. :-) I'm not very familiar with how CV is supposed to work, but from this example it appears that the namespaces for different kinds of things have been merged together, and that there is an assumption that there will be no collisions. And that anything that doesn't currently have a name basically doesn't exist. In the example given of writing a parser, the task of extracting the name of the instrument, given just the mzML file, is changed from being trivial to being essentially impossible. The mzML file becomes meaningless in itself, and only has meaning relative to a particular version of the CV, which the parser must have access to. Am I misunderstanding something? Mike |
From: Eric D. <ede...@sy...> - 2007-08-07 17:12:46
Hi Matt, the agree-upon rule here is that the cvParams should always refer to the most detailed concept, and the value attribute should *only* be filled if there is a scalar value associated with the concept that cannot be in the CV itself. So:

  <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/>
  <cvParam cvLabel="MS" accession="MS:1000529" name="Instrument Serial Number" value="23433"/>

So for the first, the term/concept is "LCQ Deca". For the CV, one can learn that an "LCQ Deca" IS A "instrument model", and so there's no need (and is perhaps a little dangerous) to put "LCQ Deca" as a value of "instrument model".

However, "instrument serial number" is the most specific concept in the CV, and thus the actual SN is the value.

This was discussed at some length and this is the new way of doing things, that will be uniform across all PSI and FuGE implementations. At least, that is my understanding. This does mean that parsers need to be a little smarter and be "CV-aware". The parser/interpreter can no longer assume that there will be a term "instrument model" and look for its value. But rather, the parser/interpreter must now look to see if any of the terms provided are a child of "instrument model" in the CV.

Regards,
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Tuesday, August 07, 2007 9:40 AM
To: psi...@li...
Subject: [Psidev-ms-dev] cvParams using name attribute as value

I'm a little confused about the parameters which use the accession number as a kind of value instead of the accession number identifying a variable and then using the value attribute to assign the value. I don't understand why:

  <cvParam cvLabel="MS" accession="MS:1000130" name="Positive Scan" value=""/>  (from mzML)

is preferable to:

  <cvParam cvLabel="psi" accession="PSI:1000037" name="Polarity" value="positive"/>  (from mzData)

There are other examples of this as well. What's the logic here?

-Matt Chambers
From: Matthew C. <mat...@va...> - 2007-08-07 16:40:18
I'm a little confused about the parameters which use the accession number as a kind of value instead of the accession number identifying a variable and then using the value attribute to assign the value. I don't understand why: <cvParam cvLabel="MS" accession="MS:1000130" name="Positive Scan" value=""/> (from mzML) Is preferable to: <cvParam cvLabel="psi" accession="PSI:1000037" name="Polarity" value="positive"/> (from mzData) There are other examples of this as well. What's the logic here? -Matt Chambers |
From: Randy J. <rkj...@in...> - 2007-08-06 04:14:21
There are many reasons why people might want to put multiple runs in a single file - and mostly it is the MRM-type experiments where this makes sense. In reality, it is just a convenience mechanism for dealing with the large number of files created by high-throughput experiments. I probably reminded the group about this use case, but adopting the schema to do this can be thought of as one of many possible optimizations.

Personally, I like the tarball (zip, jar, etc.) approach, because it allows us to use standard approaches for managing the multiple files. There was a thread earlier that reminded us that while you can stream from zip files, you cannot easily parse XML from inside a zip file - so maybe we have to think through this.

My vote today would be to leave the "multiple run" use case out for now, since broad adoption is helped by the existence of APIs so much (as was mentioned earlier in this thread).

Randy

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
Sent: Friday, August 03, 2007 3:00 AM
To: psi...@li...
Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review

> From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
>
> What's wrong with the schema supporting multiple runs per file and letting implementers gradually add support for it? There are many features of mzML that will require substantial rewrites of the existing parser APIs. Parameter groups, multiple runs, multiple precursors, and compressed binary data are all major "completely predictable trouble spots." As long as the file readers develop faster than the file writers, there won't be a problem. ;) I very much doubt that writers (e.g. ReAdW) will be writing multiple instrument files into one mzML file any time soon (unless somebody is itching to do this without saying so?). The parameter groups and multiple precursors are more problematic, IMO, but still good improvements.

If I recall correctly, the feature of multiple runs crept in at the end of the Seattle meeting last fall. Can anyone articulate a compelling use case for multiple runs per file? I seem to recall a scenario where at least one vendor encodes multiple runs in a single (wiff??) file, but I don't know about any of that for sure. Anyone have such a case?

> I have a few comments:
>
> - There seems to be a timestamp on the run element now (maybe I just missed it before), of type xs:dateTime. It's an optional attribute and it has an ambiguous meaning. Why isn't this expanded into a start and stop timestamp for the run? Also, why is it optional?

I believe the intent was that the timestamp is the UT at the start of the run. We should clarify this. Is it useful to encode the stop timestamp? As for why optional, we imagined that in the real world this value might not be known properly. Imagine a scenario where someone is converting a legacy mzXML file to mzML. This information may not be available, sadly. It is certainly encouraged that modern converters/writers include it.

> - Most every cvParam has a "cvLabel" attribute that is "MS" but the accession attribute of each cvParam seems to include the cvLabel in it ("MS:xxxxxxxx"). If that is just a coincidence, I think it should be changed so that it is required and the cvLabel can be eliminated. If it's not a coincidence, why is it like that? If the parser needs to know which vocabulary an accession number is from, it can parse until the colon delimiter. Alternatively, keeping cvLabel and getting rid of the "MS:" in the accession attribute would allow somewhat more efficient parsing. In the alternative case, I suggest a required default cvLabel somewhere in the header, similar to setting the default XML namespace.

cvLabel is really just an id to indicate which CV (as more completely defined within <cvList> above) the term comes from. It seems to be current best practice that life science CV accession numbers begin with an OBO namespace, :, and a number. But not all CVs will necessarily follow this convention as far as I'm aware.

> - I see a TODO item is giving the binaryDataArray's "dataType" attribute a CV entry. I agree with this. But I think the values should be more machine-oriented, like "float32", "float64", "int32", "uint64", etc.

You mean "float32" is preferred over "32-bit float"?

> - Parameter groups are good, especially since the spectrum headers seem to have ballooned to be more flexible. Anything that makes the file-dominating spectrum elements smaller and faster to parse is nice - indexing the shared parameters is a good way to do this.
>
> - I'd still like to see a clear definition of "run" relative to "sample" and "source file." Seems like these three are all tightly coupled.

For LC-MSn ion trap data, this is relatively straightforward and is what is depicted in the examples. A run is a series of scans, usually counted consecutively by the instrument, obtained as a sample is injected into the instrument. A sample is the biomaterial that is injected into an instrument over a run. The source file is the one or more files from which the mzML was generated. It will usually be a single vendor-format raw file. It could be an mzXML file. It could (unfortunately) be a series of dta files.

For MALDI or gel spot processing, however, this might be quite different. We had previously entertained "analyte identifier" or "MALDI spot identifier" CV terms to allow annotating each spot. This might make the run-based sample undefined. I think LC-MSn heavily colored our thinking during development. It would be extremely nice to have a detailed example of data where individual scans refer to different "analyte identifiers" or the like. Would someone contribute this?

Regards,
Eric

> -Matt Chambers
>
> Vanderbilt MSRC
>
> ________________________________
>
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt
> Sent: Thursday, August 02, 2007 2:47 PM
> To: psi...@li...
> Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review
>
> (Note: I know I'm late to the party with this comment, but I think it's important)
>
> I noticed this in the todo file:
>
> " - Now that we're allowing multiple runs in a file, how will the index look to handle this?"
>
> Better question: what will software that uses such an index look like?
>
> Answer: it won't look much like anything that currently reads mzXML and mzData - including X!Tandem or anything using RAMP (TPP and others) or JRAP (CPAS and others). These programs easily deal with both mzData and mzXML in their various versions by using APIs which, as it happens, assume one file per run and one run per file. Breaking this one to one correspondence in mzML means you can't just slide mzML support in behind the API, and of course also violates a fundamental assumption which flows through the code that calls these APIs, right out to the user interface in most cases. This means extensive surgery to any program that wants to read mzML properly, and my guess is that means mzML is DOA. At a minimum it becomes a completely predictable trouble spot since you can now write legal mzML files that the majority of mzML readers will simply not know how to handle. They'll be OK with RunList::count == 1, but no more - so, why set ourselves up for trouble?
>
> Multiple runs per file are probably useful in some cases, but if the stated goal of mzML is to replace mzXML and mzData then I think this feature is actually scope creep which threatens the mission and should be dropped. Let those who really want this feature come up with a wrapper schema, but don't call it mzML lest you force the vast majority of mzML consuming software to be broken from the start.
>
> - Brian
>
> ________________________________
>
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
> Sent: Thursday, August 02, 2007 1:02 AM
> To: len...@eb...; Jimmy Eng; lu...@eb...; Puneet Souda; Joshua Tasman; Pierre-Alain Binz; Henning Hermjakob; Randy Julian; Andy Jones; David Creasy; Sean L Seymour; Angel Pizarro; David Fenyo; Jam...@wa...; Mike Coleman; Matthew Chambers; Helen Jenkins; Philip Jones; Shofstahl, Jim; Brian Pratt; Andreas Römpp; Kent Laursen; Martin Eisenacher; Fredrik Levander; Jayson Falkner; Pedrioli Patrick Gino Angelo; Hans Vissers; Eric Deutsch; cl...@br...; dav...@ag...; rb...@be...; psidev-ms-de...@li...
> Cc: Rolf Apweiler; Ruedi Aebersold
> Subject: [Psidev-ms-dev] mzML 0.93 ready for first review
>
> Hi everyone, after considerable hard work from many people, we have a prerelease of mzML (the union of mzData and mzXML) available for comment by you, a major stakeholder in mzML.
>
> You may download a kit of material to examine at:
>
> http://db.systemsbiology.net/projects/PSI/mzML/mzML_beta1R1.zip
>
> The general mzML development page is at:
>
> http://psidev.info/index.php?q=node/257
>
> Please send feedback to:
>
> psi...@li...
>
> We ask that you respond by August 20.
>
> Additional releases with more information may be provided during the coming month.
>
> The current format has been guided by these principles:
>
> - Keep the format simple
> - Minimize alternate ways of encoding the same information
> - Allow some flexibility for encoding new important information
> - Support the features of mzData and mzXML but not a lot more
> - But do provide clear support for SRM data
> - Finish the format soon with the resources available
>
> There are many enhancements that have been suggested, but the small group of volunteers that have actively developed this format have opted to focus on the primary goal set before us: develop a single format that the vendors and current software can easily support and thereby obsolete mzData and mzXML. The enhancements not considered compatible with this goal will be entertained for mzML 2.0.
>
> We are committed to providing not just the format, but also a set of working implementations, converters and readers, as well as a format validator, all to ensure that mzML is a format that will be adopted quickly and implemented uniformly. Prior to submission to the PSI document process, the following software will implement mzML:
>
> - 2 or more converters from vendor formats to mzML
> - the popular reader library RAMP that currently supports mzData and mzXML
> - an mzML semantic validator that checks for correct implementation
>
> We hope to follow this schedule:
>
> 2007-08-02  Release of mzML beta1R1 to major stakeholders for comment
> 2007-08-20  Comments from major stakeholders received
> 2007-09-01  Revised mzML 1.0 submitted to PSI document process, beginning 30 days internal review
> 2007-10-01  Revised mzML 1.01 begins 60 days community review
> 2007-10-06  Formal announcement that feedback is sought at HUPO world congress
> 2007-12-01  Formal 60 days community review closes
> 2008-01-01  Revised mzML 1.02 officially released
>
> Thank you for your help! Feel free to forward this message to someone whom you think should review the format at this stage.
>
> Regards,
>
> Eric
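On the zip question Randy raises: streaming XML out of a zip member is workable with standard libraries (what gets awkward is random access via a byte-offset index). A small sketch, with made-up archive and member names:

```python
import zipfile
import xml.etree.ElementTree as ET

def count_spectra_in_zip(archive_path, member_name):
    """Stream-parse one mzML-like member of a zip archive without extracting it."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member_name) as stream:              # decompresses on the fly
            count = 0
            for _, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag.endswith("spectrum"):
                    count += 1
                elem.clear()                                   # keep memory flat
            return count

# count_spectra_in_zip("runs.zip", "runA.mzML")
```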
From: Matt C. <mat...@va...> - 2007-08-05 00:44:28
Brian, I'm glad to hear discussion is going on. We know the index style has to change if multi-run support is kept, so it ought to be done properly. When this was last discussed on this list (right after I subscribed), I thought the same as you: that a hierarchical index would work best for a multi-run file. This is what I suggested in the earlier list discussion about the index: <source name="someSourceName"> <spectrum scan="15"> ... </spectrum> </source> <index> <indexedSource name="someSourceName" offset="0"> <indexedSpectrum scan="15" offset="33"> ... </indexedSpectrum> </indexedSource> </index> It was exactly the same structure as yours, obviously more verbose than necessary. I have changed my mind about the structure since then though. :) I do not think the index should have a separate XML element for each offset. That actually makes it HARDER to parse. There is no question that a space delimited list is easier and faster to parse than fully structured XML. You may be right about some XML parser libs not appreciating several kilobytes of characters for a single attribute. That could be solved by the following style of index: <index run_id="runA" size="15"> <scanNumberList>2 3 4 6 7 8 10 11 12 14 15 16</scanNumberList> <offsetList>42 142 242 342 442 542 642 742 842 942 1042 1142</offsetList> </index> <index run_id="runB" size="15"> <scanNumberList>2 3 4 6 7 8 10 11 12</scanNumberList> <offsetList>1242 1342 1442 1542 1642 1742 1842 1942 2042 2142 2242 2342</offsetList> </index> Even if the rest of the parsing is done with an XML lib like expat or xerces, simple delimiter/token parsing (like RAMP uses) is the best way to read this style of index, and nothing could be faster (without going binary). As we both agree, XML purists dislike any of these index styles, so we needn't try to please. :) As I already mentioned, once the scan numbers are in an array (which is by definition sorted since scan numbers are guaranteed to be in ascending order), a binary search can find the index into the array for a queried scan number. Alternatively, the query could be done in a streaming fashion, without storing the index in memory, by incrementing a counter while tokenizing the scan number list, and then using that counter to know how many tokens to advance in the offset list. But I don't see the point of not reading the entire index whenever an mzML file is opened (assuming the index exists and it's trusted to be correct), so I would opt for the binary search idea. If we wanted to be even faster, save space, and reuse the base64 functions, both lists could be binary and base64 encoded. I don't think that binary encoding would be terribly useful here, but it's an option. My main idea is to get away from the overhead of having each offset be wrapped by an XML element. It all depends how much you adhere to the motto: "If you're gonna go, go all out." -Matt Brian Pratt wrote: > Matt - > > Thanks for the info. Hopefully it shows up in the schema comments. > > Josh and I were kicking around the index idea off list (by accident, the > reply-to default tripped us up...). I agree that my initial idea is more > verbose than it should be. I worry though that those lists you propose > could get very long very fast and become troublesome to some XML parser > libs. Josh also suggested a more structured approach, here's my tweak of > that: (same example as before, but with runs named "Bob" and "fizzle" to > avoid implying any structured name conventions) > > <index run_id="Bob"> > ... 
> <offset scan="1041" 4212696/> > <offset scan="1042" 4218791/> > </index> > <index run_id="fizzle"> > <offset scan="1" 4221806/> > <offset scan="2" 4227580/> > <offset scan="3" 4231174/> > ... > </index> > > Not nearly as tight as yours, but easier to write an XML handler for it, and > less likely to upset XML purists (who are already outraged by the idea of an > index, so maybe that's not a problem...). We could use smaller names, too, > "fpos" for "offset", etc. > > Brian > > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On Behalf Of Matt > Chambers > Sent: Friday, August 03, 2007 7:44 PM > To: psi...@li... > Subject: Re: [Psidev-ms-dev] compressionType (RE: mzML 0.93 ready for first > review) > > Brian Pratt wrote: > >> Oh, and to revisit the original question, a snippet of a multi-run >> index would look like something like this, I think, assuming the >> runList contained a couple of runs with id="runA" and id="runB" >> respectively: >> >> <offset run_id="runA" spectrum_id="1041">4212696</offset><offset >> run_id="runA" spectrum_id="1042">4218791</offset> <offset >> run_id="runB" spectrum_id="1">4221806</offset> <offset run_id="runB" >> spectrum_id="2">4227580</offset> <offset run_id="runB" >> spectrum_id="3">4231174</offset> >> >> > Wouldn't a simpler, faster, and easier index be something like: > <index run_id="runA" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14 > 15 16 18 19 20" offsetList="42 142 242 342 442 542 642 742 842 942 1042 > 1142 1242 1342 1442" /> > <index run_id="runB" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14 > 15 16 18 19 20" offsetList="1542 1642 1742 1842 1942 2042 2142 2242 2342 > 2442 2542 2642 2742 2842 2942" /> > > These space delimited lists seem much easier to parse and deal with, unless > there is some attribute length limit I'm ignorant of? Certainly human > readability of the index is not an issue because a human can just do a find > on the scan number and jump straight to it (even with, heaven forbid it, > Notepad.exe). Adherence to XML principles needn't be a concern because the > whole concept of the index is outside the realm of pure XML principles. :) > I like the idea of reading in two sorted arrays, doing a binary search on > the first one to find the index of the desired scan number, and then using > that index to look up the offset in the second sorted array. I feel the > need... the need for speed. > |
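A sketch of how a reader might consume the index style Matt proposes above, resolving a scan number to a byte offset with a binary search; the element layout mirrors his example and is not an agreed-upon schema:

```python
import bisect
import xml.etree.ElementTree as ET

INDEX_XML = """<index run_id="runA" size="12">
  <scanNumberList>2 3 4 6 7 8 10 11 12 14 15 16</scanNumberList>
  <offsetList>42 142 242 342 442 542 642 742 842 942 1042 1142</offsetList>
</index>"""

def load_index(xml_text):
    elem = ET.fromstring(xml_text)
    scans = [int(s) for s in elem.findtext("scanNumberList").split()]
    offsets = [int(s) for s in elem.findtext("offsetList").split()]
    return scans, offsets

def offset_for_scan(scans, offsets, scan_number):
    # scan numbers are ascending by definition, so a binary search applies
    i = bisect.bisect_left(scans, scan_number)
    if i == len(scans) or scans[i] != scan_number:
        raise KeyError(scan_number)
    return offsets[i]

scans, offsets = load_index(INDEX_XML)
print(offset_for_scan(scans, offsets, 10))   # -> 642
```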
From: Brian P. <bri...@in...> - 2007-08-04 23:18:01
Matt - Thanks for the info. Hopefully it shows up in the schema comments. Josh and I were kicking around the index idea off list (by accident, the reply-to default tripped us up...). I agree that my initial idea is more verbose than it should be. I worry though that those lists you propose could get very long very fast and become troublesome to some XML parser libs. Josh also suggested a more structured approach, here's my tweak of that: (same example as before, but with runs named "Bob" and "fizzle" to avoid implying any structured name conventions) <index run_id="Bob"> ... <offset scan="1041" 4212696/> <offset scan="1042" 4218791/> </index> <index run_id="fizzle"> <offset scan="1" 4221806/> <offset scan="2" 4227580/> <offset scan="3" 4231174/> ... </index> Not nearly as tight as yours, but easier to write an XML handler for it, and less likely to upset XML purists (who are already outraged by the idea of an index, so maybe that's not a problem...). We could use smaller names, too, "fpos" for "offset", etc. Brian -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Matt Chambers Sent: Friday, August 03, 2007 7:44 PM To: psi...@li... Subject: Re: [Psidev-ms-dev] compressionType (RE: mzML 0.93 ready for first review) Brian Pratt wrote: > Oh, and to revisit the original question, a snippet of a multi-run > index would look like something like this, I think, assuming the > runList contained a couple of runs with id="runA" and id="runB" > respectively: > > <offset run_id="runA" spectrum_id="1041">4212696</offset><offset > run_id="runA" spectrum_id="1042">4218791</offset> <offset > run_id="runB" spectrum_id="1">4221806</offset> <offset run_id="runB" > spectrum_id="2">4227580</offset> <offset run_id="runB" > spectrum_id="3">4231174</offset> > Wouldn't a simpler, faster, and easier index be something like: <index run_id="runA" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14 15 16 18 19 20" offsetList="42 142 242 342 442 542 642 742 842 942 1042 1142 1242 1342 1442" /> <index run_id="runB" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14 15 16 18 19 20" offsetList="1542 1642 1742 1842 1942 2042 2142 2242 2342 2442 2542 2642 2742 2842 2942" /> These space delimited lists seem much easier to parse and deal with, unless there is some attribute length limit I'm ignorant of? Certainly human readability of the index is not an issue because a human can just do a find on the scan number and jump straight to it (even with, heaven forbid it, Notepad.exe). Adherence to XML principles needn't be a concern because the whole concept of the index is outside the realm of pure XML principles. :) I like the idea of reading in two sorted arrays, doing a binary search on the first one to find the index of the desired scan number, and then using that index to look up the offset in the second sorted array. I feel the need... the need for speed. Regarding the spectrumID stuff: > Unsure, actually, whether spectrum_id (the spectrum::id string value) > or spectrum_scanNumber (the spectrum::scanNumber int value) is better > since we don't specify a uniqueness constraint for either. > > These could use a bit of commenting, I think: > > <xs:complexType name="SpectrumType"> ... > > > <xs:attribute name="id" type="xs:string" use="required"/> > <xs:attribute name="scanNumber" type="xs:int" use="required"/> > > I'm guessing that in practice these will tend to be one and the same, > which leads one to wonder why we have them both, which suggests a bit > more documentation would be in order. 
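To make the lookup strategy discussed in this thread concrete, here is a minimal C++ sketch of the idea Matt describes: keep each run's scan numbers and byte offsets as two parallel sorted arrays, binary-search the scan number, and use the matching position to read the offset. The type and function names are illustrative assumptions, not part of any proposed schema or reader API.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Per-run index: scan numbers and byte offsets kept as parallel arrays,
// both sorted ascending so position i in one pairs with position i in the other.
struct RunIndex {
    std::vector<int> scanNumbers;
    std::vector<std::int64_t> offsets;
};

// Binary search for a scan number; returns the byte offset, or -1 if the
// scan is not present in this run's index.
std::int64_t findOffset(const RunIndex& idx, int scan) {
    auto it = std::lower_bound(idx.scanNumbers.begin(), idx.scanNumbers.end(), scan);
    if (it == idx.scanNumbers.end() || *it != scan) return -1;
    return idx.offsets[static_cast<std::size_t>(it - idx.scanNumbers.begin())];
}

int main() {
    // Values taken from the hypothetical "fizzle" run in the example above.
    std::map<std::string, RunIndex> indexByRun;
    indexByRun["fizzle"] = RunIndex{ {1, 2, 3}, {4221806, 4227580, 4231174} };

    std::printf("run fizzle, scan 2 -> offset %lld\n",
                static_cast<long long>(findOffset(indexByRun["fizzle"], 2)));
    return 0;
}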
From: Matt C. <mat...@va...> - 2007-08-04 02:45:54
|
Brian Pratt wrote: > Oh, and to revisit the original question, a snippet of a multi-run > index would look like something like this, I think, assuming the > runList contained a couple of runs with id="runA" and id="runB" > respectively: > > <offset run_id="runA" spectrum_id="1041">4212696</offset><offset > run_id="runA" spectrum_id="1042">4218791</offset> <offset > run_id="runB" spectrum_id="1">4221806</offset> <offset run_id="runB" > spectrum_id="2">4227580</offset> <offset run_id="runB" > spectrum_id="3">4231174</offset> > Wouldn't a simpler, faster, and easier index be something like: <index run_id="runA" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14 15 16 18 19 20" offsetList="42 142 242 342 442 542 642 742 842 942 1042 1142 1242 1342 1442" /> <index run_id="runB" size="15" scanNumberList="2 3 4 6 7 8 10 11 12 14 15 16 18 19 20" offsetList="1542 1642 1742 1842 1942 2042 2142 2242 2342 2442 2542 2642 2742 2842 2942" /> These space delimited lists seem much easier to parse and deal with, unless there is some attribute length limit I'm ignorant of? Certainly human readability of the index is not an issue because a human can just do a find on the scan number and jump straight to it (even with, heaven forbid it, Notepad.exe). Adherence to XML principles needn't be a concern because the whole concept of the index is outside the realm of pure XML principles. :) I like the idea of reading in two sorted arrays, doing a binary search on the first one to find the index of the desired scan number, and then using that index to look up the offset in the second sorted array. I feel the need... the need for speed. Regarding the spectrumID stuff: > Unsure, actually, whether spectrum_id (the spectrum::id string value) > or spectrum_scanNumber (the spectrum::scanNumber int value) is > better since we don't specify a uniqueness constraint for either. > > These could use a bit of commenting, I think: > > <xs:complexType name="SpectrumType"> ... > > > <xs:attribute name="id" type="xs:string" use="required"/> > <xs:attribute name="scanNumber" type="xs:int" use="required"/> > > I'm guessing that in practice these will tend to be one and the same, > which leads one to wonder why we have them both, which suggests a > bit more documentation would be in order. Also, any uniqueness > constraints should be documented (we probably don't want either > scanNumber or id being reused within a Run). > > Perhaps the "id" attribute could be made optional, for those cases > where it's just going to be a repeat of scanNumber? Or the other way > around (for a relative savings of 8 characters per scan!). > > Out of curiousity, what's the scenario in which id isn't the same as > scanNumber? > > - Brian > To quote from an earlier mailing list posting by Eric Deutsch, responding to me (after which there was strangely over a month of no list activity): > >> - The validator will enforce that scan numbers are in ascending > >> order, but not necessarily without gaps - The validator will > >> enforce that scan numbers and identifiers must be unique within a > >> run (but there could be multiple runs in a file) > >> > > I'm confused about the difference between identifiers and scan > > numbers. Since a mzML file can have more than one spectra source > > (e.g. multiple RAW files), scan numbers could only be unique within > > a run, as you say, but I would expect that the "SpectrumID" > > identifier, if it is different from the scan number, should be > > unique to the whole file. What is the reasoning > > You are correct, my error. 
> > > behind the SpectrumID identifier being unique only to a run, or am > > I misunderstanding? What is the purpose of having a separate > > SpectrumID identifier anyway? > > To allow LSIDs for individual spectra or some other non-integer IDs > if desired. I like the capability of having arbitrary spectrum ids, but it must be made clear whether they are unique to a FILE or unique to a RUN. It has already been given that the scan number will be unique within a run and so I assume that means the id must be unique in some way as well. If the id is just an S and then a scan number, then I agree with Brian. I suggest that unless the file writer has a compelling reason to do so (e.g. the user told it to or it comes from the source file as something special), the id attribute should be left out (so it should be optional). File readers should treat the id as bonus information unless there is a compelling reason to require it. -Matt |
From: Brian P. <bri...@in...> - 2007-08-04 00:48:42
|
These could use a bit of commenting, I think: <xs:complexType name="SpectrumType"> ... <xs:attribute name="id" type="xs:string" use="required"/> <xs:attribute name="scanNumber" type="xs:int" use="required"/> I'm guessing that in practice these will tend to be one and the same, which leads one to wonder why we have them both, which suggests a bit more documentation would be in order. Also, any uniqueness constraints should be documented (we probably don't want either scanNumber or id being reused within a Run). Perhaps the "id" attribute could be made optional, for those cases where it's just going to be a repeat of scanNumber? Or the other way around (for a relative savings of 8 characters per scan!). Out of curiosity, what's the scenario in which id isn't the same as scanNumber? - Brian |
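If the id attribute did become optional, a reader could fall back to the scan number, along the lines of the "S" plus scan number pattern mentioned elsewhere in this thread. A tiny hedged C++ sketch; the helper name and the fallback format are assumptions, not anything the schema defines.

#include <string>

// Hypothetical helper, not part of any schema or API: if the optional id
// attribute is missing, derive an identifier from the scan number.
std::string effectiveSpectrumId(const std::string& idAttr, int scanNumber) {
    if (!idAttr.empty()) return idAttr;          // writer supplied an explicit id
    return "S" + std::to_string(scanNumber);     // fall back to the scan number
}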
From: Brian P. <bri...@in...> - 2007-08-04 00:26:46
|
Do we want to place any constraints on compressionType? I'd at least start with "none" and "zlib". And it seems kind of a shame to make writers declare compressionType="none"; can we default it to that? At a minimum, we should place this info in a comment if we don't want to get fancy with an actual constrained type. - Brian |
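For a sense of what a reader would have to do when compressionType="zlib" is allowed, here is a hedged C++ sketch. It assumes the peak list arrives as base64 text wrapping zlib-compressed 64-bit floats, in the spirit of how mzXML 3.0 handles compressed peak lists; the draft schema itself does not spell these details out, so treat the encoding chain as an assumption. Only standard zlib calls are used.

#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

// Compact base64 decoder; skips '=' padding and whitespace.
std::vector<unsigned char> base64Decode(const std::string& text) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<unsigned char> out;
    int value = 0, bits = 0;
    for (char c : text) {
        if (c == '=' || c == '\n' || c == '\r' || c == ' ' || c == '\t') continue;
        std::string::size_type pos = alphabet.find(c);
        if (pos == std::string::npos) throw std::runtime_error("bad base64 character");
        value = (value << 6) | static_cast<int>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<unsigned char>((value >> bits) & 0xFF));
        }
    }
    return out;
}

// Decode one peak array: base64 text -> zlib inflate -> 64-bit floats.
// expectedBytes would come from array-length metadata (8 bytes per value).
std::vector<double> decodeZlibPeaks(const std::string& base64Text, std::size_t expectedBytes) {
    std::vector<unsigned char> compressed = base64Decode(base64Text);
    std::vector<unsigned char> raw(expectedBytes);
    uLongf rawLen = static_cast<uLongf>(raw.size());
    if (uncompress(raw.data(), &rawLen, compressed.data(),
                   static_cast<uLong>(compressed.size())) != Z_OK)
        throw std::runtime_error("zlib uncompress failed");
    std::vector<double> values(rawLen / sizeof(double));
    std::memcpy(values.data(), raw.data(), values.size() * sizeof(double));
    return values;  // note: assumes the writer's byte order matches the reader's
}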
From: Brian P. <bri...@in...> - 2007-08-04 00:05:52
|
I note in mzML0.93.xsd that use of AcquisitionListType::spectrumType is "prohibited" - probably an errant mouse click in XMLSpy. - Brian |
From: Brian P. <bri...@in...> - 2007-08-03 23:58:41
|
Thanks all for clarifying points and thoughtful discussion. And apparently I *was* there when the decision was made... :) Chris, as he so often does, has hit the nail on the head in mentioning the need to "try to get people to move away from informal mechanisms (like use of folder trees or zips, ad hoc file naming 'formalisms' etc.)". The use of that sort of tribal knowledge (especially ad hoc file naming formalisms) is deeply ingrained in the current mass spec software ecosystem, and the less we have of it in future the better. But I'm left wondering whether mzML is meant to be evolutionary (it seems to be described that way, as a merging of the best aspects of mzData and mzXML), or revolutionary ("...get people to move away from..."). The thing with APIs, systems, data format standards etc, is that you can try to evolve and extend them toward something better, but you risk winding up with something like Windows Vista that's just an utter mess, compared to Apple's OSX where there was a pretty clean break with the past, to good effect. I guess what's bugging me is that I'm not sure this feature breaks things thoroughly enough. While it's certainly easy enough to slip mzML in behind existing mass spec reader APIs and just throw an exception when the run count is greater than one, it feels half baked. I worry about going out the gate with features we don't think anyone will actually support (not that RAMP and its users are the whole world of mzML consumers, it's just the part of the world I'm familiar with, so this fear may be unfounded). Of course if we wait to do anything new until the Grand Unified Schema is ready, we won't get anything new done. In the end I defer to the wisdom of the group. Oh, and to revisit the original question, a snippet of a multi-run index would look something like this, I think, assuming the runList contained a couple of runs with id="runA" and id="runB" respectively: <offset run_id="runA" spectrum_id="1041">4212696</offset> <offset run_id="runA" spectrum_id="1042">4218791</offset> <offset run_id="runB" spectrum_id="1">4221806</offset> <offset run_id="runB" spectrum_id="2">4227580</offset> <offset run_id="runB" spectrum_id="3">4231174</offset> Unsure, actually, whether spectrum_id (the spectrum::id string value) or spectrum_scanNumber (the spectrum::scanNumber int value) is better since we don't specify a uniqueness constraint for either. Thanks, Brian -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Chris Taylor Sent: Friday, August 03, 2007 2:19 AM Cc: psi...@li... Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review <snip> |
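Brian's remark above, that an existing one-file/one-run API could accept mzML by handling only the single-run case and raising an error otherwise, would look roughly like this. A hedged C++ sketch; the Run type and function name are placeholders, not RAMP or any real API.

#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for whatever a reader keeps per run; not a real mzML object model.
struct Run {
    std::string id;
    // spectra, index, instrument metadata, ...
};

// Shim for a one-file/one-run API: accept a multi-run-capable file only when
// the run list actually contains a single run, and refuse anything else.
const Run& selectOnlyRun(const std::vector<Run>& runList) {
    if (runList.size() != 1)
        throw std::runtime_error("this reader handles exactly one run per mzML file");
    return runList.front();
}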
From: Matthew C. <mat...@va...> - 2007-08-03 17:27:08
|
> -----Original Message----- > From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Eric Deutsch > Sent: Friday, August 03, 2007 2:00 AM > To: psi...@li... > Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review > > If I recall correct the feature of multiple runs crept in at the end of > the Seattle meeting last fall. Can anyone articulate a compelling use case > for multiple runs per file? I seem to recall a scenario where at least > one vendor encodes multiple runs in a single (wiff??) file, but I don't > know about any of that for sure. Anyone have such a case? > Yes. Our protein assembly software compiles and validates IDs from multiple runs into a single, assembled file. At the same time, it supports an arbitrary hierarchical organization of those runs using the file system tree paradigm: / all IDs ("root") /foo IDs in /foo /foo/run1 /foo/run2 /bar IDs in /bar /bar/run1 /bar/run2 We put the entire assembly into one XML file (our own schema, but ideally analysisXML will eventually supplant it), but we currently have no way of linking the IDs back to the source spectra. Copying the peak lists and metadata into our protein assembly file seems absurd, and linking each run to a separate mzData or mzXML file that must be in a relative path to that file is also not a pleasing idea. On the other hand, if we could assemble a single spectra data file consisting of only the identified spectra and containing all the runs that are in the protein assembly file, it allows us to keep a one-to-one correspondence between analysis and spectra - without having dozens of files lying around to support that single analysis. This would allow us to easily support spectrum/ID visualization without a lot of file organization overhead. As long as analysisXML will also support multiple runs per file, this feature in mzML makes a lot of sense - at least to me. > > I have a few comments: > > > > - There seems to be a timestamp on the run element now (maybe I just > > missed it before), of type xs:dateTime. It's an optional attribute and it > > has an ambiguous meaning. Why isn't this expanded into a start and stop > > timestamp for the run? Also, why is it optional? > > The believe the intent was that the timestamp is the UT at the start of > the run. We should clarify this. > > Is it useful to encode the stop timestamp? > > As for why optional, we imagined that in the real world this value might > not be known properly. Imagine a scenario were someone is converting a > legacy mzXML file to mzML. This information may not be available, sadly. > It is certainly encouraged that modern converters/writers include it. Yes the stop time is useful. It clarifies the meaning of the timestamp and otherwise, the stop time would have to be inferred from the last spectrum's retention time, which seems unnecessarily messy. Also, I think it should be required, but if a writer doesn't know or the time is inapplicable, some equivalent of "NaN" should be acceptable (like 01/01/0001@00:00:00 or something). > cvLabel is really just an id to indicate which CV (as more completely > defined within <cvList> above) the term comes from. It seems to be current > best practice that life science CV accession numbers begin with an OBO > namespace, :, and a number. But not all CVs will necessarily follow this > convention as far as I'm aware. Further clarification and/or standardization would be good on this point. > > - I see a TODO item is giving the binaryDataArray's "dataType" attribute a > > CV entry. 
I agree with this. But I think the values should be more > > machine-oriented, like "float32", "float64", "int32", "uint64", etc. > > You mean "float32" is preferred over "32-bit float" Yes. Anybody who knows what "32-bit float" means also knows what "float32" means and the latter is easier to parse and looks nicer as an attribute IMO. :) > For LC-MSn ion trap data, this is relatively straightforward and is what > is depicted in the examples. A run is a series of scans, usually counted > consecutively by the instrument, obtained as a sample is injected into the > instrument. A sample is the biomaterial that is injected into an > instrument over a run. The source file is the one or more files from > which the mzML was generated. It will usually be a single vendor-format > raw file. It could be an mzXML file. It could (unfortunately) be a series > of dta files. > > For MALDI or gel spot processing, however, this might be quite different. > We had previously entertained "analyte identifier" or "MALDI spot > identifier" CV terms to allow annotating each spot. This might make the > run-based sample undefined. I think LC-MSn heavily colored our thinking > during development. It would be extremely nice to have a detailed example > of data where individual scan refer to different "analyte identifiers" or > the like. Would someone contribute this? > Are you saying that the source file list could be a list of 10,000 DTA files? Oh my aching heart. In that case, a run element's sourceFileRefList would have to point to each sourceFile element in turn. That would be absurd. Multiple sourceFileLists would have to be supported and the run's reference to it would need to reference the sourceFileList instead of each individual sourceFile. Nevertheless, you see what I mean about them being tightly coupled? It seems like each run could have its own source file (list) and sample element/attribute instead of references to a source file (list) and/or sample. The references concept seems to represent these tightly coupled files/IDs as loosely coupled. Unless multiple runs can be generated from the exact same sample and/or file? I think I'm bordering on hypocritical here (vs. multiple runs per file) but that's probably because I'm ignorant of the mass spec concepts in play. -Matt |
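Matt's point about machine-oriented dataType names is easy to see in code: a reader can dispatch directly on the string when widening decoded bytes. A hedged C++ sketch, assuming the "float32"/"float64" values proposed above (not an adopted CV) and a byte buffer that has already been base64/zlib decoded; nothing here comes from the schema itself.

#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Widen a decoded byte buffer to doubles, dispatching on a machine-oriented
// dataType string. Integer types would be handled the same way.
std::vector<double> widenToDouble(const std::vector<unsigned char>& bytes,
                                  const std::string& dataType) {
    std::vector<double> out;
    if (dataType == "float64") {
        out.resize(bytes.size() / sizeof(double));
        std::memcpy(out.data(), bytes.data(), out.size() * sizeof(double));
    } else if (dataType == "float32") {
        std::vector<float> tmp(bytes.size() / sizeof(float));
        std::memcpy(tmp.data(), bytes.data(), tmp.size() * sizeof(float));
        out.assign(tmp.begin(), tmp.end());
    } else {
        throw std::runtime_error("unsupported dataType: " + dataType);
    }
    return out;
}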
From: Matthew C. <mat...@va...> - 2007-08-03 15:23:02
|
> -----Original Message----- > From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Brian Pratt > Sent: Thursday, August 02, 2007 7:03 PM > To: psi...@li... > Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review > > So why all the excitement about API stability? Consider this: originally, > RAMP read mzXML only. Then we added the ability to read mzData. Now, all > of the many programs that employ RAMP suddenly could read both mzData and > mzXML with nothing more than a recompilation (OK, that first time actually > required a small RAMP API tweak - using a RAMPFILE handle instead of a FILE > handle). Later we added mzXML 3.0 with its compressed peak lists, and RAMP > users only needed to recompile to get this additional capability - no > "downstream" changes needed. There have even been unreleased versions of > RAMP that read intermediate proposed forms of mzML. Such ease of adoption > is very powerful when trying to establish a new data standard. But guess > what? RAMP can't be made to transparently handle the current proposed mzML > format due to the breaking of the one file / one run mapping. OK. So when changes are made to RAMP/JRAP's internal parser to support mzML, they can by default operate on only the first run in a file. Crisis averted. :) Existing programs depending on RAMP/JRAP will then work with mzML (assuming the input files are using the highly probable one run per file paradigm) and the API developers are then free to work on a new or extended API to support multi-run files, which RAMP/JRAP users can adopt at their leisure (i.e. whenever they need the multi-run support). To be honest, looking at the RAMP header, it doesn't seem like it would be that daunting to extend support to multi-runs. From the point of view of the program, instead of just asking for a scan number, an extended API function might require both the run id (by index or by name) and the scan number. Or there might be a function to change between the multiple runs and the existing by-scan-number functions could work on the current run only. After that is implemented, supporting multiple runs in an input file is all up to the API users to support changing between the multiple runs. That could be as simple as a selection box to choose the run (or an outer loop to iterate through all runs in a file) in addition to the existing selection box to choose the scan number (or an inner loop to iterate through all scans in a run). > Truly new mass spec behaviors will eventually make it necessary to expand or > even break the current mass spec data reader APIs. Multiple precursors are > actually a good example of this (as an expansion, hopefully). But, breaking > the one run / one file relationship isn't driven by new mass spec behaviors > that I know of. Getting API users accustomed to API extensions to support new features is a good thing, don't you agree? The multiple precursors feature will break RAMP more than multiple runs will as far as I can tell. No longer would API users be able to look at the precursorMz field of the scan header struct, it would be an array. Even in this case though, you can implement it as an extension so that existing programs can use the precursorMz field which always refers to the first precursor in the list, while updated API users can use the array field, which would contain all the precursors in the list. You could even deprecate the precursorMz field and say it'll go away eventually, that's always fun. > What is the use case for this feature, anyway? 
What's so > compelling about having multiple runs in a single mzML file that everyone > will want to massively rejigger their code to implement this? Seems like > we're just creating an orphan feature that will only serve to trip up unwary > mzML output writers ("nice multi-run output ya got there - too bad nobody > can read it"), which I think is exactly the kind of thing the committee said > they wanted to avoid. > > - Brian I don't think I knew what a mass spectrometer was 10 years ago or so when the DTA format was coming around (I wasn't in high school yet!), but I imagine that the argument for DTA files (i.e. there aren't very many spectra, so why concatenate them?) was not very unlike your argument here. It is of course a completely different scale and I doubt anyone will ever have I/O overhead problems due to reading too many (relatively) tiny mzML files, but it's the principle of the thing. :) I can definitely see reasons for grouping multiple SCX fractions or IEP bands from a single sample into one file, if only for organizational purposes. -Matt |
From: Chris T. <chr...@eb...> - 2007-08-03 09:19:09
|
Hi all. It's verging on tangential, but in my group we recently drafted up a schema to index multi-omics data sets in three (omics-specific) repositories. That meant we had to have a concept of an 'assaying process' (shortened to assay). For a microarray that was basically a single hybridisation to one array; some chips come with four arrays on one slide = four assays; technical replicates = separate arrays. Basically the rationale was to address the smallest (atomic, if you like) unit. For MS (proteomics first, but also metabolomics) we settled on one run = one assay for inline LCMS (i.e. a run lasting many minutes). This automatically gives offline a cardinality of one assay per (previously collected) fraction run. For MALDI (the point of all this mumbling), we settled on one spot (no matter how many shots) = one assay, because although the various spots will likely have a connection, that isn't guaranteed. So one plate (if one were recording plates) links to n assays (in reality the fact that there was a set of spots on a particular plate is kind of ignorable but for QC and the like). Where I see a further problem is for MS 'imaging' (i.e. shooting lots at tissue slices to map protein distributions) and isn't there a MALDI-like source that isn't discrete spots? Maybe I misremembered that one though... In either (the latter admittedly imaginary) case though I suspect all one could robustly consider as a 'run' or assay would be the analysis of one set of coordinates for the laser. Don't know if that helps much :) Btw although I take the point that was made about tarballs, to play devil's advocate for a sec, the point of some of the standardisation stuff is to try to get people to move away from informal mechanisms (like use of folder trees or zips, ad hoc file naming 'formalisms' etc.) to associate files (/datasets). If the ability to combine multiple runs/assays in one mzML file _is_ more trouble than it is worth, a compromise might be to leverage whatever can be used to uniquely ID an mzML file (have LSIDs died yet or what?) to x-ref one file to another (i.e. insert a [0..*] element somewhere near the top with an attribute to hold an external file ref and a sibling string attribute to hold a free text description)? Or, as was suggested, produce a wrapper schema. This should probably be a (lightweight) FuGE-based thing as it is all in there already (CPAS does this iirc, although with an earlier version of FuGE). Cheers, Chris. Eric Deutsch wrote: >> From: psi...@li... [mailto:psidev-ms-dev- >> bo...@li...] On Behalf Of Matthew Chambers >> >> What's wrong with the schema supporting multiple runs per file and letting >> implementers gradually add support for it? There are many features of >> mzML that will require substantial rewrites of the existing parser APIs. >> Parameter groups, multiple runs, multiple precursors, and compressed >> binary data are all major "completely predictable trouble spots." As long >> as the file readers develop faster than the file writers, there won't be a >> problem. ;) I very much doubt that writers (e.g. ReAdW) will be writing >> multiple instrument files into one mzML file any time soon (unless >> somebody is itching to do this without saying so?). The parameter groups >> and multiple precursors are more problematic, IMO, but still good >> improvements. > > If I recall correct the feature of multiple runs crept in at the end of the Seattle meeting last fall. Can anyone articulate a compelling use case for multiple runs per file? 
I seem to recall a scenario where at least one vendor encodes multiple runs in a single (wiff??) file, but I don't know about any of that for sure. Anyone have such a case? > >> I have a few comments: >> >> - There seems to be a timestamp on the run element now (maybe I just >> missed it before), of type xs:dateTime. It's an optional attribute and it >> has an ambiguous meaning. Why isn't this expanded into a start and stop >> timestamp for the run? Also, why is it optional? > > The believe the intent was that the timestamp is the UT at the start of the run. We should clarify this. > > Is it useful to encode the stop timestamp? > > As for why optional, we imagined that in the real world this value might not be known properly. Imagine a scenario were someone is converting a legacy mzXML file to mzML. This information may not be available, sadly. It is certainly encouraged that modern converters/writers include it. > >> - Most every cvParam has a "cvLabel" attribute that is "MS" but the >> accession attribute of each cvParam seems to include the cvLabel in it >> ("MS:xxxxxxxx"). If that is just a coincidence, I think it should be >> changed so that it is required and the cvLabel can be eliminated. If it's >> not a coincidence, why is like that? If the parser needs to know which >> vocabulary an accession number is from, it can parse until the colon >> delimiter. Alternatively, keeping cvLabel and getting rid of the "MS:" in >> the accession attribute would allow somewhat more efficient parsing. In >> the alternative case, I suggest a required default cvLabel somewhere in >> the header, similar to setting the default XML namespace. > > cvLabel is really just an id to indicate which CV (as more completely defined within <cvList> above) the term comes from. It seems to be current best practice that life science CV accession numbers begin with an OBO namespace, :, and a number. But not all CVs will necessarily follow this convention as far as I'm aware. > >> - I see a TODO item is giving the binaryDataArray's "dataType" attribute a >> CV entry. I agree with this. But I think the values should be more >> machine-oriented, like "float32", "float64", "int32", "uint64", etc. > > You mean "float32" is preferred over "32-bit float" > >> - Parameter groups are good, especially since the spectrum headers seem to >> have ballooned to be more flexible. Anything that makes the file- >> dominating spectrum elements smaller and faster to parse is nice - >> indexing the shared parameters is a good way to do this. >> >> - I'd still like to see a clear definition of "run" relative to "sample" >> and "source file." Seems like these three are all tightly coupled. > > For LC-MSn ion trap data, this is relatively straightforward and is what is depicted in the examples. A run is a series of scans, usually counted consecutively by the instrument, obtained as a sample is injected into the instrument. A sample is the biomaterial that is injected into an instrument over a run. The source file is the one or more files from which the mzML was generated. It will usually be a single vendor-format raw file. It could be an mzXML file. It could (unfortunately) be a series of dta files. > > For MALDI or gel spot processing, however, this might be quite different. We had previously entertained "analyte identifier" or "MALDI spot identifier" CV terms to allow annotating each spot. This might make the run-based sample undefined. I think LC-MSn heavily colored our thinking during development. 
It would be extremely nice to have a detailed example of data where individual scan refer to different "analyte identifiers" or the like. Would someone contribute this? > > Regards, > Eric > >> >> -Matt Chambers >> >> Vanderbilt MSRC >> >> >> >> ________________________________ >> >> From: psi...@li... [mailto:psidev-ms-dev- >> bo...@li...] On Behalf Of Brian Pratt >> Sent: Thursday, August 02, 2007 2:47 PM >> To: psi...@li... >> Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review >> >> >> >> (Note: I know I'm late to the party with this comment, but I think it's >> important) >> >> >> >> I noticed this in the todo file: >> >> " - Now that we're allowing multiple runs in a file, how will the index >> look to handle this?" >> >> >> >> Better question: what will software that uses such an index look like? >> >> >> >> Answer: it won't look much like anything that currently reads mzXML and >> mzData - including X!Tandem or anything using RAMP (TPP and others) or >> JRAP (CPAS and others). These programs easily deal with both mzData and >> mzXML in their various versions by using APIs which, as it happens, assume >> one file per run and one run per file. Breaking this one to one >> correspondence in mzML means you can't just slide mzML support in behind >> the API, and of course also violates a fundamental assumption which flows >> through the code that calls these APIs, right out to the user interface in >> most cases. This means extensive surgery to any program that wants to >> read mzML properly, and my guess is that means mzML is DOA. At a minimum >> it becomes a completely predictable trouble spot since you can now write >> legal mzML files that the majority of mzML readers will simply not know >> how to handle. They'll be OK with RunList::count == 1, but no more - so, >> why set ourselves up for trouble? >> >> >> >> Multiple runs per file are probably useful in some cases, but if the >> stated goal of mzML is to replace mzXML and mzData then I think this >> feature is actually scope creep which threatens the mission and should be >> dropped. Let those who really want this feature come up with a wrapper >> schema, but don't call it mzML lest you force the vast majority of mzML >> consuming software to be broken from the start. >> >> >> >> - Brian >> >> >> >> ________________________________ >> >> From: psi...@li... [mailto:psidev-ms-dev- >> bo...@li...] On Behalf Of Eric Deutsch >> Sent: Thursday, August 02, 2007 1:02 AM >> To: len...@eb...; Jimmy Eng; lu...@eb...; Puneet Souda; >> Joshua Tasman; Pierre-Alain Binz; Henning Hermjakob; Randy Julian; Andy >> Jones; David Creasy; Sean L Seymour; Angel Pizarro; David Fenyo; >> Jam...@wa...; Mike Coleman; Matthew Chambers; Helen Jenkins; >> Philip Jones; Shofstahl, Jim; Brian Pratt; Andreas Römpp; Kent Laursen; >> Martin Eisenacher; Fredrik Levander; Jayson Falkner; Pedrioli Patrick Gino >> Angelo; Hans Vissers; Eric Deutsch; cl...@br...; >> dav...@ag...; rb...@be...; psidev-ms- >> de...@li... >> Cc: Rolf Apweiler; Ruedi Aebersold >> Subject: [Psidev-ms-dev] mzML 0.93 ready for first review >> >> Hi everyone, after considerable hard work from many people, we have a >> prerelease of mzML (the union of mzData and mzXML) available for comment >> by you, a major stakeholder in mzML. 
>> >> You may download a kit of material to examine at: >> >> http://db.systemsbiology.net/projects/PSI/mzML/mzML_beta1R1.zip >> >> The general mzML development page is at: >> >> http://psidev.info/index.php?q=node/257 >> >> Please send feedback to: >> >> psi...@li... >> >> We ask that you respond by August 20. >> >> Additional releases with more information may be provided during the >> coming month. >> >> The current format has been guided by these principles: >> >> - Keep the format simple >> >> - Minimize alternate ways of encoding the same information >> >> - Allow some flexibility for encoding new important information >> >> - Support the features of mzData and mzXML but not a lot more >> >> - But do provide clear support for SRM data >> >> - Finish the format soon with the resources available >> >> There are many enhancements that have been suggested, but the small group >> of volunteers that have actively developed this format have opted to focus >> on the primary goal set before us: develop a single format that the >> vendors and current software can easily support and thereby obsolete >> mzData and mzXML. The enhancements not considered compatible with this >> goal will be entertained for mzML 2.0 >> >> We are committed to providing not just the format, but also a set of >> working implementations, converters and readers, as well as a format >> validator, all to ensure that mzML is a format that will be adopted >> quickly and implemented uniformly. Prior to submission to the PSI document >> process, the following software will implement mzML: >> >> - 2 or more converters from vendor formats to mzML >> >> - the popular reader library RAMP that currently supports mzData and mzXML >> >> - an mzML semantic validator that checks for correct implementation >> >> We hope to follow this schedule: >> >> 2007-08-02 Release of mzML beta1R1 to major stakeholders for comment >> >> 2007-08-20 Comments from major stakeholders received >> >> 2007-09-01 Revised mzML 1.0 submitted to PSI document process, beginning >> 30 days internal review >> >> 2007-10-01 Revised mzML 1.01 begins 60 days community review >> >> 2007-10-06 Formal announcement that feedback is sought at HUPO world >> congress >> >> 2007-12-01 Formal 60 days community review closes >> >> 2008-01-01 Revised mzML 1.02 officially released >> >> Thank you for your help! Feel free to forward this message to someone whom >> you think should review the format at this stage. >> >> Regards, >> >> Eric > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > -- ~~~~~~~~~~~~~~~~~~~~~~~~ chr...@eb... http://mibbi.sf.net/ ~~~~~~~~~~~~~~~~~~~~~~~~ |
From: Eric D. <ede...@sy...> - 2007-08-03 06:59:40
|
> From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Matthew Chambers > > What's wrong with the schema supporting multiple runs per file and letting > implementers gradually add support for it? There are many features of > mzML that will require substantial rewrites of the existing parser APIs. > Parameter groups, multiple runs, multiple precursors, and compressed > binary data are all major "completely predictable trouble spots." As long > as the file readers develop faster than the file writers, there won't be a > problem. ;) I very much doubt that writers (e.g. ReAdW) will be writing > multiple instrument files into one mzML file any time soon (unless > somebody is itching to do this without saying so?). The parameter groups > and multiple precursors are more problematic, IMO, but still good > improvements. If I recall correctly the feature of multiple runs crept in at the end of the Seattle meeting last fall. Can anyone articulate a compelling use case for multiple runs per file? I seem to recall a scenario where at least one vendor encodes multiple runs in a single (wiff??) file, but I don't know about any of that for sure. Anyone have such a case? > I have a few comments: > > - There seems to be a timestamp on the run element now (maybe I just > missed it before), of type xs:dateTime. It's an optional attribute and it > has an ambiguous meaning. Why isn't this expanded into a start and stop > timestamp for the run? Also, why is it optional? I believe the intent was that the timestamp is the UT at the start of the run. We should clarify this. Is it useful to encode the stop timestamp? As for why optional, we imagined that in the real world this value might not be known properly. Imagine a scenario where someone is converting a legacy mzXML file to mzML. This information may not be available, sadly. It is certainly encouraged that modern converters/writers include it. > - Most every cvParam has a "cvLabel" attribute that is "MS" but the > accession attribute of each cvParam seems to include the cvLabel in it > ("MS:xxxxxxxx"). If that is just a coincidence, I think it should be > changed so that it is required and the cvLabel can be eliminated. If it's > not a coincidence, why is it like that? If the parser needs to know which > vocabulary an accession number is from, it can parse until the colon > delimiter. Alternatively, keeping cvLabel and getting rid of the "MS:" in > the accession attribute would allow somewhat more efficient parsing. In > the alternative case, I suggest a required default cvLabel somewhere in > the header, similar to setting the default XML namespace. cvLabel is really just an id to indicate which CV (as more completely defined within <cvList> above) the term comes from. It seems to be current best practice that life science CV accession numbers begin with an OBO namespace, :, and a number. But not all CVs will necessarily follow this convention as far as I'm aware. > - I see a TODO item is giving the binaryDataArray's "dataType" attribute a > CV entry. I agree with this. But I think the values should be more > machine-oriented, like "float32", "float64", "int32", "uint64", etc. You mean "float32" is preferred over "32-bit float"? > - Parameter groups are good, especially since the spectrum headers seem to > have ballooned to be more flexible. Anything that makes the file- > dominating spectrum elements smaller and faster to parse is nice - > indexing the shared parameters is a good way to do this. > > - I'd still like to see a clear definition of "run" relative to "sample" > and "source file." Seems like these three are all tightly coupled. For LC-MSn ion trap data, this is relatively straightforward and is what is depicted in the examples. A run is a series of scans, usually counted consecutively by the instrument, obtained as a sample is injected into the instrument. A sample is the biomaterial that is injected into an instrument over a run. The source file is the one or more files from which the mzML was generated. It will usually be a single vendor-format raw file. It could be an mzXML file. It could (unfortunately) be a series of dta files. For MALDI or gel spot processing, however, this might be quite different. We had previously entertained "analyte identifier" or "MALDI spot identifier" CV terms to allow annotating each spot. This might make the run-based sample undefined. I think LC-MSn heavily colored our thinking during development. It would be extremely nice to have a detailed example of data where individual scans refer to different "analyte identifiers" or the like. Would someone contribute this? Regards, Eric <snip> |
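On the cvLabel/accession point above: if every accession is written as cvLabel, colon, number (the "MS:xxxxxxxx" pattern), a parser can recover the label by splitting at the first colon, which is the redundancy Matt is pointing at. A small hedged C++ sketch; the function name is an assumption and is not part of any mzML tooling.

#include <string>
#include <utility>

// Split an accession of the form "<cvLabel>:<number>" into its label and
// local part. Returns an empty label when no colon is present, which is
// exactly the case where a separate cvLabel attribute would still be needed.
std::pair<std::string, std::string> splitAccession(const std::string& accession) {
    const std::string::size_type colon = accession.find(':');
    if (colon == std::string::npos) return { std::string(), accession };
    return { accession.substr(0, colon), accession.substr(colon + 1) };
}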
From: Brian P. <bri...@in...> - 2007-08-03 06:12:54
|
Mike - You're dead right, and AFAIK there isn't anything in the schema that's there just to serve RAMP or any other API. Well, you might say that about the (optional) index element, I suppose, which is actually where we started ("what does the index look like?")... I just don't want to see us adding stuff to the schema that breaks at least two popular mass spec reader APIs and gums up the internals of a lot of other code if we don't absolutely have to, and so far nobody has said why we absolutely have to (there may be an excellent reason, of course - I wasn't there when the decision was made - but it has the ring of "hey, wouldn't it be neat if related runs could travel in the same file?" [maybe it would - but that's scope creep and best handled by a wrapper schema, or just a tarball]). Brian -----Original Message----- From: Mike Coleman [mailto:tu...@gm...] Sent: Thursday, August 02, 2007 7:58 PM To: Brian Pratt Cc: psi...@li... Subject: Re: [Psidev-ms-dev] mzML 0.93 ready for first review <snip> I'm not sure what to think about the multiple runs idea, but I would like to say something about API's. I think that it's very important that this spec stand alone without any assumed API or library code. I'm glad that you are working on RAMP, which I'm sure has helped find problems with the spec and will be useful to many people. But, I think the planning must be for many implementations of readers and writers of the spec (I will probably end up writing both myself at our site). I wouldn't consider the spec a success unless many programmers feel like they understand it well enough to implement their own readers and writers. (Whether or not they *should* do so will, as always, depend on many factors.) Mike |
From: Mike C. <tu...@gm...> - 2007-08-03 02:57:52
|
On 8/2/07, Brian Pratt <bri...@in...> wrote: > OK, I see the disconnect - you aren't using an API for reading mass spec > data, you're using an API for reading XML (expat - an excellent choice). > You're speaking in terms of "the parser", but the APIs we're concerned with > (RAMP, JRAP) are front ends to multiple parsers and they abstract the mass > spec file format choice away from the logic that deals with mass spec data, > which keeps us from needing to change a couple dozen programs (along with > others we don't even know about, since RAMP and JRAP are open source) when a > new format pops up. I'm not sure what to think about the multiple runs idea, but I would like to say something about API's. I think that it's very important that this spec stand alone without any assumed API or library code. I'm glad that you are working on RAMP, which I'm sure has helped find problems with the spec and will be useful to many people. But, I think the planning must be for many implementations of readers and writers of the spec (I will probably end up writing both myself at our site). I wouldn't consider the spec a success unless many programmers feel like they understand it well enough to implement their own readers and writers. (Whether or not they *should* do so will, as always, depend on many factors.) Mike |