From: Angel P. <an...@ma...> - 2007-10-04 17:59:32
On 10/4/07, Matthew Chambers <mat...@va...> wrote:
> I'll comment here on the mzML schema and validation of mzML instances. I do not see why a proper XML schema with semantic significance could not be generated for mzML. XML schemas have the capability to provide robust restrictions on both elements and attributes, and such a schema could be automatically generated from the CV itself (when combined with a skeleton model of mzML).

This is an interesting idea, but as you mention below there are no tools for doing this, so if you have a CS masters student available .... ;)

> Some people complain that mzML is not true XML. That's rather misleading.

+1 on that. mzML is valid and real XML. It just isn't using the enumerated values of XML Schema.

-angel
From: Angel P. <an...@ma...> - 2007-10-04 17:56:12
On 10/4/07, Mike Coleman <tu...@gm...> wrote:
> On 10/4/07, Matthew Chambers <mat...@va...> wrote:
> > Oh yes, the userParam. A synonym for the <comment> element ;). Please tell me how to use such an element in a meaningful and deterministic way. If I write a value into a cvParam with the category "instrument model" where the value text is "Super Duper Ion Trap" and the value's accession number is a special accession number which means "not yet in CV", ANY reader software should be able to interpret that parameter and ultimately say that it has no idea what to do with data from such an instrument.
>
> I agree with Matt here. In particular, if I encounter this new "Super Duper Ion Trap" for the first time, it would be completely unacceptable for my software to indicate this by saying that my mzML file is invalid. My software needs to be able to parse this file and tell me that the data came from a new instrument called "Super Duper Ion Trap" that it doesn't know how to deal with.

WRT my point about operational vs. repository data formats: for a repository, it is completely valid (and desirable) for the software to parse this new value and add it to the list of possible values for the ontology category.

-angel
From: Mike C. <tu...@gm...> - 2007-10-04 17:08:07
On 10/4/07, Matthew Chambers <mat...@va...> wrote:
> Oh yes, the userParam. A synonym for the <comment> element ;). Please tell me how to use such an element in a meaningful and deterministic way. If I write a value into a cvParam with the category "instrument model" where the value text is "Super Duper Ion Trap" and the value's accession number is a special accession number which means "not yet in CV", ANY reader software should be able to interpret that parameter and ultimately say that it has no idea what to do with data from such an instrument.

I agree with Matt here. In particular, if I encounter this new "Super Duper Ion Trap" for the first time, it would be completely unacceptable for my software to indicate this by saying that my mzML file is invalid. My software needs to be able to parse this file and tell me that the data came from a new instrument called "Super Duper Ion Trap" that it doesn't know how to deal with.

Mike
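[Editor's note: the tolerant-reader behavior Matt and Mike describe, in which an unknown "instrument model" accession degrades gracefully to the value text rather than a validation failure, can be sketched in a few lines. This is an illustrative sketch only, not code from any mzML toolkit; the lookup table and its accession numbers are placeholders, and the element layout follows the cvParam examples quoted in this thread.]

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table of accessions this reader already understands.
# These accession numbers are placeholders, not actual MS CV entries.
KNOWN_INSTRUMENTS = {
    "MS:0000001": "LCQ Deca",
    "MS:0000002": "LTQ FT",
}

def describe_instrument(fragment: str) -> str:
    """Describe the instrument in a cvParam fragment without ever rejecting it.

    Unknown or "not yet in CV" accessions fall back to the human-readable
    value text, so the data stays usable while the CV catches up.
    """
    root = ET.fromstring(fragment)
    for cv in root.iter("cvParam"):
        if cv.get("name") == "instrument model":
            acc = cv.get("accession", "")
            if acc in KNOWN_INSTRUMENTS:
                return KNOWN_INSTRUMENTS[acc]
            # Surface the value text instead of failing validation.
            return "unknown instrument: " + (cv.get("value") or acc)
    return "no instrument model found"

frag = ('<instrument><cvParam cvLabel="MS" accession="MS:9999999" '
        'name="instrument model" value="Super Duper Ion Trap"/></instrument>')
print(describe_instrument(frag))  # unknown instrument: Super Duper Ion Trap
```

A reader built this way can later be taught the new instrument by value text, exactly as Matt suggests, and switch to the proper accession once the CV adds the term.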
From: Matthew C. <mat...@va...> - 2007-10-04 17:05:45
I'll comment here on the mzML schema and validation of mzML instances. I do not see why a proper XML schema with semantic significance could not be generated for mzML. XML schemas have the capability to provide robust restrictions on both elements and attributes, and such a schema could be automatically generated from the CV itself (when combined with a skeleton model of mzML).

Some people complain that mzML is not true XML. That's rather misleading. Others say it needs a special "semantic" validator with its own mapping file. I say that is duplicative and even overkill. Existing schema technology can handle the format specified here, but I grant that the schema WILL have to be very complicated (you won't just have a single cvParam type or ParamGroupType; each part of the schema will have its own cvParam elements with semantically relevant restrictions on the accession numbers) and almost certainly should be machine-generated.

I see nothing wrong with a complicated schema, though, because the variety of data that we are intending to represent is also very complicated! I don't know if existing automatic code generators work for very complicated schemas, but the automatic XML validators definitely should, and thus the need for a separate "semantic" validator is unclear to me when the semantic relationships can be encapsulated in an automatically generated XML schema.
For example, the <contact> element could be defined semantically in XML schema like this:

  <xs:complexType name="ContactParamGroupType">
    <xs:sequence>
      <xs:element name="paramGroupRef" type="dx:ContactParamGroupRefType" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element name="cvParam" minOccurs="0" maxOccurs="1">
        <xs:complexType>
          <xs:attribute name="cvLabel">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="accession">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS:1000586"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="name">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="contact name"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="value" type="xs:string"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="cvParam" minOccurs="0" maxOccurs="1">
        <xs:complexType>
          <xs:attribute name="cvLabel">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="accession">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS:1000587"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="name">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="contact address"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="value" type="xs:string"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="cvParam" minOccurs="0" maxOccurs="1">
        <xs:complexType>
          <xs:attribute name="cvLabel">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="accession">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS:1000588"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="name">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="contact URL"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="value" type="xs:anyURI"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="cvParam" minOccurs="0" maxOccurs="1">
        <xs:complexType>
          <xs:attribute name="cvLabel">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="accession">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="MS:1000589"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="name">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="contact email"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
          <xs:attribute name="value" type="dx:email"/>
        </xs:complexType>
      </xs:element>
      <xs:element name="userParam" type="dx:UserParamType" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="contact" type="dx:ContactParamGroupType" minOccurs="0" maxOccurs="unbounded"/>

Like I said, this needs to be machine-generated, but it would create an XML schema that removes the need for any other kind of semantic mapping and for any new tool to do the validation with that mapping. Now that I think about it again, this kind of often-updated schema would violate the stability requirement from the specification: "It was hoped that the actual xsd schema could remain stable for many years while the accompanying controlled vocabulary could be frequently updated to support new technologies, instruments, and methods of acquiring data." But what is the difference between a frequently updated mapping file which is REQUIRED to get semantic validation, and a frequently updated primary schema which is REQUIRED to get semantic validation?

-Matt

Lennart Martens wrote:
> That mapping file is effectively in use by our mzML semantic validator, for exactly the reasons you outlined above!
> So yes - this has been made available in the larger mzML kit and has also been implemented online (your above example indeed does not validate).
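[Editor's note: the "automatically generated from the CV" step Matt describes amounts to stamping out one pattern-restricted cvParam declaration per CV term. A minimal, hypothetical generator is sketched below; the two terms come from the contact example above, and the template covers only the accession and name attributes for brevity.]

```python
# Sketch: machine-generate pattern-restricted cvParam declarations from
# (accession, name) pairs. The generator itself is hypothetical; the terms
# are the contact-name and contact-address entries used in the email above.
CONTACT_TERMS = [
    ("MS:1000586", "contact name"),
    ("MS:1000587", "contact address"),
]

TEMPLATE = """\
<xs:element name="cvParam" minOccurs="0" maxOccurs="1">
  <xs:complexType>
    <xs:attribute name="accession">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="{accession}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="name">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="{name}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="value" type="xs:string"/>
  </xs:complexType>
</xs:element>"""

def generate_cvparam_decls(terms):
    """Emit one restricted cvParam element declaration per CV term."""
    return "\n".join(TEMPLATE.format(accession=a, name=n) for a, n in terms)

print(generate_cvparam_decls(CONTACT_TERMS))
```

Rerunning such a generator whenever the CV updates is exactly the "frequently updated primary schema" trade-off Matt raises.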
From: Sean L S. <Sey...@ap...> - 2007-10-04 17:02:05
I will be out of the office starting 09/13/2007 and will not return until 10/12/2007. I will have no access to email.
From: Matthew C. <mat...@va...> - 2007-10-04 16:57:25
Thanks Angel, I didn't intend for the discussion to get heated; it just seemed to me that Lennart didn't understand what I posted (which may be my fault, it's hard to know without other replies). Remember I posted that I agree with cvParams and appreciate the flexibility they provide. But there is a difference between cvParams that have meaning without the CV and cvParams that don't, and I much prefer the former. So neither of us is arguing for cvParams to go away. You must be talking to somebody else. :)

-Matt

Angel Pizarro wrote:
> Lennart and Matt,
>
> While I appreciate that this is a topic of great interest to everyone in the community, let's turn the heat down a bit. Let me see if I can play the arbiter here:
>
> cvParams since their introduction have always been contentious. Given the choice, when designing a data format, between attributes (or sub-elements or inner text) encoded with a tight set of enumerated values vs. empty slots, a developer will always choose the former.
>
> Why then did the mzML group choose cvParams? The answer is twofold: 1) the audience, and 2) the intent of the standard.
>
> 1) Name one standard that has received industry support across multiple vendors/tools/institutions that is tightly controlled with enumerated values. Prove me wrong, but I can't think of any.
>
> The reason for this is that consensus building is a slow process, and approval of any change in a data format can take months if not years. You need flexible data formats for standards. This already rules out enumerated values, but you can also make the case that vendors are unwilling to tie their development efforts to projects that are not under their complete control (essentially motivated by risk management). As a vendor, if you officially support even one release of a fast-moving data format, customer expectations are such that you are now expected to support all future releases of that format.
>
> 2) The intent of mzML is data transfer and vendor-independent storage of mass spec experimental data. It is not (officially) meant to be an operational format. Operational formats would put much more weight on the side of enumerated values.
>
> So for these reasons (there are more, though) cvParams are not going to go away. As for actually doing work with mzML files, Matt is absolutely right: this is going to be way more difficult than working with mzXML 2.x (as a developer). While OLS is a fine and dandy project, it is not the end-all be-all solution to our problems. It assumes network connectivity, which is a dubious assumption. Even assuming very fast connectivity, the overhead of SOAP protocols is way too big to accept in your typical use of mzML files, which is signal processing and searches. Please stop equating OLS with mzML (or any other ML) since for most uses outside of a repository it just won't work. -a
From: Mike C. <tu...@gm...> - 2007-10-04 16:52:41
On 10/4/07, Lennart Martens <len...@gm...> wrote:
> This is no use. It immediately breaks down in the face of synonyms. Accession numbers are the way to go. Everybody in the life sciences knows and understands this principle ('9606' is 'human' or 'Homo sapiens' or 'man' or ...)

Hmm. I think what you are saying is that end users are not always able to properly distinguish between canonical *identifiers* (e.g., '9606' or 'human') and descriptive text unless the former happens to look like a meaningless string, such as a string of digits. That may be, but strings of digits have their own problems. It's a lot easier to see that 'humaZ' is probably an invalid identifier than that '9607' is, when looking for (the inevitable) problems. I think that biologists understand the value of having semi-meaningful identifiers. They don't use digit strings for gene identifiers, for example.

> That would make for very poor mzML documents then, as we semantically validate these files now (see the semantic validator in the beefier mzML kit). Your CV-less files would surely not validate, and would NOT be mzML files.

Hmm. How complex is a minimal valid mzML file? If they're not fairly easy to generate, without knowing much about the CV, this seems like a problem.

> Sorry, but you are erroneously jumping to conclusions. The CV allows children to be added dynamically, correct usage of these can be validated, and the list of children can be updated on-the-fly from web resources like the OLS (which auto-update every night).

I'm not sure what this means. A nightly update of terms from the web cannot be on our critical path for processing of spectra. We need to be able to proceed even if the OLS disappears forever.

> Again, you fail to see the point. The correct usage of CV terms can be validated. So if you mistype a number or its prefix, this will be considered an error. We need numbers because we want to be able to deal with synonyms (or even outright changes in the term names; it has happened before). Numbers are robust, numbers are convenient, numbers are strong. Text is not.

Actually, it's the other way around. Character strings are robust and convenient; numbers are not. The string 'human' is clearly not equal to 'humaZ'. The string '123' is clearly not equal to the string '0123'. Is the number 123 the same as or different from 0123? How about 0 and -0, not to mention 123.4 and 123.40, or 0.999999999999 and 1.0? The use of numbers in a context like this seems to be mostly due to history. They may be a little more convenient for programmers, but that's negligible.

> Remember that powerful and extremely user-friendly tools like the OLS take care of updating new terms for you fully automatically.

This phrase "powerful and extremely user-friendly tools" is a little scary. It implies having to learn, debug, etc., another piece of software--one not necessarily under our control. To be truly useful, the spec really has to stand on its own (possibly referencing other specs and data).

> I seem to read in your comments so far that there is a certain reluctance to the use of CV terms because this is new, and doesn't fit well with what you are good at right now. I would ask that you have a look at CVs on OLS (http://www.ebi.ac.uk/ols), and read the developer documentation on how to access the OLS web services using your favourite programming language. After playing with it a bit, you'll notice that incorporating CVs into the parsing is not that much work, yet yields very clear benefits.

I don't even have time to keep up with this list, and the benefits of OLS are far from clear.

Mike
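[Editor's note: Mike's string-vs-number point is easy to demonstrate in any language. The sketch below shows that comparing identifiers as strings keeps distinct spellings distinct, while coercing them to numbers silently merges them; the taxon lookup illustrates his '9607' example.]

```python
# Identifiers compared as strings stay distinct; the same tokens
# coerced to numbers are silently merged -- Mike's point exactly.
assert "123" != "0123"             # distinct as identifiers
assert int("123") == int("0123")   # merged once treated as numbers

# Trailing zeros after a decimal point behave the same way:
assert "123.4" != "123.40"
assert float("123.4") == float("123.40")

# A typo in a text identifier ('humaZ') is visibly wrong, while an
# off-by-one numeric accession ('9607') looks perfectly plausible:
taxa = {"9606": "Homo sapiens"}
print(taxa.get("9607", "no such taxon"))  # no such taxon
```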
From: Angel P. <an...@ma...> - 2007-10-04 16:44:05
Lennart and Matt,

While I appreciate that this is a topic of great interest to everyone in the community, let's turn the heat down a bit. Let me see if I can play the arbiter here:

cvParams since their introduction have always been contentious. Given the choice, when designing a data format, between attributes (or sub-elements or inner text) encoded with a tight set of enumerated values vs. empty slots, a developer will always choose the former.

Why then did the mzML group choose cvParams? The answer is twofold: 1) the audience, and 2) the intent of the standard.

1) Name one standard that has received industry support across multiple vendors/tools/institutions that is tightly controlled with enumerated values. Prove me wrong, but I can't think of any.

The reason for this is that consensus building is a slow process, and approval of any change in a data format can take months if not years. You need flexible data formats for standards. This already rules out enumerated values, but you can also make the case that vendors are unwilling to tie their development efforts to projects that are not under their complete control (essentially motivated by risk management). As a vendor, if you officially support even one release of a fast-moving data format, customer expectations are such that you are now expected to support all future releases of that format.

2) The intent of mzML is data transfer and vendor-independent storage of mass spec experimental data. It is not (officially) meant to be an operational format. Operational formats would put much more weight on the side of enumerated values.

So for these reasons (there are more, though) cvParams are not going to go away. As for actually doing work with mzML files, Matt is absolutely right: this is going to be way more difficult than working with mzXML 2.x (as a developer). While OLS is a fine and dandy project, it is not the end-all be-all solution to our problems. It assumes network connectivity, which is a dubious assumption. Even assuming very fast connectivity, the overhead of SOAP protocols is way too big to accept in your typical use of mzML files, which is signal processing and searches. Please stop equating OLS with mzML (or any other ML) since for most uses outside of a repository it just won't work. -a
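[Editor's note: the standard answer to Angel's connectivity objection is to ship the CV as a local OBO file and load it once, so validation never touches the network. Below is a deliberately minimal, hypothetical OBO stanza parser; real psi-ms.obo files carry many more fields per term. The two sample terms ("instrument model", "spectrum type") are ones quoted elsewhere in this thread.]

```python
def load_obo_terms(text: str) -> dict:
    """Parse minimal OBO [Term] stanzas into {accession: name}.

    A local cache like this removes the OLS network dependency entirely.
    """
    terms, current = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}                     # start a new stanza
        elif line.startswith("id: "):
            current["id"] = line[4:]
        elif line.startswith("name: "):
            current["name"] = line[6:]
            if "id" in current:
                terms[current["id"]] = current["name"]
    return terms

SAMPLE = """\
[Term]
id: MS:1000031
name: instrument model

[Term]
id: MS:1000035
name: spectrum type
"""

cv = load_obo_terms(SAMPLE)
print(cv["MS:1000031"])  # instrument model
```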
From: Matthew C. <mat...@va...> - 2007-10-04 16:12:23
Hi Lennart,

Lennart Martens wrote:
> > As for attributes vs. cvParams, I have a compromise to propose between methods A, B and C. I earlier proposed an extension to the structure of the CV which would be intended to force format writers to use certain well-defined values instead of whatever kind of capitalization and spacing they wish. That proposal still stands and I'd like to hear feedback on it.
>
> This is no use. It immediately breaks down in the face of synonyms. Accession numbers are the way to go. Everybody in the life sciences knows and understands this principle ('9606' is 'human' or 'Homo sapiens' or 'man' or ...)

I am a mere computer scientist, and to me all characters on computers are numbers. ;) But I know what you are saying, and I have taken that into consideration. That is why my suggestion was for the CV to CONTROL the synonyms and not let the synonyms be written but one way in VALID mzML. From a technical perspective, this is no different than controlling the accession numbers. From a practical perspective, I appreciate that some users might not be comfortable with having their options for text-based value attributes be controlled like they are for accession numbers, and if that's the majority perspective then I'm fine with using accession numbers for values.

> > But I think we should agree on some basic requirements and then evaluate proposals from there (this was probably done in one of your meetings or teleconferences, but I don't recall such a requirements list being posted on this mailing list). According to the specification document, there is a requirement to have a long-term, unchanging specification, mainly due to vendor interests it seems, which of course in the changing field of MS also means a requirement of a companion CV. I happen to agree with the idea of having a long-term, unchanging specification with a CV, even though I don't intend to use the CV very much, if at all.
>
> That would make for very poor mzML documents then, as we semantically validate these files now (see the semantic validator in the beefier mzML kit). Your CV-less files would surely not validate, and would NOT be mzML files.

Um, excuse me, but I'm perfectly capable of writing and reading valid mzML without using a CV web service or any kind of external validation. It may take a /bit/ of manual effort, but it's entirely possible. Of course, if you go with method A for the cvParams and the manual parser has to have an else/if for every possible value's accession number, then you're talking about a LOT of manual effort. But with method B or C, not much at all.

> > From a previous post by Eric Deutsch in this thread:
> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ Deca"/>
> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ DECA"/>
> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ FT"/>
> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ-FT"/>
> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQFT"/>
> > OK, so because of this legitimate concern we have another requirement: the spec must allow defining a restricted value set for categories like "instrument model."
>
> Sorry, but you are erroneously jumping to conclusions. The CV allows children to be added dynamically, correct usage of these can be validated, and the list of children can be updated on-the-fly from web resources like the OLS (which auto-update every night).

I don't understand what you're saying here. Are you saying that we do NOT have a requirement of the spec needing to restrict the value set for a given cvParam category? I don't understand the relevance of the updatability of the CV in this context.

> > I do not see a reason for a requirement that the spec must use accession numbers to enumerate those values. Consider, for example, that we have not specified whether the cvLabel parameter is case sensitive or not. Suppose a naughty writer starts using lowercase instead of uppercase for the cvLabel, or for the cvLabel prefix on the accession number. Even worse, suppose the case sensitivity between the accession number's prefix and the cvLabel don't match. The best we can do is specify things like case sensitivity for these issues or force a certain case in certain contexts. We can't prevent people from writing broken instances of the specification.
>
> Again, you fail to see the point. The correct usage of CV terms can be validated. So if you mistype a number or its prefix, this will be considered an error. We need numbers because we want to be able to deal with synonyms (or even outright changes in the term names; it has happened before). Numbers are robust, numbers are convenient, numbers are strong. Text is not.

Ah, so you WANT support for synonyms. I don't really understand that in the context of writing a standard data-representation format, but OK.

> > Based on the above requirement, one concern that I have (and I think many others do too, because frankly I get a strong impression that many people who want to use this spec don't care about being CV aware) is that a writer should be able to write a cvParam with a value that is not in the allowed value set of the CV without making readers have no clue what the value is actually indicating. In other words, regardless of whether a reader is CV aware or not, a (machine OR human) reader should be able to glean the purpose of an unknown value in a cvParam via some kind of category specification (e.g. "instrument model", or by the category's accession number). If this is accepted as a requirement, it practically eliminates method A as an option because it provides no indication of what category the unknown cvParam's value belongs to.
>
> There is the option to include userParams. Alternatively, you take the productive approach and signal the need to add the term to the CV. Remember that powerful and extremely user-friendly tools like the OLS take care of updating new terms for you fully automatically. If you need to know the context of a term, referring to the CV should be your first and most prominent approach.

Oh yes, the userParam. A synonym for the <comment> element ;). Please tell me how to use such an element in a meaningful and deterministic way. If I write a value into a cvParam with the category "instrument model" where the value text is "Super Duper Ion Trap" and the value's accession number is a special accession number which means "not yet in CV", ANY reader software should be able to interpret that parameter and ultimately say that it has no idea what to do with data from such an instrument. The reader software can even be updated to know how to deal with that instrument by its value text instead of the value accession number, and once that's done some usable data already exists. Nobody had to wait for that instrument model to be added to the CV for the data to be usable. After that instrument model is added to the CV, of course, the writer should be updated to use the proper accession number. If a reader is using the CV tools, their parser will be capable of reading such data automatically, and any reader that chose to manually update in order to deal with the value text while the value accession indicated "not yet in CV" can then choose whether to keep that support intact in order to deal with the data that was already generated, or it can remove it and return to using the pure CV. If a primary goal is "flexibility," then forcing people to add a web service to their XML parser in order to get the CV is seriously stretching that goal.

> > There are perhaps other requirements for the cvParam, but I'll let others fill them in. My new proposed compromise is to split values into a valueAccession and a valueName, just like the optional unitAccession and unitName. The two value attributes would not be optional like the unit attributes, though. A special CV accession number could be allocated to indicate an "unrestricted" value, in which case the reader would use the valueName as the value. Alternatively, the reader could read the accession attribute (which in this compromise would always indicate a category's accession number) and choose based on that whether to look up the valueAccession in the CV or to use the valueName verbatim. So the SRM spectrum example would become:
> > <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" valueAccession="MS:1000583" valueName="SRM spectrum"/>
>
> For various complex reasons, this will wreak havoc. Because now the two (accession and value accession) run the (unnecessary!) risk of being able to go out of sync.

I see you have not elected to enumerate these various complex reasons, or describe what on earth you mean by having the accession numbers go out of sync. I think you failed to notice that this compromise is very similar to method C, which in a recent post you put in your (tied) vote for. In my opinion, it looks better and is more intuitive than the syntax in method C, but the semantics are exactly the same. In method C it would look like:

<cvParam cvLabel="MS" categoryAccession="MS:1000035" categoryName="spectrum type" accession="MS:1000583" name="SRM spectrum"/>

You see? Straight out of the specification document. Were you perhaps referring to the special accession numbers? I proposed one that would mean that the value is "unrestricted" and another that would mean that the current value is not yet added to the CV but has been (or soon will be) submitted for adding (patent pending, if you will).

> I seem to read in your comments so far that there is a certain reluctance to the use of CV terms because this is new, and doesn't fit well with what you are good at right now. I would ask that you have a look at CVs on OLS (http://www.ebi.ac.uk/ols), and read the developer documentation on how to access the OLS web services using your favourite programming language. After playing with it a bit, you'll notice that incorporating CVs into the parsing is not that much work, yet yields very clear benefits.

You read correctly. The clear benefits that the CV provides are not having to update the parser manually to deal with new CV terms and having a unified set of categories and values from which to generate data models. Excuse my rudeness, but: Whoopdeedoo! The vast majority of development effort is NOT in the parser, regardless of whether the parser is automatically or manually written. The vast majority of development is in the PROCESSING of the data that gets parsed, and unless I'm missing something big, the CV provides no benefit at all for processing new kinds of data. I'm NOT suggesting that the CV should provide such a benefit, of course, only trying to convey the reason for my reluctance. In other words, I have no qualms about writing a new "else if" block in my parser every time a new kind of data comes out, considering that I will always have to add 500 other lines of code elsewhere in my software to actually process the new kind of data in a meaningful way.

-Matt
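[Editor's note: Matt's valueAccession/valueName compromise implies a specific reader behavior, sketched below. The cvParam attributes are taken from his example; the "unrestricted" accession and the lookup table are hypothetical placeholders.]

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder for the proposed "unrestricted value" accession.
UNRESTRICTED = "MS:0000000"

def resolve_value(cvparam: ET.Element, cv_lookup: dict) -> str:
    """Resolve a cvParam under Matt's proposal: use the CV name when the
    valueAccession is known, otherwise fall back to valueName verbatim."""
    acc = cvparam.get("valueAccession", UNRESTRICTED)
    if acc != UNRESTRICTED and acc in cv_lookup:
        return cv_lookup[acc]
    return cvparam.get("valueName", "")

el = ET.fromstring('<cvParam cvLabel="MS" accession="MS:1000035" '
                   'name="spectrum type" valueAccession="MS:1000583" '
                   'valueName="SRM spectrum"/>')
print(resolve_value(el, {"MS:1000583": "SRM spectrum"}))  # SRM spectrum
print(resolve_value(el, {}))  # SRM spectrum (verbatim fallback, no CV needed)
```

Note that either path yields usable output, which is the core of the argument: a CV-unaware reader degrades to the text, not to a failure.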
From: Angel P. <an...@ma...> - 2007-10-04 13:22:05
|
so where is this mzML kit that you mention? With the OLS? -angel On 10/4/07, Lennart Martens <len...@gm...> wrote: > > > So yes - this has been made available in the larger mzML kit and has > also been implemented online (your above example indeed does not > validate). > > |
From: Lennart M. <len...@gm...> - 2007-10-04 10:53:35
|
Hi Andy, > The decision about how to implement CV terms is pretty important and we should try to come up with a coherent policy across PSI if possible. Here are my thoughts: > > A while back Luisa and myself drafted a proposal for mapping model elements to CV terms that may simplify some of the problems currently being worked through. The draft and sample instance are here: http://www.psidev.info/index.php?q=node/159 (see Mapping between exchange schema and CVs). > > I would strongly vote for option A, and in addition maintain a mapping file. This is more work for the CV coordinators (but hopefully can be mainly automated), and would force software implementers to interact with the CV WG when they need new terms, but given the heavy reliance on CV terms in the mzML schema I see no way around this. > > If a mapping file is kept updated in parallel to the CV, software can check whether a valid term has been provided for a particular model element. In the example of spectrumType, the mapping file would specify that only child terms of spectrumType are allowed (e.g. for the model element fileContent). If a vendor publishes a file with: > > <fileContent> > <cvParam cvLabel="MS" accession="MS:9999999" name="SRM spectrum" value=""/> > </fileContent> > > This would automatically be rejected by the validator (or at least a warning output), as it should be, since there's no point having a CV where the terms are not controlled! That mapping file is effectively in use by our mzML semantic validator, for exactly the reasons you outlined above! So yes - this has been made available in the larger mzML kit and has also been implemented online (your above example indeed does not validate). Cheers, lnnrt. |
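The mapping-file check that Andy describes and Lennart's validator applies (only child terms of "spectrum type" allowed under fileContent, so "MS:9999999" is rejected) reduces to a small membership test. The parent map below is a toy subset with invented entries; the real validator reads ms-mapping.xml and the full CV.

```python
# Minimal sketch of the mapping-file idea: fileContent may only carry
# child terms of "spectrum type" (MS:1000035). The is_a map is a toy
# fragment, not the real PSI-MS CV.
PARENT = {
    "MS:1000583": "MS:1000035",  # SRM spectrum  is_a  spectrum type
    "MS:1000580": "MS:1000035",  # MSn spectrum  is_a  spectrum type
}

# Hypothetical rendering of one mapping-file rule.
ALLOWED_PARENT = {"fileContent": "MS:1000035"}

def validate(element, accession):
    """True iff the accession is a known child of the element's allowed parent."""
    return PARENT.get(accession) == ALLOWED_PARENT.get(element)

print(validate("fileContent", "MS:1000583"))  # True
print(validate("fileContent", "MS:9999999"))  # False: unknown term rejected
```

Because the rule names a parent term rather than enumerating children, newly added CV children validate with no code change, which is the property Lennart emphasizes.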
From: Lennart M. <len...@gm...> - 2007-10-04 10:45:24
|
Hi Marc, > (2) Semantic validator > The semantic validator is a nice feature, but I think you must publish a > file that defines the mapping of CV terms to the schema. > This file must answer questions like: Where can I use which term? How > often can I repeat a term? etc. > With the heavy use of CV terms such a file is a non-optional part of the > format definition. > What happened to that format Luisa proposed? It is included :). Look in the 'ms-mapping.xml' file. It is (quite literally so) Luisa's file. The whole validator relies on a role-based 'separation of concerns', so that the application is nearly 100% dynamically configured. It is a nice piece of work that we are currently writing up in order to publish it. Meanwhile, I'd be happy to provide more information on how the whole thing works. Just let me know what you want to learn. > (4) General > Finally I'd like to say that I agree with Brian Pratt. There is too much > CV and too little XML in the format for my taste. > I don't argue against CV in general; it's a nice technique that allows > the schema to be stable for a long time. > But now everything is in the CV and there are hardly any XML attributes > left. This makes the format hard to implement and impossible to check > with an XML validator. > And I don't see the advantage in most cases: I have to adapt the > software to new terms just as I would adapt it to new XML elements. If you could use software that answered simple CV questions like 'what is the parent of X', or 'get children for X', or 'is X one of the children of Y (optionally with maximum Z generations)' (for instance); and if this software is on the net and always up-to-date, would that still mean you always have to redo everything? I at least wouldn't expect so. It just requires a new way of dealing with the content of the file (which again, is what matters).
Also remember that the semantic validator, in series after a schema validator, provides maximum validation for a file like an mzML file - both structure and content are thoroughly verified (and nearly 100% dynamically configured - zero recoding necessary when new children get added, for instance). Cheers, lnnrt. |
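The hierarchy queries Lennart lists ('is X one of the children of Y, optionally with maximum Z generations') amount to walking is_a links upward. A self-contained sketch over a generic, invented parent map (assumed acyclic), standing in for what an OLS client or OBO parser would provide:

```python
# Sketch of an is_a-hierarchy query over a term -> parent map.
# The map is assumed acyclic, as OBO is_a hierarchies are.
def is_descendant(term, ancestor, parents, max_generations=None):
    """True if `ancestor` is reachable from `term` via is_a links,
    optionally within `max_generations` steps."""
    generation = 0
    while term in parents:
        term = parents[term]
        generation += 1
        if max_generations is not None and generation > max_generations:
            return False
        if term == ancestor:
            return True
    return False

parents = {"C": "B", "B": "A"}  # toy fragment, not the real PSI-MS CV
print(is_descendant("C", "A", parents))                     # True
print(is_descendant("C", "A", parents, max_generations=1))  # False
```

With the map refreshed from the CV (e.g. nightly, as the OLS does), this kind of check keeps working when new terms are added below an existing parent.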
From: Jones, A. [jonesar] <And...@li...> - 2007-10-04 10:40:05
|
Hi all, The decision about how to implement CV terms is pretty important and we should try to come up with a coherent policy across PSI if possible. Here are my thoughts: A while back Luisa and myself drafted a proposal for mapping model elements to CV terms that may simplify some of the problems currently being worked through. The draft and sample instance are here: http://www.psidev.info/index.php?q=node/159 (see Mapping between exchange schema and CVs). I would strongly vote for option A, and in addition maintain a mapping file. This is more work for the CV coordinators (but hopefully can be mainly automated), and would force software implementers to interact with the CV WG when they need new terms, but given the heavy reliance on CV terms in the mzML schema I see no way around this. If a mapping file is kept updated in parallel to the CV, software can check whether a valid term has been provided for a particular model element. In the example of spectrumType, the mapping file would specify that only child terms of spectrumType are allowed (e.g. for the model element fileContent). If a vendor publishes a file with: <fileContent> <cvParam cvLabel="MS" accession="MS:9999999" name="SRM spectrum" value=""/> </fileContent> This would automatically be rejected by the validator (or at least a warning output), as it should be, since there's no point having a CV where the terms are not controlled! Option B <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" value="SRM spectrum"/> looks particularly bad to me, since there is no check that correct values are given. As was mentioned elsewhere on the list, you run into problems with upper/lower case, spacing etc. If software is going to rely on particular values being present, those values must be in the CV with persistent identifiers. I believe OBO does not have the ability to distinguish between ontological classes (i.e.
the branch structure) and instances/individuals (i.e. leaf nodes used as values to annotate data). Again, this could be handled by the mapping file that specifies which terms can be used to annotate model elements. A related point: in mzData, there is inconsistent usage of the value slot, since the specification has no ability to say whether a value (and a unit) should be given or not, e.g. for the term "sample mass (MS:1000004)" software should know that a value and unit must be given. It is reasonable that software should be able to check whether to expect a value or not for particular CV terms. Logically, this should be part of the CV itself, but as far as I'm aware OBO does not have this capability. One solution would be to add this to the mapping file as two Booleans on the cvTerm (allowsValue = "true/false" and requiresUnit = "true/false"). Cheers Andy > -----Original Message----- > From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Marc Sturm > Sent: 04 October 2007 09:06 > To: psi...@li... > Subject: Re: [Psidev-ms-dev] mzML 0.99.0 submitted to document process > > Hi all, > > first of all I would like to thank Eric and all the others in the > working group for their effort. > Here are my comments: > > (1) The new CV term problem > A is clear and simple. > B is simply a bad idea in my opinion. Why not use the child accession if > we have it? > C helps the software to know where the new term belongs, but the > software does not know what to do with it in most cases. I think most > software implements these enum-like CV terms as enum types and thus > cannot handle new values anyway. Additionally it is error prone > (mismatching parent and child). > > As C is an extension of A, I vote for A or C, but I don't think that C > helps very much.
> > (2) Semantic validator > The semantic validator is a nice feature, but I think you must publish a > file that defines the mapping of CV terms to the schema. > This file must answer questions like: Where can I use which term? How > often can I repeat a term? etc. > With the heavy use of CV terms such a file is a non-optional part of the > format definition. > What happened to that format Luisa proposed? > > (3) Comments to CV / Schema > - The term MS:1000543 "data processing action" is missing some child > terms, I think. What about smoothing, baseline reduction and removal of > low-intensity data points? > - Putting the software name in a CV will cause much trouble, I think. > There are way too many upcoming tools and you will be constantly updating > that obo file. I really think we should put that into a string attribute. > - I would add a new optional and unbounded element "parameter" with > attributes "name", "type", "value" to the dx:dataProcessing element to > store the parameters of the software that were used for processing. > > (4) General > Finally I'd like to say that I agree with Brian Pratt. There is too much > CV and too little XML in the format for my taste. > I don't argue against CV in general; it's a nice technique that allows > the schema to be stable for a long time. > But now everything is in the CV and there are hardly any XML attributes > left. This makes the format hard to implement and impossible to check > with an XML validator. > And I don't see the advantage in most cases: I have to adapt the > software to new terms just as I would adapt it to new XML elements. > > Best regards, > Marc > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
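Andy's allowsValue/requiresUnit proposal above implies that software reads two Booleans per cvTerm from the mapping file. A hypothetical sketch of what consuming such an entry could look like; the element and attribute names are illustrative only and are not taken from the actual draft mapping format.

```python
# Hypothetical mapping-file entry carrying Andy's two proposed Booleans.
# The XML shape shown here is an assumption for illustration.
import xml.etree.ElementTree as ET

entry = ET.fromstring(
    '<cvTerm accession="MS:1000004" name="sample mass" '
    'allowsValue="true" requiresUnit="true"/>'
)

# A validator would use these flags to decide whether a cvParam for this
# term must carry a value attribute and a unit.
allows_value = entry.get("allowsValue") == "true"
requires_unit = entry.get("requiresUnit") == "true"
print(allows_value, requires_unit)
```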
From: Lennart M. <len...@gm...> - 2007-10-04 10:36:06
|
Hi Brian, > This is specious. The fact that mzData hasn’t revved only says to me > that it’s badly underspecified, which the paragraph in fact goes on to > illustrate. The occasional revision of the mzXML schema, to my mind, > indicates a well maintained standard*. A stable schema and evolving > ontology produce as much or more reader/writer code maintenance work as > an evolving schema-only does. PRIDE has a stable schema, yet a rapidly evolving CV. We did not need to recode PRIDE whenever we changed the CV. So from experience: a stable schema + evolving (but initially well-organized) CV is not a problem in terms of maintenance. Having to redo the schema every other month is also possible, but nevertheless more hassle. > It’s not like mzData readers don’t have > to be updated every time something gets added to the ontology. At least > with a schema there are ways to generate code for these kinds of changes > automatically, and to easily validate the results. Frankly when it > comes to data formats I think the term “flexible” is synonymous with > “trouble” – convenient for the writers, hell for the readers, and often > a dead end for that reason. Let me make a black and white scenario for you - you have everything as attributes in the schema, and you auto-generate parsing code every week since you keep adding or changing attributes. Fine, no worries. Zero backwards compatibility, but hey - who cares about yesterday's data, right? And your generated code will swallow anything that is remotely using the right glyphs in those attributes (e.g.: 'I'm not providing sensible information here' as the value for the 'instrument_name' attribute). If your objective is convenience for the programmers (whose job it should be to program), you choose the 'everything in schema' path. If your objective is to transmit meaningful and validated/validatable data, you go the current mzML path. Now which one would make the most sense for a standard?
> I really think mzML will just perpetuate the issues mzData presented. > Better we should figure out a way to generate a proper XML schema based > on the ontology document. The rest of the world uses proper XML, I > really don’t see what makes us special. I do not believe (a) that mzData presents more issues than uses, (b) even if (a) were true, that mzML blatantly propagates these, or (c) that starting from scratch with a far too rigid, implicitly non-backwards-compatible and unvalidatable (content-wise, which is where it matters) data transmission format is the way to go forward. > *note that most of the mzXML revisions had to do with things like adding > data compression to peaklists. It wasn’t getting banged around every > time somebody came out with a new mass spec, like the ontology will. mzML will not get 'banged about' every time a new mass spec is added. That is the whole point. Please do try to understand the relatively simple concept - an addition to the instruments is completely and utterly transparent. Cheers, lnnrt. |
From: Lennart M. <len...@gm...> - 2007-10-04 10:24:52
|
Hi Matt, > Time to reopen this can of worms! I like the specification document. > It's clearly written. Unfortunately there is no clear way that I know > of to capture the semantically valid cvParam relationships in a flat > written document, but that can be done externally and it doesn't bother > me. I have one comment before discussing cvParams though: where is the > rationale for having "referenceable" paramGroups? I'm not disagreeing > with the idea, I think it's good, but it does need a rationale because > it's not typical XML practice. For example, why not use the xlink > standard to do the referencing? Also, do we guarantee the order of the > elements so that "referenceableParamGroupList" is always known to come > before the first "run" element (which if I read correctly is the first > element to make use of "paramGroupRef"s)? The order of the elements is fixed. ReferenceableParamGroups can be referenced from any 'normal' paramgroup (which consists of any number of such refs, user params and cv params), as is clearly evident from the schema and schemadoc. > As for attributes vs. cvParams, I have a compromise to propose between > methods A, B and C. I earlier proposed an extension to the structure of > the CV which would be intended to force format writers to use certain > well-defined values instead of whatever kind of capitalization and > spacing they wish. That proposal still stands and I'd like to hear > feedback on it. This is of no use. It immediately breaks down in the face of synonyms. Accession numbers are the way to go. Everybody in the life sciences knows and understands this principle ('9606' is 'human' or 'Homo sapiens' or 'man' or ...) > But I think we should agree on some basic requirements and then evaluate > proposals from there (this was probably done in one of your meetings or > teleconferences, but I don't recall such a requirements list being > posted on this mailing list).
According to the specification document, > there is a requirement to have a long-term, unchanging specification, > mainly due to vendor interests it seems, which of course in the changing > field of MS also means a requirement of a companion CV. I happen to > agree with the idea of having a long-term, unchanging specification with > a CV, even though I don't intend to use the CV very much, if at all. That would make for very poor mzML documents then, as we semantically validate these files now (see the semantic validator in the beefier mzML kit). Your CV-less files would surely not validate, and would NOT be mzML files. > From a previous post by Eric Deutsch in this thread: > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LCQ Deca"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LCQ DECA"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LTQ FT"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LTQ-FT"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LTQFT"/> > OK, so because of this legitimate concern we have another requirement: > the spec must allow defining a restricted value set for categories like > "instrument model." Sorry, but you are erroneously jumping to conclusions. The CV allows children to be added dynamically, correct usage of these can be validated and the list of children can be updated on-the-fly from web resources like the OLS (which auto-update every night). > I do not see a reason for a requirement that the > spec must use accession numbers to enumerate those values. Consider, for > example, that we have not specified whether the cvLabel parameter is > case sensitive or not. Suppose a naughty writer starts using lowercase > instead of uppercase for the cvLabel, or for the cvLabel prefix on the > accession number. 
Even worse, suppose the case sensitivity between the > accession number's prefix and the cvLabel don't match. The best we can > do is specify things like case sensitivity for these issues or force a > certain case in certain contexts. We can't prevent people from writing > broken instances of the specification. Again, you fail to see the point. The correct usage of CV terms can be validated. So if you mistype a number or its prefix, this will be considered an error. We need numbers because we want to be able to deal with synonyms (or even outright changes in the term names; it has happened before). Numbers are robust, numbers are convenient, numbers are strong. Text is not. > Based on the above requirement, one concern that I have (and I think > many others do too, because frankly I get a strong impression that many > people who want to use this spec don't care about being CV aware) is > that a writer should be able to write a cvParam with a value that is not > in the allowed value set of the CV without making readers have no clue > what the value is actually indicating. In other words, regardless of > whether a reader is CV aware or not, a (machine OR human) reader should > be able to glean the purpose of an unknown value in a cvParam via some > kind of category specification (e.g. "instrument model", or by the > category's accession number). If this is accepted as a requirement, it > practically eliminates method A as an option because it provides no > indication of what category the unknown cvParam's value belongs to. There is the option to include userparams. Alternatively, you take the productive approach and signal the need to add the term to the CV. Remember that powerful and extremely user-friendly tools like the OLS take care of updating new terms for you fully automatically. If you need to know the context of a term, referring to the CV should be your first and most prominent approach.
> There are perhaps other requirements for the cvParam, but I'll let > others fill them in. My new proposed compromise is to split values into > a valueAccession and a valueName, just like the optional unitAccession > and unitName. The two value attributes would not be optional like the > unit attributes, though. A special CV accession number could be > allocated to indicate an "unrestricted" value, in which case the reader > would use the valueName as the value. Alternatively, the reader could > read the accession attribute (which in this compromise would always > indicate a category's accession number) and choose based on that whether > to look up the valueAccession in the CV or to use the valueName > verbatim. So the SRM spectrum example would become: > <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" > valueAccession="MS:1000583" valueName="SRM spectrum"/> For various complex reasons, this will wreak havoc. Because now the two (accession and value accession) run the (unnecessary!) risk of being able to go out of sync. I seem to read in your comments so far that there is a certain reluctance to the use of CV terms because this is new, and doesn't fit well with what you are good at right now. I would ask that you have a look at CVs on OLS (http://www.ebi.ac.uk/ols), and read the developer documentation on how to access the OLS web services using your favourite programming language. After playing with it a bit, you'll notice that incorporating CVs into the parsing is not that much work, yet yields very clear benefits. Cheers, lnnrt. |
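Lennart's '9606' example ("human", "Homo sapiens", "man" all naming one species) is the crux of the accession-vs-text argument: identifier comparison survives synonymy, string comparison does not. A toy sketch; the synonym table is illustrative, not the real NCBI taxonomy:

```python
# Why accessions beat free text: every synonym spelling maps to one id.
# '9606' is the NCBI taxonomy id for human that Lennart cites; the
# synonym table itself is a toy for illustration.
SYNONYM_TO_ID = {
    "human": "9606",
    "homo sapiens": "9606",
    "man": "9606",
}

def same_concept(a, b):
    """Compare via identifiers rather than raw strings."""
    ia = SYNONYM_TO_ID.get(a.lower())
    ib = SYNONYM_TO_ID.get(b.lower())
    return ia is not None and ia == ib

print("human" == "Homo sapiens")              # False: text is fragile
print(same_concept("human", "Homo sapiens"))  # True: ids are robust
```

The same reasoning applies to CV term names that get renamed over time: the accession stays stable while the label changes.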
From: Marc S. <st...@in...> - 2007-10-04 08:06:21
|
Hi all, first of all I would like to thank Eric and all the others in the working group for their effort. Here are my comments: (1) The new CV term problem A is clear and simple. B is simply a bad idea in my opinion. Why not use the child accession if we have it? C helps the software to know where the new term belongs, but the software does not know what to do with it in most cases. I think most software implements these enum-like CV terms as enum types and thus cannot handle new values anyway. Additionally it is error prone (mismatching parent and child). As C is an extension of A, I vote for A or C, but I don't think that C helps very much. (2) Semantic validator The semantic validator is a nice feature, but I think you must publish a file that defines the mapping of CV terms to the schema. This file must answer questions like: Where can I use which term? How often can I repeat a term? etc. With the heavy use of CV terms such a file is a non-optional part of the format definition. What happened to that format Luisa proposed? (3) Comments to CV / Schema - The term MS:1000543 "data processing action" is missing some child terms, I think. What about smoothing, baseline reduction and removal of low-intensity data points? - Putting the software name in a CV will cause much trouble, I think. There are way too many upcoming tools and you will be constantly updating that obo file. I really think we should put that into a string attribute. - I would add a new optional and unbounded element "parameter" with attributes "name", "type", "value" to the dx:dataProcessing element to store the parameters of the software that were used for processing. (4) General Finally I'd like to say that I agree with Brian Pratt. There is too much CV and too little XML in the format for my taste. I don't argue against CV in general; it's a nice technique that allows the schema to be stable for a long time. But now everything is in the CV and there are hardly any XML attributes left.
This makes the format hard to implement and impossible to check with an XML validator. And I don't see the advantage in most cases: I have to adapt the software to new terms just as I would adapt it to new XML elements. Best regards, Marc |
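Marc's point about enum-like CV terms can be put in miniature: an enum-style reader parses an unknown term without crashing, but still needs a code change before it can do anything meaningful with it. The accessions below are toy values for illustration, not guaranteed to match the real PSI-MS CV.

```python
# Sketch of the enum-style handling Marc describes. A new CV term lands
# in the fallback branch until the software is updated.
from enum import Enum

class SpectrumType(Enum):
    MS1 = "MS:1000579"  # toy accessions, illustrative only
    MSN = "MS:1000580"

def classify(accession):
    try:
        return SpectrumType(accession)
    except ValueError:
        # Unknown term: parseable, but the software cannot act on it.
        return None

print(classify("MS:1000580"))
print(classify("MS:1000999"))  # a newly minted term falls through
```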
From: Brian P. <bri...@in...> - 2007-10-03 16:11:27
|
Looks like most commenting happens on this list, so here goes: From the spec: "The mzData format was a far more flexible format than mzXML. The support of new technologies could be added to mzData files by adding new controlled vocabulary terms, while mzXML often required a full schema revision. This is evidenced by mzData still at version 1.05 while mzXML is currently at version 3.1. However, mzData did suffer from a problem of inconsistently used vocabulary terms and there appeared several different dialects of mzData, encoding the same information in subtly different ways. This was not usually a problem for human inspection of the file, but caused difficulty writing and maintaining reader software." This is specious. The fact that mzData hasn't revved only says to me that it's badly underspecified, which the paragraph in fact goes on to illustrate. The occasional revision of the mzXML schema, to my mind, indicates a well maintained standard*. A stable schema and evolving ontology produce as much or more reader/writer code maintenance work as an evolving schema-only does. It's not like mzData readers don't have to be updated every time something gets added to the ontology. At least with a schema there are ways to generate code for these kinds of changes automatically, and to easily validate the results. Frankly when it comes to data formats I think the term "flexible" is synonymous with "trouble" - convenient for the writers, hell for the readers, and often a dead end for that reason. I really think mzML will just perpetuate the issues mzData presented. Better we should figure out a way to generate a proper XML schema based on the ontology document. The rest of the world uses proper XML, I really don't see what makes us special. Well, hey, you asked. - Brian *note that most of the mzXML revisions had to do with things like adding data compression to peaklists.
It wasn't getting banged around every time somebody came out with a new mass spec, like the ontology will. _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch Sent: Tuesday, October 02, 2007 3:32 PM To: psi...@li... Cc: Eric Deutsch Subject: [Psidev-ms-dev] mzML 0.99.0 submitted to document process Hi everyone, I am happy to announce that the mzML 0.99.0 specification document has been submitted to the PSI document process. This is an important milestone in the completion of mzML, but it is most certainly not the end of development and feedback. The specification document and all related materials are publicly available at: http://psidev.info/index.php?q=node/257 There are various kits of instance documents, xsds, the controlled vocabulary, validators, etc. listed at that site. Please examine and respond. The actual specification document is posted at: http://psidev.info/index.php?q=node/300 You may post comments at that site, or you may send them to this list. We addressed nearly all issues brought up in the preview period in August. The one main issue that remains unresolved is the problem of cvParams and how to handle the inevitable scenario of new terms and older software. This is an important issue. There is a discussion of it in the specification document. Your input is sought. We encourage you to begin developing (or adapting) software that implements the format if you are comfortable knowing that there will be changes before the 1.0.0 release. I believe that it is primarily by attempting to implement the format that the community will test the format most rigorously and reveal issues that still need to be resolved; this is far more effective than gazing at the specification document. Regards, Eric ---------------------------------- Eric Deutsch, Ph.D. Institute for Systems Biology 1441 North 34th Street Seattle WA 98103 Tel: 206-732-1397 Fax: 206-732-1260 Email: ede...@sy... 
WWW: http://www.systemsbiology.org/Senior_Research_Scientists/Eric_Deutsch |
From: Matthew C. <mat...@va...> - 2007-10-03 15:34:18
|
Hi all, Time to reopen this can of worms! I like the specification document. It's clearly written. Unfortunately there is no clear way that I know of to capture the semantically valid cvParam relationships in a flat written document, but that can be done externally and it doesn't bother me. I have one comment before discussing cvParams though: where is the rationale for having "referenceable" paramGroups? I'm not disagreeing with the idea, I think it's good, but it does need a rationale because it's not typical XML practice. For example, why not use the xlink standard to do the referencing? Also, do we guarantee the order of the elements so that "referenceableParamGroupList" is always known to come before the first "run" element (which if I read correctly is the first element to make use of "paramGroupRef"s)? As for attributes vs. cvParams, I have a compromise to propose between methods A, B and C. I earlier proposed an extension to the structure of the CV which would be intended to force format writers to use certain well-defined values instead of whatever kind of capitalization and spacing they wish. That proposal still stands and I'd like to hear feedback on it. But I think we should agree on some basic requirements and then evaluate proposals from there (this was probably done in one of your meetings or teleconferences, but I don't recall such a requirements list being posted on this mailing list). According to the specification document, there is a requirement to have a long-term, unchanging specification, mainly due to vendor interests it seems, which of course in the changing field of MS also means a requirement of a companion CV. I happen to agree with the idea of having a long-term, unchanging specification with a CV, even though I don't intend to use the CV very much, if at all. 
From a previous post by Eric Deutsch in this thread: <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ Deca"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ DECA"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ FT"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ-FT"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQFT"/> OK, so because of this legitimate concern we have another requirement: the spec must allow defining a restricted value set for categories like "instrument model." I do not see a reason for a requirement that the spec must use accession numbers to enumerate those values. Consider, for example, that we have not specified whether the cvLabel parameter is case sensitive or not. Suppose a naughty writer starts using lowercase instead of uppercase for the cvLabel, or for the cvLabel prefix on the accession number. Even worse, suppose the case sensitivity between the accession number's prefix and the cvLabel don't match. The best we can do is specify things like case sensitivity for these issues or force a certain case in certain contexts. We can't prevent people from writing broken instances of the specification. Based on the above requirement, one concern that I have (and I think many others do too, because frankly I get a strong impression that many people who want to use this spec don't care about being CV aware) is that a writer should be able to write a cvParam with a value that is not in the allowed value set of the CV without making readers have no clue what the value is actually indicating. In other words, regardless of whether a reader is CV aware or not, a (machine OR human) reader should be able to glean the purpose of an unknown value in a cvParam via some kind of category specification (e.g. "instrument model", or by the category's accession number). 
If this is accepted as a requirement, it practically eliminates method A as an option because it provides no indication of what category the unknown cvParam's value belongs to. There are perhaps other requirements for the cvParam, but I'll let others fill them in. My new proposed compromise is to split values into a valueAccession and a valueName, just like the optional unitAccession and unitName. The two value attributes would not be optional like the unit attributes, though. A special CV accession number could be allocated to indicate an "unrestricted" value, in which case the reader would use the valueName as the value. Alternatively, the reader could read the accession attribute (which in this compromise would always indicate a category's accession number) and choose based on that whether to look up the valueAccession in the CV or to use the valueName verbatim. So the SRM spectrum example would become: <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" valueAccession="MS:1000583" valueName="SRM spectrum"/> I like ketchup on my worms, how bout you? -Matt Chambers Vanderbilt MSRC For reference, AFAIK this is the last post in this thread: Joshua Tasman wrote: > Hi all, > > Actually, I agree that we'd be better served if more structure was > applied at the xml schema level, but since design decisions have > already been made and it seems we're past the point of changing them, > I think we should stick to a consistent flavor. > > I'd propose finding most instances in the schema where attributes and > values are defined by the xml schema and replacing them with cvParams. > If we're reliant on the OBO, let's completely get away from any > parsing of human-readable elements. In the OBO, we already have > inconsistent capitalization for source file types: "mzData File" vs > "wiff file". Let's simplify things and rely on the nice clean accession. 
> > From a look through the instance document, some examples:
> >
> > I'd like to see sourceFileType as a sub cvParam with a specific
> > accession reference, vs an attribute:
> > <sourceFile id="1" sourceFileName="tiny1.RAW"
> > sourceFileLocation="file://F:/data/Exp01" sourceFileType="Xcalibur RAW file">
> >
> > contactInfo could use value'd cvParams for name, institution, etc., or
> > any other added features like email, phone, etc.
> >
> > fileChecksum's type should be a cv accession, instead of:
> > <fileChecksum type="Sha1">
> >
> > In spectrum, spectrumType should be a cvParam, not an attribute:
> > <spectrum id="S19" scanNumber="19" spectrumType="MSn" msLevel="1">
> >
> > In binaryDataArray, attributes compressionType and dataType should be
> > cvParams:
> > <binaryDataArray dataType="64-bit float" compressionType="none"
> > arrayLength="43" encodedLength="5000" dataProcessingRef="Xcalibur Processing">
> >
> > Josh
|
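The reader-side logic of Matt's valueAccession/valueName compromise can be sketched briefly. This is a minimal illustration, not proposed spec text: the "unrestricted" accession MS:9999999 and the in-memory term table are hypothetical stand-ins, since no such accession was actually allocated in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical accession marking an "unrestricted" (free-text) value,
# as floated in the compromise above; NOT a real PSI-MS term.
UNRESTRICTED = "MS:9999999"

def read_cv_value(param, cv_terms):
    """Return the effective value of a cvParam under the proposed
    valueAccession/valueName split. A CV-aware reader prefers the
    canonical spelling from the CV; everyone else falls back to
    valueName and still knows the category from name/accession."""
    acc = param.get("valueAccession")
    name = param.get("valueName")
    if acc == UNRESTRICTED or acc not in cv_terms:
        return name          # CV-unaware (or unknown term): use name verbatim
    return cv_terms[acc]     # CV-aware: canonical spelling wins

xml = ('<cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" '
       'valueAccession="MS:1000583" valueName="SRM spectrum"/>')
param = ET.fromstring(xml)
cv = {"MS:1000583": "SRM spectrum"}  # illustrative term table
print(read_cv_value(param, cv))  # -> SRM spectrum
```

Note that even when the valueAccession is unknown, the reader still learns *what kind* of value it is holding from the category's name and accession, which is the requirement argued for above.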
From: Eric D. <ede...@sy...> - 2007-10-02 22:32:26
|
Hi everyone, I am happy to announce that the mzML 0.99.0 specification document has been submitted to the PSI document process. This is an important milestone in the completion of mzML, but it is most certainly not the end of development and feedback.

The specification document and all related materials are publicly available at:

http://psidev.info/index.php?q=node/257

There are various kits of instance documents, xsds, the controlled vocabulary, validators, etc. listed at that site. Please examine and respond.

The actual specification document is posted at:

http://psidev.info/index.php?q=node/300

You may post comments at that site, or you may send them to this list. We addressed nearly all issues brought up in the preview period in August. The one main issue that remains unresolved is the problem of cvParams and how to handle the inevitable scenario of new terms and older software. This is an important issue. There is a discussion of it in the specification document. Your input is sought.

We encourage you to begin developing (or adapting) software that implements the format if you are comfortable knowing that there will be changes before the 1.0.0 release. I believe that it is primarily by attempting to implement the format that the community will test the format most rigorously and reveal issues that still need to be resolved; this is far more effective than gazing at the specification document.

Regards,
Eric

----------------------------------
Eric Deutsch, Ph.D.
Institute for Systems Biology
1441 North 34th Street
Seattle WA 98103
Tel: 206-732-1397
Fax: 206-732-1260
Email: ede...@sy...
WWW: http://www.systemsbiology.org/Senior_Research_Scientists/Eric_Deutsch
|
From: Matthew C. <mat...@va...> - 2007-08-08 19:36:25
|
> -----Original Message-----
> From: psi...@li... [mailto:psidev-ms-dev-
> bo...@li...] On Behalf Of Joshua Tasman
> Sent: Wednesday, August 08, 2007 1:58 PM
> To: psi...@li...
> Subject: [Psidev-ms-dev] attributes vs cvParams
>
> Hi all,
>
> Actually, I agree that we'd be better served if more structure was
> applied at the xml schema level, but since design decisions have already
> been made and it seems we're past the point of changing them, I think we
> should stick to a consistent flavor.

I'm not terribly concerned about the flavor of the XML I consume, and I don't feel strongly one way or the other about most of the cvParam/schema issues. I do feel strongly that parsers should not be required to look at the CV to get basic meaning out of the file.

> I'd propose finding most instances in the schema where attributes and
> values are defined by the xml schema and replacing them with cvParams.
> If we're reliant on the OBO, let's completely get away from any parsing
> of human-readable elements. In the OBO, we already have inconsistent
> capitalization for source file types: "mzData File" vs "wiff file".
> Let's simplify things and rely on the nice clean accession.
>
> From a look through the instance document, some examples:
>
> I'd like to see sourceFileType as a sub cvParam with a specific accession
> reference, vs an attribute:
> <sourceFile id="1" sourceFileName="tiny1.RAW"
> sourceFileLocation="file://F:/data/Exp01" sourceFileType="Xcalibur RAW
> file">

I'm happy with:

<sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01">
  <cvParam cvLabel="MS" accession="MS:xxxxxxx" name="Source file type" value="Xcalibur RAW file" />
</sourceFile>

This must be accompanied by adding specific valid values to the ontology, not just unique accession numbers.
I am not happy with:

<sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01">
  <cvParam cvLabel="MS" accession="MS:xxxxxxx" name="Xcalibur RAW file" value="" />
</sourceFile>

The idea of values being represented as unique accession numbers is against common sense and possibly carcinogenic. ;)

> contactInfo could use value'd cvParams for name, institution, etc, or
> any other added features like email, phone, etc.
>
> fileChecksum's type should be a cv accession, instead of:
> <fileChecksum type="Sha1">

What exactly are you suggesting here?

<fileChecksum accession="MS:xx(sha1)xx">71be39fb2700ab2f3c8b2234b91274968b6899b1</fileChecksum>

Or

<fileChecksum>71be39fb2700ab2f3c8b2234b91274968b6899b1<cvParam cvLabel="MS" accession="MS:xx(checksumType)xx" name="Checksum type" value="Sha1" /></fileChecksum> <!-- ewwww -->

Or

<fileChecksum>71be39fb2700ab2f3c8b2234b91274968b6899b1<cvParam cvLabel="MS" accession="MS:xx(sha1)xx" name="Sha1" value="" /></fileChecksum> <!-- double ewww! -->

I don't think any of these is better than leaving it as an attribute (and possibly giving the checksum type attribute a schema type instead of putting it in the ontology). I don't think the cvParam paradigm works well on elements which only have text nodes for children or which have no children at all.

> In spectrum, spectrumType should be a cvParam, not an attribute:
> <spectrum id="S19" scanNumber="19" spectrumType="MSn" msLevel="1">

I agree with this one.

> In binaryDataArray, attributes compressionType and dataType should be
> cvParams:
> <binaryDataArray dataType="64-bit float" compressionType="none"
> arrayLength="43" encodedLength="5000" dataProcessingRef="Xcalibur
> Processing">

I agree with this as well.

-Matt
|
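However the checksum type ends up being encoded, the digest itself is straightforward to produce on the writer side. A minimal sketch of computing the hex SHA-1 string that populates the text node of <fileChecksum> (the path and chunk size are arbitrary; this is illustrative, not part of any mzML specification):

```python
import hashlib

def file_checksum_sha1(path, chunk_size=8192):
    """Stream a file and return the hex SHA-1 digest, i.e. the text a
    writer would emit inside <fileChecksum type="Sha1">...</fileChecksum>.
    Streaming in chunks keeps memory flat even for multi-GB raw files."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The reader side is symmetric: recompute the digest over the referenced source file and compare strings, which works identically whether the type is carried as an attribute or as a cvParam.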
From: Joshua T. <jt...@sy...> - 2007-08-08 18:58:23
|
Hi all,

Actually, I agree that we'd be better served if more structure was applied at the xml schema level, but since design decisions have already been made and it seems we're past the point of changing them, I think we should stick to a consistent flavor.

I'd propose finding most instances in the schema where attributes and values are defined by the xml schema and replacing them with cvParams. If we're reliant on the OBO, let's completely get away from any parsing of human-readable elements. In the OBO, we already have inconsistent capitalization for source file types: "mzData File" vs "wiff file". Let's simplify things and rely on the nice clean accession.

From a look through the instance document, some examples:

I'd like to see sourceFileType as a sub cvParam with a specific accession reference, vs an attribute:

<sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01" sourceFileType="Xcalibur RAW file">

contactInfo could use value'd cvParams for name, institution, etc., or any other added features like email, phone, etc.

fileChecksum's type should be a cv accession, instead of:

<fileChecksum type="Sha1">

In spectrum, spectrumType should be a cvParam, not an attribute:

<spectrum id="S19" scanNumber="19" spectrumType="MSn" msLevel="1">

In binaryDataArray, attributes compressionType and dataType should be cvParams:

<binaryDataArray dataType="64-bit float" compressionType="none" arrayLength="43" encodedLength="5000" dataProcessingRef="Xcalibur Processing">

Josh
|
From: Mike C. <tu...@gm...> - 2007-08-08 16:24:34
|
On 8/8/07, Matt Chambers <mat...@va...> wrote:
> > This does require a CV class and some methods:
> > cv.loadFromFile()
> > cv.isChildOf()
> > cv.getName()
> >
> > but this is not really complicated.
>
> But it is really relatively complicated. It is more conceptually and
> computationally complicated than simple string comparison (with the
> OPTION of checking the CV to see if the value is a controlled one). And
> worse, it's a complication I don't see a justification for unless there
> is a better reason than the one you gave above which has a more simple
> solution.

I agree with Matt. A call like "isChildOf" looks simple, but what's entailed in that call is that the *correct* CV is available and has been parsed into a tree in memory. There are good reasons to think that this will be fairly difficult to do correctly in practice.

But on top of that, it just seems needlessly difficult. It'd be a little like having products in your grocery store marked with their trademark name, but not a succinct description of what they *are*--which you can only find out with a stock list lookup. ("Shimmer? Is that a floor polish or a dessert topping? Hope my stock list is up to date...")

The alternative here would appear to be very simple. Something like the previously mentioned

<cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ-FT"/>

would work fine. As for the differing spellings of "LTQ-FT", there's a canonical spelling available in the CV, and anyone who can't get that right will probably find the complexity of multiple CV versions insurmountable.

Consider also: how should newly created instruments be handled? If our lab invents the "MassMaster2000", do we need to create our own augmented CV in order to handle this? Does everyone who wants to read MassMaster2000 mzML files need a copy of this augmented CV? What if they have twenty other augmented CVs? How are those to be managed?

Mike
|
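For scale, the CV helper that Eric's pseudocode assumes (loadFromFile / isChildOf / getName) is not enormous, but it is real machinery every reader would have to carry, along with the correct CV file. A minimal sketch follows; the OBO snippet is hand-written from terms quoted in this thread, and only the small subset of OBO syntax needed here is parsed:

```python
class CV:
    """Tiny in-memory controlled vocabulary: names plus is_a links."""
    def __init__(self):
        self.names = {}    # accession -> term name
        self.parents = {}  # accession -> set of parent accessions

    def load_from_text(self, obo_text):
        # Handles only the id:/name:/is_a: lines of an OBO [Term] stanza.
        term_id = None
        for line in obo_text.splitlines():
            line = line.strip()
            if line.startswith("id: "):
                term_id = line[4:]
            elif line.startswith("name: ") and term_id:
                self.names[term_id] = line[6:]
            elif line.startswith("is_a: ") and term_id:
                parent = line[6:].split(" ")[0]  # drop trailing "! comment"
                self.parents.setdefault(term_id, set()).add(parent)

    def get_name(self, acc):
        return self.names.get(acc)

    def is_child_of(self, acc, ancestor):
        """Walk is_a links upward; True if `ancestor` is reachable."""
        stack, seen = list(self.parents.get(acc, ())), set()
        while stack:
            p = stack.pop()
            if p == ancestor:
                return True
            if p not in seen:
                seen.add(p)
                stack.extend(self.parents.get(p, ()))
        return False

OBO = """\
[Term]
id: MS:1000031
name: instrument model

[Term]
id: MS:1000554
name: LCQ Deca
is_a: MS:1000031 ! instrument model
"""

cv = CV()
cv.load_from_text(OBO)
print(cv.get_name("MS:1000554"))                   # -> LCQ Deca
print(cv.is_child_of("MS:1000554", "MS:1000031"))  # -> True
```

The code itself is short; Mike's point stands that the hard part is operational — making sure the *right* version of the CV is on hand for every file ever read.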
From: Brian P. <bri...@in...> - 2007-08-08 15:49:39
|
If ionSelection is just one of many things that are too complicated and varied and dynamic to actually specify, then just off the top of my head I think it's going to be pretty hard to do a good job of parsing mzML. I take your point about mzXML being too specific, but there's such a thing as too general as well. My fear is that we'll see it balkanized, with most parsers only really able to deal with the mode of mzML usage that the author really cares about, which just leaves us with a bunch of ad hoc standards.

The instrument name example (wherein a parser cannot be made robust enough to read future versions) makes me think that not enough mental energy has gone into considering the practicalities of being a consumer of mzML. I've seen this in other standards efforts I've been involved with in other industries (internet security, circuit board manufacturing) - writers (mostly hardware vendors) love the flexibility because they can just do it their way, but readers (software vendors) bear the brunt of what amounts to one format per vendor, and finally just fall back onto the per-vendor solutions they have already invested in.

> it is the same amount of work as if everything was in the schema.

There actually *is* an advantage of specifying via schema instead of ontology, which I've already pointed out - W3C schema is itself a standard with a host of tools built up around it that will generate readers and writers from properly formed schemas. If mzML just used elements for everything and each element had an attribute pointing at the ontology I think we'd be better off. The schema and the ontology would need to evolve together, of course.

But, as you say, this thing is more or less nailed down at this point, so I'm wasting the list's time with this schema talk, and I do apologise. I don't blame anyone for being annoyed at me dredging up these fundamental objections yet again so late in the process. Anyway, off for vacation until the end of next week.
Sorry to start a flame then abandon it.

Cheers,

Brian

_____

From: del...@gm... [mailto:del...@gm...] On Behalf Of Angel Pizarro
Sent: Wednesday, August 08, 2007 6:01 AM
To: Brian Pratt
Cc: psi...@li...
Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value

On 8/7/07, Brian Pratt <bri...@in...> wrote:

> Hi Angel, If I understand your question to be about identifying current
> mismatches between terminology in the schema and the ontology, I'm not
> sure there are any - but probably only because the schema has so little
> actual terminology in it.

My question was more of a pragmatic one, about where you would add specificity into the mzML schema. Your selectionWindow example below is a good one, in that the specification of selectionWindow is probably a range value and we should have two sub-elements corresponding to the cvParam values that define the window (or just a well-defined range sub-element, skipping cvParam altogether). I don't think your second example is a good one though: since there are so many permutations of an ionSelection protocol, and more are certainly on the way, it is better handled by an ontology specification. Yes, this does make parsers slightly harder, since now you must pay attention to the incoming ontology, but it is the same amount of work as if everything was in the schema.

mzXML could get away with tight specification of these complex and changing annotations, since its sole purpose was support of the ISB pipeline. Its open source status only served to increase the user base, but the schema changes were solely driven by the needs of that pipeline and solely by the community that used it.
Trying to build consensus across many different groups has led to the current version of mzML, and the major structure of mzML will not change at this point, so please let's just get to the specifics of going through the schema and identifying where you think an annotation should be promoted to the level of a schema element, and we'll discuss as a group.

-angel

> Consider this example:
>
> <xs:element name="selectionWindow" maxOccurs="unbounded">
>   <xs:complexType>
>     <xs:sequence>
>       <xs:element name="cvParam" type="dx:CVParamType" minOccurs="2" maxOccurs="unbounded"/>
>     </xs:sequence>
>   </xs:complexType>
> </xs:element>
>
> which says absolutely nothing at all about what a selectionWindow element can be expected to contain when you encounter it. It just says it will contain at least two "parameters". Not much of an aid to software development.
>
> The schema, if we can call it that, doesn't even specify what some of the most fundamental information about a scan looks like. For example, it specifies that a scan may have a list of precursors, each of which will contain an ionSelection, but stops short of telling you what an ionSelection looks like:
>
> <xs:element name="ionSelection" type="dx:ParamGroupType">
>   <xs:annotation>
>     <xs:documentation>This captures the type of ion selection being performed, and trigger m/z (or m/z's), neutral loss criteria etc. for tandem-MS or data dependent scans.</xs:documentation>
>   </xs:annotation>
> </xs:element>
>
> Nearly all the details of nearly all the elements are just unspecified blobs. Normally with an XML format you can expect to at least start your work by running it through something like XMLSpy that will autogenerate a reader and a writer that you can then polish up (to handle, for example, the necessary weirdness of base64+zlib in the peaklists). But with this, you get no kind of a head start at all, since the vast majority of the syntax is hidden behind blobs like dx:CVParamType and dx:ParamGroupType. It's just not a specification.
The statement that led to your question, I think, was just me saying that if we *did* create an actual schema, we'd want its terminology to agree with the ontology wherever possible. But it has to actually contain some terminology, unlike the current schema.

Brian

_____

From: del...@gm... [mailto:del...@gm...] On Behalf Of Angel Pizarro
Sent: Tuesday, August 07, 2007 1:10 PM
To: Brian Pratt
Cc: psi...@li...
Subject: Re: [Psidev-ms-dev] cvParams using name attribute as value

On 8/7/07, Brian Pratt <bri...@in...> wrote:

> Hey, the horse just twitched: by placing CVparam information in
> attributes of the elements of a conventionally structured XML schema
> (ala mzXML) we can make use of the OBO work without adding a lot of
> unwanted complexity to software systems that aren't really interested
> in it. An mzML that integrates well with OBO-aware systems is an
> excellent idea, but an mzML that demands you BE an OBO-aware system
> seems less likely to achieve widespread adoption.

Can you name specific attributes that you want to have cv terms be the value for that are currently not in the schema?

-angel

_______________________________________________
Psidev-ms-dev mailing list
Psi...@li...
https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736 F: 215-573-9004
|
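The "necessary weirdness of base64+zlib in the peaklists" that Brian mentions is at least mechanical. A minimal round-trip sketch for a peak array, assuming little-endian 64-bit floats (the byte order is an assumption here, not something this thread pins down):

```python
import base64
import struct
import zlib

def encode_peak_array(values):
    """Writer side: pack little-endian 64-bit floats, zlib-compress,
    then base64 -- the payload of a <binaryDataArray> text node."""
    raw = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

def decode_peak_array(text, compressed=True):
    """Reader side: the inverse. `compressed` corresponds to whatever
    the compressionType annotation says for this array."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))

mz = [445.12, 446.13, 447.14]
assert decode_peak_array(encode_peak_array(mz)) == mz  # lossless round trip
```

Because the floats are stored as full 64-bit IEEE doubles, the round trip is exact; the only reader-side decisions are byte order, float width, and whether the array is compressed, which is precisely the metadata the dataType/compressionType annotations carry.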
From: Matt C. <mat...@va...> - 2007-08-08 13:24:53
|
Eric Deutsch wrote:
> The decision was made to make individual models cv terms to avoid
> problems like:
>
> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ Deca"/>
> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ DECA"/>
> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ FT"/>
> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ-FT"/>
> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQFT"/>

Is this the main/only reason for this usage of terms? This just seems like a great argument for having the ontology control the values of the terms and not just the terms themselves. That way, the simple term/name->value relationship is always maintained, and this problem is eliminated. I am not advocating changing the structure of mzML at this point; I see this as a rather minor change.

> I would argue that your code snippet below would better look like:
>
> #define MS_CV_POLARITY_TYPE "MS:1000037"
>
> if( element.parent == "spectrumDescription" ) {
>   for each child {
>     if (child.name == "cvParam") then {
>       if( cv.isChildOf(child.attrs['accession'], MS_CV_POLARITY_TYPE) )  // if a polarity type
>         spectrum.polarity = cv.getName(child.attrs['accession']);
>     }
>   }
> }
>
> Note that the cvParam name (should that be "positive" or "Positive" or
> "positive polarity" or "Polarity" or "polarity"?) is not in the code,
> just MS:1000037, which can be considered final.
>
> This does require a CV class and some methods:
>
> cv.loadFromFile()
> cv.isChildOf()
> cv.getName()
>
> but this is not really complicated.

But it is really relatively complicated. It is more conceptually and computationally complicated than simple string comparison (with the OPTION of checking the CV to see if the value is a controlled one).
And worse, it's a complication I don't see a justification for unless there is a better reason than the one you gave above, which has a simpler solution. Why force parsers to create a CV class and methods just to ensure that "LCQ Deca" is spelled right (or that it's given its proper accession number)?

-Matt
|
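The simpler, CV-optional reading that Matt is defending could look like the following. The accession MS:1000037 and the element/attribute names mirror Eric's quoted pseudocode; treat the exact term names as illustrative rather than authoritative:

```python
import xml.etree.ElementTree as ET

POLARITY_ACC = "MS:1000037"  # per Eric's pseudocode; illustrative

def read_polarity(spectrum_description):
    """CV-unaware reading: identify the category by its accession (or
    its human-readable name) and take the value attribute verbatim.
    No ontology file needs to be loaded or walked."""
    for child in spectrum_description.findall("cvParam"):
        if (child.get("accession") == POLARITY_ACC
                or child.get("name") == "polarity"):
            return child.get("value")
    return None

doc = ET.fromstring(
    '<spectrumDescription>'
    '<cvParam cvLabel="MS" accession="MS:1000037" '
    'name="polarity" value="positive"/>'
    '</spectrumDescription>'
)
print(read_polarity(doc))  # -> positive
```

A parser written this way can still *optionally* validate the value against the CV when one is available, which is exactly the trade-off being argued: CV validation as an option for tools that care, not a prerequisite for extracting basic meaning.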
From: Angel P. <an...@ma...> - 2007-08-08 13:04:21
|
On 8/8/07, Eric Deutsch <ede...@sy...> wrote:
> Thank you all for the lively discussion.
>
> One proposal I once made in Lyon (which was roundly dismissed, I believe)
> was something like this: instead of:
>
> <cvParam cvLabel="MS" accession="MS:1000554" name="LCQ Deca" value=""/>
>
> have:
>
> <cvParam cvLabel="MS" parentAccession="MS:1000031" accession="MS:1000554" name="LCQ Deca" value=""/>
>
> Thus the parser can easily be coded to know that any cvParam with a
> parentAccession="MS:1000031" is going to be an instrument model whether or
> not it's in the CV. The mzML semantic validator tool would, of course, check
> all this. The main argument against this was the potential for
> inconsistency, I seem to recall.

The argument was that MAGE v1 did cv terms this way and caused a tremendous amount of confusion for the MAGE producers and the ArrayExpress annotation-checking team alike. It is infinitely easier to deal with nested cvParams than trying to output a term and a parent at the same time.
|