From: Joshua T. <jt...@sy...> - 2007-08-08 18:58:23
|
Hi all, Actually, I agree that we'd be better served if more structure was applied at the xml schema level, but since design decisions have already been made and it seems we're past the point of changing them, I think we should stick to a consistent flavor. I'd propose finding most instances in the schema where attributes and values are defined by the xml schema and replacing them with cvParams. If we're reliant on the OBO, let's completely get away from any parsing of human-readable elements. In the OBO, we already have inconsistent capitalization for source file types: "mzData File" vs "wiff file". Let's simplify things and rely on the nice clean accession. From a look through the instance document, some examples: I'd like to see sourceFileType as a sub cvParam with a specific accession reference, vs attribute: <sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01" sourceFileType="Xcalibur RAW file"> contactInfo could use value'd cvParams for name, institution, etc, or any other added features like email, phone, etc. fileChecksum's type should be a cv accession, instead of: <fileChecksum type="Sha1"> In spectrum, spectrumType should be a cvParam, not attribute: <spectrum id="S19" scanNumber="19" spectrumType="MSn" msLevel="1"> In binaryDataArray, attributes compressionType and dataType should be cvParams: <binaryDataArray dataType="64-bit float" compressionType="none" arrayLength="43" encodedLength="5000" dataProcessingRef="Xcalibur Processing"> Josh |
From: Matthew C. <mat...@va...> - 2007-08-08 19:36:25
|
> -----Original Message----- > From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Joshua Tasman > Sent: Wednesday, August 08, 2007 1:58 PM > To: psi...@li... > Subject: [Psidev-ms-dev] attributes vs cvParams > > Hi all, > > Actually, I agree that we'd be better served if more structure was > applied at the xml schema level, but since design decisions have already > been made and it seems we're past the point of changing them, I think we > should stick to a consistent flavor. I'm not terribly concerned about flavor of the XML I consume, but I don't feel strongly one way or the other about most of the cvParam/schema issues. I do feel strongly that parsers should not be required to look at the CV to get basic meaning out of the file. > I'd propose finding most instances in the schema where attributes and > values are defined by the xml schema and replacing them with cvParams. > If we're reliant on the OBO, let's completely get away from any parsing > of human-readable elements. In the OBO, we already have inconsistent > capitalization for source file types: "mzData File" vs "wiff file". > Let's simplify things and rely on the nice clean accession. > > From a look through the instance document, some examples: > > I'd like to see soureFileType as a sub cvParam with a specific accession > reference, vs attribute: > <sourceFile id="1" sourceFileName="tiny1.RAW" > sourceFileLocation="file://F:/data/Exp01" sourceFileType="Xcalibur RAW > file"> I'm happy with: <sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01"> <cvParam cvLabel="MS" accession="MS:xxxxxxx" name="Source file type" value="Xcalibur RAW file" /> </sourceFile> This must be accompanied by adding specific valid values to the ontology, not just unique accession numbers. I am not happy with: <sourceFile id="1" sourceFileName="tiny1.RAW" sourceFileLocation="file://F:/data/Exp01"> <cvParam cvLabel="MS" accession="MS:xxxxxxx" name="Xcalibur RAW file" value="" /> </sourceFile> The idea of values being represented as unique accession numbers is against common sense and possibly carcinogenic. ;) > contactInfo could use value'd cvParams for name, institution, etc, or > any other added features like email, phone, etc. > > fileChecksum's type should be a cv accession, instead of: > <fileChecksum type="Sha1"> What exactly are you suggesting here? <fileChecksum accession="MS:xx(sha1)xx">71be39fb2700ab2f3c8b2234b91274968b6899b1</fileChecksum> Or <fileChecksum>71be39fb2700ab2f3c8b2234b91274968b6899b1<cvParam cvLabel="MS" accession="MS:xx(checksumType)xx" name="Checksum type" value="Sha1" /></fileChecksum> <!-- ewwww --> Or <fileChecksum>71be39fb2700ab2f3c8b2234b91274968b6899b1<cvParam cvLabel="MS" accession="MS:xx(sha1)xx" name="Sha1" value="" /></fileChecksum><!-- double ewww! --> I don't think any of these is better than leaving it as an attribute (and possibly giving the checksum type attribute a schema type instead of putting it in the ontology). I don't think the cvParam paradigm works well on elements which only have text nodes for children or which have no children at all. > In spectrum, spectrumType should be an cvParam, not attribute: > <spectrum id="S19" scanNumber="19" spectrumType="MSn" msLevel="1"> I agree with this one. > In binaryDataArray, attributes compressionType and dataType should be > cvParams: > <binaryDataArray dataType="64-bit float" compressionType="none" > arrayLength="43" encodedLength="5000" dataProcessingRef="Xcalibur > Processing"> I agree with this as well. -Matt |
From: Matthew C. <mat...@va...> - 2007-10-03 15:34:18
|
Hi all, Time to reopen this can of worms! I like the specification document. It's clearly written. Unfortunately there is no clear way that I know of to capture the semantically valid cvParam relationships in a flat written document, but that can be done externally and it doesn't bother me. I have one comment before discussing cvParams though: where is the rationale for having "referenceable" paramGroups? I'm not disagreeing with the idea, I think it's good, but it does need a rationale because it's not typical XML practice. For example, why not use the xlink standard to do the referencing? Also, do we guarantee the order of the elements so that "referenceableParamGroupList" is always known to come before the first "run" element (which if I read correctly is the first element to make use of "paramGroupRef"s)? As for attributes vs. cvParams, I have a compromise to propose between methods A, B and C. I earlier proposed an extension to the structure of the CV which would be intended to force format writers to use certain well-defined values instead of whatever kind of capitalization and spacing they wish. That proposal still stands and I'd like to hear feedback on it. But I think we should agree on some basic requirements and then evaluate proposals from there (this was probably done in one of your meetings or teleconferences, but I don't recall such a requirements list being posted on this mailing list). According to the specification document, there is a requirement to have a long-term, unchanging specification, mainly due to vendor interests it seems, which of course in the changing field of MS also means a requirement of a companion CV. I happen to agree with the idea of having a long-term, unchanging specification with a CV, even though I don't intend to use the CV very much, if at all. From a previous post by Eric Deutsch in this thread: <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ Deca"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LCQ DECA"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ FT"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQ-FT"/> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" value="LTQFT"/> OK, so because of this legitimate concern we have another requirement: the spec must allow defining a restricted value set for categories like "instrument model." I do not see a reason for a requirement that the spec must use accession numbers to enumerate those values. Consider, for example, that we have not specified whether the cvLabel parameter is case sensitive or not. Suppose a naughty writer starts using lowercase instead of uppercase for the cvLabel, or for the cvLabel prefix on the accession number. Even worse, suppose the case of the accession number's prefix and the case of the cvLabel don't match. The best we can do is specify things like case sensitivity for these issues or force a certain case in certain contexts. We can't prevent people from writing broken instances of the specification. Based on the above requirement, one concern that I have (and I think many others do too, because frankly I get a strong impression that many people who want to use this spec don't care about being CV aware) is that a writer should be able to write a cvParam with a value that is not in the allowed value set of the CV while still giving readers a clue as to what the value is actually indicating. 
In other words, regardless of whether a reader is CV aware or not, a (machine OR human) reader should be able to glean the purpose of an unknown value in a cvParam via some kind of category specification (e.g. "instrument model", or by the category's accession number). If this is accepted as a requirement, it practically eliminates method A as an option because it provides no indication of what category the unknown cvParam's value belongs to. There are perhaps other requirements for the cvParam, but I'll let others fill them in. My new proposed compromise is to split values into a valueAccession and a valueName, just like the optional unitAccession and unitName. The two value attributes would not be optional like the unit attributes, though. A special CV accession number could be allocated to indicate an "unrestricted" value, in which case the reader would use the valueName as the value. Alternatively, the reader could read the accession attribute (which in this compromise would always indicate a category's accession number) and choose based on that whether to look up the valueAccession in the CV or to use the valueName verbatim. So the SRM spectrum example would become: <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" valueAccession="MS:1000583" valueName="SRM spectrum"/> I like ketchup on my worms, how bout you? -Matt Chambers Vanderbilt MSRC For reference, AFAIK this is the last post in this thread: Joshua Tasman wrote: > Hi all, > > Actually, I agree that we'd be better served if more structure was > applied at the xml schema level, but since design decisions have > already been made and it seems we're past the point of changing them, > I think we should stick to a consistent flavor. > > I'd propose finding most instances in the schema where attributes and > values are defined by the xml schema and replacing them with cvParams. > If we're reliant on the OBO, let's completely get away from any > parsing of human-readable elements. In the OBO, we already have > inconsistent capitalization for source file types: "mzData File" vs > "wiff file". Let's simplify things and rely on the nice clean accession. > > From a look through the instance document, some examples: > > I'd like to see soureFileType as a sub cvParam with a specific > accession reference, vs attribute: > <sourceFile id="1" sourceFileName="tiny1.RAW" > sourceFileLocation="file://F:/data/Exp01" sourceFileType="Xcalibur RAW > file"> > > contactInfo could use value'd cvParams for name, institution, etc, or > any other added features like email, phone, etc. > > fileChecksum's type should be a cv accession, instead of: > <fileChecksum type="Sha1"> > > In spectrum, spectrumType should be an cvParam, not attribute: > <spectrum id="S19" scanNumber="19" spectrumType="MSn" msLevel="1"> > > In binaryDataArray, attributes compressionType and dataType should be > cvParams: > <binaryDataArray dataType="64-bit float" compressionType="none" > arrayLength="43" encodedLength="5000" dataProcessingRef="Xcalibur > Processing"> > > > Josh > |
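To make the valueAccession/valueName compromise above concrete, here is a minimal sketch of what a CV-unaware reader could do with such a cvParam. The attribute names come from Matt's example; the MS:0000000 "unrestricted" accession and the helper name are hypothetical, not anything agreed in this thread.

    import xml.etree.ElementTree as ET

    # Hypothetical accession meaning "free-text value, not enumerated in the CV".
    UNRESTRICTED = "MS:0000000"

    def read_cv_params(xml_text):
        """Yield (category, value text, value accession or None) without any CV lookup."""
        for param in ET.fromstring(xml_text).iter("cvParam"):
            category = param.get("name")             # e.g. "spectrum type"
            value_acc = param.get("valueAccession")  # e.g. "MS:1000583"
            value_name = param.get("valueName")      # e.g. "SRM spectrum"
            if value_acc == UNRESTRICTED:
                value_acc = None  # free-text value; only the human-readable name applies
            yield category, value_name, value_acc

    doc = ('<spectrum><cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" '
           'valueAccession="MS:1000583" valueName="SRM spectrum"/></spectrum>')
    for category, value, acc in read_cv_params(doc):
        print(category, "=", value)  # prints: spectrum type = SRM spectrum

Even without consulting the CV, the reader still recovers "spectrum type = SRM spectrum"; a CV-aware reader can prefer the accession instead.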
From: Lennart M. <len...@gm...> - 2007-10-04 10:24:52
|
Hi Matt, > Time to reopen this can of worms! I like the specification document. > It's clearly written. Unfortunately there is no clear way that I know > of to capture the semantically valid cvParam relationships in a flat > written document, but that can be done externally and it doesn't bother > me. I have one comment before discussing cvParams though: where is the > rationale for having "referenceable" paramGroups? I'm not disagreeing > with the idea, I think it's good, but it does need a rationale because > it's not typical XML practice. For example, why not use the xlink > standard to do the referencing? Also, do we guarantee the order of the > elements so that "referenceableParamGroupList" is always known to come > before the first "run" element (which if I read correctly is the first > element to make use of "paramGroupRef"s)? The order of the elements is fixed. ReferenceableParamGroups can be referenced from any 'normal' paramgroup (which consists of any number of such refs, user params and cv params), as is clearly evident from the schema and schemadoc. > As for attributes vs. cvParams, I have a compromise to propose between > methods A, B and C. I earlier proposed an extension to the structure of > the CV which would be intended to force format writers to use certain > well-defined values instead of whatever kind of capitalization and > spacing they wish. That proposal still stands and I'd like to hear > feedback on it. This is no use. It immediately breaks down in the face of synonyms. Accession numbers are the way to go. Everybody in the life sciences knows and understands this principle ('9606' is 'human' or 'Homo sapiens' or 'man' or ...) > But I think we should agree on some basic requirements and then evaluate > proposals from there (this was probably done in one of your meetings or > teleconferences, but I don't recall such a requirements list being > posted on this mailing list). According to the specification document, > there is a requirement to have a long-term, unchanging specification, > mainly due to vendor interests it seems, which of course in the changing > field of MS also means a requirement of a companion CV. I happen to > agree with the idea of having a long-term, unchanging specification with > a CV, even though I don't intend to use the CV very much, if at all. That would make for very poor mzML documents then, as we semantically validate these files now (see the semantic validator in the beefier mzML kit). Your CV-less files would surely not validate, and would NOT be mzML files. > From a previous post by Eric Deutsch in this thread: > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LCQ Deca"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LCQ DECA"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LTQ FT"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LTQ-FT"/> > <cvParam cvLabel="MS" accession="MS:1000031" name="instrument model" > value="LTQFT"/> > OK, so because of this legitimate concern we have another requirement: > the spec must allow defining a restricted value set for categories like > "instrument model." Sorry, but you are erroneously jumping to conclusions. The CV allows children to be added dynamically, correct usage of these can be validated and the list of children can be updated on-the-fly from web resources like the OLS (which auto-update every night). 
> I do not see a reason for a requirement that the > spec must use accession numbers to enumerate those values. Consider, for > example, that we have not specified whether the cvLabel parameter is > case sensitive or not. Suppose a naughty writer starts using lowercase > instead of uppercase for the cvLabel, or for the cvLabel prefix on the > accession number. Even worse, suppose the case sensitivity between the > accession number's prefix and the cvLabel don't match. The best we can > do is specify things like case sensitivity for these issues or force a > certain case in certain contexts. We can't prevent people from writing > broken instances of the specification. Again, you fail to see the point. The correct usage of CV terms can be validated. So if you mistype a number or its prefix, this will be considered an error. We need numbers because we want to be able to deal with synonyms (or even outright changes in the term names; it has happened before). Numbers are robust, numbers are convenient, numbers are strong. Text is not. > Based on the above requirement, one concern that I have (and I think > many others do too, because frankly I get a strong impression that many > people who want to use this spec don't care about being CV aware) is > that a writer should be able to write a cvParam with a value that is not > in the allowed value set of the CV without making readers have no clue > what the value is actually indicating. In other words, regardless of > whether a reader is CV aware or not, a (machine OR human) reader should > be able to glean the purpose of an unknown value in a cvParam via some > kind of category specification (e.g. "instrument model", or by the > category's accession number). If this is accepted as a requirement, it > practically eliminates method A as an option because it provides no > indication of what category the unknown cvParam's value belongs to. There is the option to include userparams. Alternatively, you take the productive approach and signal the need to add the term to the CV. Remember that powerful and extremely user-friendly tools like the OLS take care of updating new terms for you fully automatically. If you need to know the context of a term, referring to the CV should be your first and most prominent approach. > There are perhaps other requirements for the cvParam, but I'll let > others fill them in. My new proposed compromise is to split values into > a valueAccession and a valueName, just like the optional unitAccession > and unitName. The two value attributes would not be optional like the > unit attributes, though. A special CV accession number could be > allocated to indicate an "unrestricted" value, in which case the reader > would use the valueName as the value. Alternatively, the reader could > read the accession attribute (which in this compromise would always > indicate a category's accession number) and choose based on that whether > to look up the valueAccession in the CV or to use the valueName > verbatim. So the SRM spectrum example would become: > <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" > valueAccession="MS:1000583" valueName="SRM spectrum"/> For various complex reasons, this will wreak havoc. Because now the two (accession and value accession) run the (unnecessary!) risk of being able to go out of sync. I seem to read in your comments so far that there is a certain reluctance to the use of CV terms because this is new, and doesn't fit well with what you are good at right now. 
I would ask that you have a look at CVs on OLS (http://www.ebi.ac.uk/ols), and read the developer documentation on how to access the OLS web services using your favourite programming language. After playing with it a bit, you'll notice that incorporating CVs into the parsing is not that much work, yet yields very clear benefits. Cheers, lnnrt. |
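As an illustration of the kind of lookup described above, the same accession-to-name resolution can also be done against a local copy of the CV rather than the OLS web service. This is a rough sketch only: the psi-ms.obo file name and the flat id/name parsing are assumptions, not anything prescribed in this thread.

    # Build an accession -> preferred-name map from a local OBO file, so that
    # whatever synonym or spelling a writer once used, the accession resolves
    # to a single canonical name.
    def load_obo_terms(path="psi-ms.obo"):
        terms, term_id, in_term = {}, None, False
        with open(path, encoding="utf-8") as obo:
            for raw in obo:
                line = raw.strip()
                if line.startswith("["):                    # new stanza begins
                    in_term, term_id = (line == "[Term]"), None
                elif in_term and line.startswith("id: "):
                    term_id = line[4:]
                elif in_term and term_id and line.startswith("name: "):
                    terms[term_id] = line[6:]
        return terms

    terms = load_obo_terms()
    print(terms.get("MS:1000031"))  # the single preferred name for that accession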
From: Mike C. <tu...@gm...> - 2007-10-04 16:52:41
|
On 10/4/07, Lennart Martens <len...@gm...> wrote: > This is no use. It imnmediately breaks down in the face of synonyms. > Accession numbers are the way to go. Everybody in the life sciences > knows and understands this principle ('9606' is 'human' or 'Homo > sapiens' or 'man' or ...) Hmm. I think what you are saying is that end users are not always able to properly distinguish between canonical *identifiers* (e.g., '9606' or 'human') and descriptive text unless the former happens to look like a meaningless string, such as a string of digits. That may be, but strings of digits have their own problems. It's a lot easier to see that 'humaZ' is probably an invalid identifier than that '9607' is, when looking for (the inevitable) problems. I think that biologists understand the value of having semi-meaningful identifiers. They don't use digit strings for gene identifiers, for example. > That would make for very poor mzML documents then, as we semantically > validate these files now (see the semantic validator in the beefier mzML > kit). Your CV-less files would surely not validate, and would NOT be > mzML files. Hmm. How complex is a minimal valid mzML file? If they're not fairly easy to generate, without knowing much about CV, this seems like a problem. > Sorry, but you are erroneously jumping to conclusions. The CV allows > children to be added dynamically, correct usage of these can be > validated and the list of children can be updated on-the-fly from web > resources like the OLS (which auto-update every night). I'm not sure what this means. A nightly update of terms from the web cannot be on our critical path for processing of spectra. We need to be able to proceed even if the OLS disappears forever. > Again, you fail to see the point. The corrrect usage of CV terms can be > validated. So if you mistype a number or its prefix, this will be > considered an error. We need numbers because we want to be able to deal > with synonyms (or even outright changes in the term names; it has > happened before). Numbers are robust, numbers are convenient, numbers > are strong. Text is not. Actually, it's the other way around. Character strings are robust and convenient, numbers are not. The string 'human' is clearly not equal to 'humaZ'. The string '123' is clearly not equal to the string '0123'. Is the number 123 the same or different than 0123? How about 0 and -0, not to mention 123.4 and 123.40 or 0.999999999999 and 1.0? The use of numbers in a context like this seems to be mostly due to history. They may be a little more convenient for programmers, but that's negligible. > Remember that powerful and extremely user-friendly tools like the OLS > take care of updating new terms for you fully automatically. This phrase "powerful and extremely user-friendly tools" is a little scary. This implies having to learn, debug, etc., another piece of software--one not necessarily under our control. To be truly useful, the spec really has to stand on its own (possibly referencing other specs and data). > I seem to read in your comments so far that there is a certain > reluctance to the use of CV terms because this is new, and doesn't fit > well with what you are good at right now. I would ask that you have a > look at CV's on OLS (http://www.ebi.ac.uk/ols), and readthe developer > documentation on how to access the OLS web services using your favourite > programming language. After playing with it a bit, you'll notice that > incorporating CV's into the parsing is not that much work, yet yields > very clear benefits. 
I don't even have time to keep up with this list, and the benefits of OLS are far from clear. Mike |
From: Matthew C. <mat...@va...> - 2007-10-04 16:12:23
|
Hi Lennart, Lennart Martens wrote: >> As for attributes vs. cvParams, I have a compromise to propose >> between methods A, B and C. I earlier proposed an extension to the >> structure of the CV which would be intended to force format writers >> to use certain well-defined values instead of whatever kind of >> capitalization and spacing they wish. That proposal still stands and >> I'd like to hear feedback on it. > > This is no use. It imnmediately breaks down in the face of synonyms. > Accession numbers are the way to go. Everybody in the life sciences > knows and understands this principle ('9606' is 'human' or 'Homo > sapiens' or 'man' or ...) > I am a mere computer scientist, and to me all characters on computers are numbers. ;) But I know what you are saying, and I have taken that into consideration. That is why my suggestion was for the CV to CONTROL the synonyms and not let the synonyms be written in more than one way in VALID mzML. From a technical perspective, this is no different than controlling the accession numbers. From a practical perspective, I appreciate that some users might not be comfortable with having their options for text-based value attributes be controlled like they are for accession numbers, and if that's the majority perspective then I'm fine with using accession numbers for values. >> But I think we should agree on some basic requirements and then >> evaluate proposals from there (this was probably done in one of your >> meetings or teleconferences, but I don't recall such a requirements >> list being posted on this mailing list). According to the >> specification document, there is a requirement to have a long-term, >> unchanging specification, mainly due to vendor interests it seems, >> which of course in the changing field of MS also means a requirement >> of a companion CV. I happen to agree with the idea of having a >> long-term, unchanging specification with a CV, even though I don't >> intend to use the CV very much, if at all. > > That would make for very poor mzML documents then, as we semantically > validate these files now (see the semantic validator in the beefier > mzML kit). Your CV-less files would surely not validate, and would NOT > be mzML files. > Um, excuse me but I'm perfectly capable of writing and reading valid mzML without using a CV web service or any kind of external validation. It may take a /bit/ of manual effort, but it's entirely possible. Of course, if you go with method A for the cvParams and the manual parser has to have an else/if for every possible value's accession number, then you're talking about a LOT of manual effort. But with method B or C, not much at all. >> From a previous post by Eric Deutsch in this thread: >> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument >> model" value="LCQ Deca"/> >> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument >> model" value="LCQ DECA"/> >> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument >> model" value="LTQ FT"/> >> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument >> model" value="LTQ-FT"/> >> <cvParam cvLabel="MS" accession="MS:1000031" name="instrument >> model" value="LTQFT"/> >> OK, so because of this legitimate concern we have another >> requirement: the spec must allow defining a restricted value set for >> categories like "instrument model." > > Sorry, but you are erroneously jumping to conclusions. 
The CV allows > children to be added dynamically, correct usage of these can be > validated and the list of children can be updated on-the-fly from web > resources like the OLS (which auto-update every night). > I don't understand what you're saying here. Are you saying that we do NOT have a requirement of the spec needing to restrict the values set for a given cvParam category? I don't understand the relevance of the updatability of the CV in this context. >> I do not see a reason for a requirement that the spec must use >> accession numbers to enumerate those values. Consider, for example, >> that we have not specified whether the cvLabel parameter is case >> sensitive or not. Suppose a naughty writer starts using lowercase >> instead of uppercase for the cvLabel, or for the cvLabel prefix on >> the accession number. Even worse, suppose the case sensitivity >> between the accession number's prefix and the cvLabel don't match. >> The best we can do is specify things like case sensitivity for these >> issues or force a certain case in certain contexts. We can't prevent >> people from writing broken instances of the specification. > > Again, you fail to see the point. The corrrect usage of CV terms can > be validated. So if you mistype a number or its prefix, this will be > considered an error. We need numbers because we want to be able to > deal with synonyms (or even outright changes in the term names; it has > happened before). Numbers are robust, numbers are convenient, numbers > are strong. Text is not. > Ah, so you WANT support for synonyms. I don't really understand that in the context of writing a standard data-representation format, but ok. >> Based on the above requirement, one concern that I have (and I think >> many others do too, because frankly I get a strong impression that >> many people who want to use this spec don't care about being CV >> aware) is that a writer should be able to write a cvParam with a >> value that is not in the allowed value set of the CV without making >> readers have no clue what the value is actually indicating. In other >> words, regardless of whether a reader is CV aware or not, a (machine >> OR human) reader should be able to glean the purpose of an unknown >> value in a cvParam via some kind of category specification (e.g. >> "instrument model", or by the category's accession number). If this >> is accepted as a requirement, it practically eliminates method A as >> an option because it provides no indication of what category the >> unknown cvParam's value belongs to. > > There is the option to include userparams. Alternatively, you take the > productive approach and signal the need to add the term to the CV. > Remember that powerful and extremely user-friendly tools like the OLS > take care of updating new terms for you fully automatically. If you > need to know the context of a term, referring to the CV should be your > first and most prominent approach. Oh yes, the userParam. A synonym for the <comment> element ;). Please tell me how to use such an element in a meaningful and deterministic way. If I write a value into a cvParam with the category "instrument model" where the value text is "Super Duper Ion Trap" and the value's accession number is a special accession number which means "not yet in CV", ANY reader software should be able to interpret that parameter and ultimately say that it has no idea what to do with data from such an instrument. 
The reader software can even be updated to know how to deal with that instrument by its value text instead of the value accession number, and once that's done some usable data already exists. Nobody had to wait for that instrument model to be added to the CV for the data to be usable. After that instrument model is added to the CV, of course, the writer should be updated to use the proper accession number. If a reader is using the CV tools, their parser will be capable of reading such data automatically, and any reader that chose to manually update in order to deal with the value text while the value accession indicated "not yet in CV" can then choose whether to keep that support intact in order to deal with the data that was already generated, or it can remove it and return to using the pure CV. If a primary goal is "flexibility," then forcing people to add a web service to their XML parser in order to get the CV is seriously stretching that goal. >> There are perhaps other requirements for the cvParam, but I'll let >> others fill them in. My new proposed compromise is to split values >> into a valueAccession and a valueName, just like the optional >> unitAccession and unitName. The two value attributes would not be >> optional like the unit attributes, though. A special CV accession >> number could be allocated to indicate an "unrestricted" value, in >> which case the reader would use the valueName as the value. >> Alternatively, the reader could read the accession attribute (which >> in this compromise would always indicate a category's accession >> number) and choose based on that whether to look up the >> valueAccession in the CV or to use the valueName verbatim. So the SRM >> spectrum example would become: >> <cvParam cvLabel="MS" accession="MS:1000035" name="spectrum type" >> valueAccession="MS:1000583" valueName="SRM spectrum"/> > > For various complex reasons, this will wreck havoc. Because now the > two (accession and value accession) run the (unnecessary!) risk of > being able to go out of sync. > I see you have not elected to enumerate these various complex reasons, or describe what on earth you mean by having the accession numbers go out of sync. I think you failed to notice that this compromise is very similar to method C, which in a recent post you put in your (tied) vote for. In my opinion, it looks better and is more intuitive than the syntax in method C, but the semantics are exactly the same. In method C it would look like: <cvParam cvLabel="MS" categoryAccession="MS:1000035" categoryName="spectrum type" accession="MS:1000583" name="SRM spectrum"/> You see? Straight out of the specification document. Were you perhaps referring to the special accession numbers? I proposed one that would mean that the value is "unrestricted" and another that would mean that the current value is not yet added to the cvParam but has been (or soon will be) submitted for adding (patent pending, if you will). > I seem to read in your comments so far that there is a certain > reluctance to the use of CV terms because this is new, and doesn't fit > well with what you are good at right now. I would ask that you have a > look at CV's on OLS (http://www.ebi.ac.uk/ols), and readthe developer > documentation on how to access the OLS web services using your > favourite programming language. After playing with it a bit, you'll > notice that incorporating CV's into the parsing is not that much work, > yet yields very clear benefits. > You read correctly. 
The clear benefits that the CV provides are not having to update the parser manually to deal with new CV terms and having a unified set of categories and values from which to generate data models. Excuse my rudeness, but: Whoopdeedoo! The vast majority of development effort is NOT in the parser, regardless of whether the parser is automatically or manually written. The vast majority of development is in the PROCESSING of the data that gets parsed, and unless I'm missing something big, the CV provides no benefit at all for processing new kinds of data. I'm NOT suggesting that the CV should provide such a benefit, of course, only trying to convey the reason for my reluctance. In other words, I have no qualms about writing a new "else if" block to my parser every time a new kind of data comes out, considering that I will always have to add 500 other lines of code elsewhere in my software to actually process the new kind of data in a meaningful way. -Matt |
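A hypothetical side-by-side of the parsing burden discussed above, using the method letters from the draft spec. The method C attribute names follow the example quoted earlier in this thread; the second accession is a placeholder, not a real CV term, and the helper names are illustrative only.

    # Method A: the term itself is the value, so a CV-unaware parser needs one
    # branch per value accession just to know what kind of fact it is reading.
    def classify_method_a(param):
        acc = param.get("accession")
        if acc == "MS:1000583":        # SRM spectrum
            return "spectrum type", "SRM spectrum"
        elif acc == "MS:xxxxxxx":      # ... one branch per known value accession
            return "spectrum type", "some other spectrum type"
        return None, acc               # unknown accession: meaning is opaque

    # Method C: category and value are both spelled out, so an unknown value
    # still arrives labelled with its category and needs no new parser code.
    def classify_method_c(param):
        return param.get("categoryName"), param.get("name")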
From: Angel P. <an...@ma...> - 2007-10-04 16:44:05
|
Lennart and Matt, While I appreciate that this is a topic of great interest to everyone in the community, let's turn the heat down a bit. Let me see if I can play the arbiter here: cvParams since their introduction have always been contentious. Given the choice for design of a data format where attributes (or sub elements or inner text) could be encoded with a tight set of enumerated values vs. empty slots, a developer will always choose the former. Why then did the mzML group choose cvParams? The answer is twofold: 1) the audience, and 2) the intent of the standard 1) Name one standard that has received industry support across multiple vendors/tools/institutions that is tightly controlled with enumerated values. Prove me wrong, but I can't think of any. The reason for this is that consensus building is a slow process and approval of any change in a data format can take months if not years. You need flexible data formats for standards. This already rules out enumerated values, but you can also make the case that vendors are unwilling to tie their development efforts to projects that are not under their complete control (essentially motivated by risk management). As a vendor, if you officially support even one release of a fast moving data format, customer expectations are such that you are now expected to support all future releases of that format. 2) The intent of mzML is data transfer and vendor independent storage of mass spec experimental data. It is not (officially) meant to be an operational format. Operational formats would put much more weight on the side of enumerated values. So for these reasons (there are more though) cvParams are not going to go away. As for actually doing work with mzML files, Matt is absolutely right, this is going to be way more difficult than working with mzXML 2.x (as a developer). While OLS is a fine and dandy project, it is not the end-all be-all solution to our problems. It assumes network connectivity, which is a dubious assumption. Even assuming very fast connectivity, the overhead of SOAP protocols is waaaayyy too big to accept in your typical use of mzML files, which are signal processing and searches. Please stop equating OLS with mzML (or any other ML) since for most uses outside of a repository it just won't work. -a |
From: Matthew C. <mat...@va...> - 2007-10-04 16:57:25
|
Thanks Angel, I didn't intend for the discussion to get heated, it just seemed to me that Lennart didn't understand what I posted (which may be my fault, it's hard to know without other replies). Remember I posted that I agree with cvParams and appreciate the flexibility they provide. But there is a difference between cvParams that have meaning without the CV and cvParams that don't. I much prefer the former. So neither of us is arguing for cvParams to go away. You must be talking to somebody else. :) -Matt Angel Pizarro wrote: > Lennert and Matt, > > While I appreciate that this is a topic of great interest to everyone > in the community, let's turn the heat down a bit. Let me see if I can > play the arbiter here: > > cvParams since their introduction have always been contentious. Given > the choice for design of a data formate where attributes (or sub > elements or inner text) could be encoded with a tight set of > enumerated sets of values vs. empty slots, a developer will always > choose the former. > > Why then did the mzML group choose cvParams? The answer is two fold: > 1) the audience, and 2) the intent of the standard > > 1) Name one standard that has received industry support across > multiple vendors/tools/institutions that is tightly controlled with > enumerated values. Prove me wrong, but I can't think of any. > > The reasons for this is that consensus building is a slow process and > approval of any change in a data format can take months if not years. > You need flexible data formats for standards. This already rules out > enumerated values, but you can also make the case that vendors are > unwilling to tie their development efforts to projects that are not > under their complete control (essentially motivated by risk > management). As a vendor, if you officially support even on release of > a fast moving data format, customer expectations are such that you are > now expected to support all future releases of that format. > > 2) The intent of mzML is data transfer and vendor independent storage > of mass spec experimental data. It is not (officially) meant to be an > operational format. Operational formats would put much more weight on > the side of enumerated values. > > > So for theses reasons (there are more though) cvParams are not going > to go away. As for actually doing work with mzML files, Matt is > absolutely right, this is going to be way more difficult than working > with mzXML 2.x (as a developer) While OLS is a fine andd dandy > project, it is not the end-all be-all solution to our problems. It > assumes network connectivity, which is a dubious assumption. Even > assuming very fast connectivity, the overhead of SOAP protocols are > waaaayyy too big to except in your typical use of mzML files, which > are signal processing and searches. Please stop equating OLS with > mzML (or any other ML) since for most uses outside of a repository it > just won't work. -a |
From: Eric D. <ede...@sy...> - 2007-10-04 18:29:14
|
Hi everyone, thank you for the discussion. Please do try to keep the posts fairly respectful since we don't want to turn off others from contributing to the discussion. I won't be able to reply to all the posts here, but I am reading them with interest. I do note that the discussion is dominated by a few. I know there is a large group of lurkers out there, reading but not saying anything. I would highly encourage those who have not yet contributed to send a short note with your thoughts, however brief. While it may be a small group of us wrestling with the details, we're very interested in what everyone else out there is thinking, including the vendors. Even regarding the issues of extensive use of cvParams and the CV and a long-term stable schema: the builders of mzML have taken cues from the community that this is important to them, despite the rapidly advancing field. If you have opinions on this, please do share them. Most XML formats that I generate and use myself are quite strongly structured, so such heavy use of cvParams is a stretch for me. I agree that there is a significant element of risk here. I want to believe that we can make this work because we have a high quality semantic validator easily available to the community at the time of submission for review. This is new, as far as I'm aware. If we can get everyone to use that validator responsibly, this may be a success story. Thank you! Eric > -----Original Message----- > From: psi...@li... [mailto:psidev-ms-dev- > bo...@li...] On Behalf Of Matthew Chambers > Sent: Thursday, October 04, 2007 9:57 AM > To: Angel Pizarro > Cc: psi...@li... > Subject: Re: [Psidev-ms-dev] attributes vs cvParams > > Thanks Angel, I didn't intend for the discussion to get heated, it just > seemed to me that Lennart didn't understand what I posted (which may be > my fault, it's hard to know without other replies). Remember I posted > that I agree with cvParams and appreciate the flexibility they provide. > But there is a difference between cvParams that have meaning without the > CV and cvParams that aren't. I much prefer the latter. So neither of > us are arguing for cvParams to go away. You must be talking to somebody > else. :) > > -Matt > > Angel Pizarro wrote: > > Lennert and Matt, > > > > While I appreciate that this is a topic of great interest to everyone > > in the community, let's turn the heat down a bit. Let me see if I can > > play the arbiter here: > > > > cvParams since their introduction have always been contentious. Given > > the choice for design of a data formate where attributes (or sub > > elements or inner text) could be encoded with a tight set of > > enumerated sets of values vs. empty slots, a developer will always > > choose the former. > > > > Why then did the mzML group choose cvParams? The answer is two fold: > > 1) the audience, and 2) the intent of the standard > > > > 1) Name one standard that has received industry support across > > multiple vendors/tools/institutions that is tightly controlled with > > enumerated values. Prove me wrong, but I can't think of any. > > > > The reasons for this is that consensus building is a slow process and > > approval of any change in a data format can take months if not years. > > You need flexible data formats for standards. This already rules out > > enumerated values, but you can also make the case that vendors are > > unwilling to tie their development efforts to projects that are not > > under their complete control (essentially motivated by risk > > management). 
As a vendor, if you officially support even on release of > > a fast moving data format, customer expectations are such that you are > > now expected to support all future releases of that format. > > > > 2) The intent of mzML is data transfer and vendor independent storage > > of mass spec experimental data. It is not (officially) meant to be an > > operational format. Operational formats would put much more weight on > > the side of enumerated values. > > > > > > So for theses reasons (there are more though) cvParams are not going > > to go away. As for actually doing work with mzML files, Matt is > > absolutely right, this is going to be way more difficult than working > > with mzXML 2.x (as a developer) While OLS is a fine andd dandy > > project, it is not the end-all be-all solution to our problems. It > > assumes network connectivity, which is a dubious assumption. Even > > assuming very fast connectivity, the overhead of SOAP protocols are > > waaaayyy too big to except in your typical use of mzML files, which > > are signal processing and searches. Please stop equating OLS with > > mzML (or any other ML) since for most uses outside of a repository it > > just won't work. -a |
From: Mike C. <tu...@gm...> - 2007-10-04 17:08:07
|
On 10/4/07, Matthew Chambers <mat...@va...> wrote: > Oh yes, the userParam. A synonym for the <comment> element ;). Please > tell me how to use such an element in a meaningful and deterministic > way. If I write a value into a cvParam with the category "instrument > model" where the value text is "Super Duper Ion Trap" and the value's > accession number is a special accession number which means "not yet in > CV", ANY reader software should be able to interpret that parameter and > ultimately say that it has no idea what to do with data from such an > instrument. I agree with Matt here. In particular, if I encounter this new "Super Duper Ion Trap" for the first time, it would be completely unacceptable for my software to indicate this by saying that my mzML file is invalid. My software needs to be able to parse this file and tell me that the data came from a new instrument called "Super Duper Ion Trap" that it doesn't know how to deal with. Mike |
From: Angel P. <an...@ma...> - 2007-10-04 17:56:12
|
On 10/4/07, Mike Coleman <tu...@gm...> wrote: > > On 10/4/07, Matthew Chambers <mat...@va...> wrote: > > Oh yes, the userParam. A synonym for the <comment> element ;). Please > > tell me how to use such an element in a meaningful and deterministic > > way. If I write a value into a cvParam with the category "instrument > > model" where the value text is "Super Duper Ion Trap" and the value's > > accession number is a special accession number which means "not yet in > > CV", ANY reader software should be able to interpret that parameter and > > ultimately say that it has no idea what to do with data from such an > > instrument. > > I agree with Matt here. In particular, if I encounter this new "Super > Duper Ion Trap" for the first time, it would be completely > unacceptable for my software to indicate this by saying that my mzML > file is invalid. My software needs to be able to parse this file and > tell me that the data came from a new instrument called "Super Duper > Ion Trap" that it doesn't know how to deal with. WRT to my point about operational vs. repository data formats. For a repository, it is completely valid (and desirable) for the software to parse this new value and add it to the list of possible values for the ontology category. -angel |
From: Matthew C. <mat...@va...> - 2007-10-04 18:39:35
|
I'm not sure what you're saying here. Users can programmatically (via a web service, I presume) add terms to the CV without going through a community approval process? If it's something else, please elaborate. -Matt Angel Pizarro wrote: > > WRT to my point about operational vs. repository data formats. For a > repository, it is completely valid (and desirable) for the software to > parse this new value and add it to the list of possible values for the > ontology category. > > -angel > > |
From: Angel P. <an...@ma...> - 2007-10-04 18:51:05
|
Y, I guess that it was not too clear, sorry about that. I did not mean to imply users can add terms and accession on the fly. That would be a userParam. cvParams need a source CV and that source CV would be the portal for submitting new terms. Shameless plug for PSI: all of the working groups have a CV development component, so if an area is important to you, please review the CV's and send additions / amendments to the group for review! By my reply, I only meant that parsers written for data loading into a repository ( e.g. theGPM / CPAS / SBEAMS) have a different set of requirements than other tools. New terms (e.g. not in the repositories catalog yet) should not be show-stoppers for those types of parsers. -angel On 10/4/07, Matthew Chambers <mat...@va...> wrote: > > I'm not sure what you're saying here. Users can programmatically (via a > web service, I presume) add terms to the CV without going through a > community approval process? If it's something else, please elaborate. > > -Matt > > Angel Pizarro wrote: > > > > WRT to my point about operational vs. repository data formats. For a > > repository, it is completely valid (and desirable) for the software to > > parse this new value and add it to the list of possible values for the > > ontology category. > > > > -angel > > > > > |
From: Matthew C. <mat...@va...> - 2007-10-04 19:00:32
|
Oh, I understand now. I am not familiar with what GPM/CPAS/SBEAMS do with MS data when they parse it, but I can certainly conceive of simply reading the cvParams in as key-value pairs and storing them as text. Like I said earlier, it's adding support in signal processing software for the new terms that has the greatest cost, and very little of that cost needs to go toward supporting the new terms in the software's parser. If the cvParam takes the form of method A in the spec, though, then a manually written, CV-unaware parser could potentially require significant changes, whereas method B or C (or my modified proposal of C) would not. -Matt Angel Pizarro wrote: > Y, I guess that it was not too clear, sorry about that. I did not mean > to imply users can add terms and accession on the fly. That would be a > userParam. cvParams need a source CV and that source CV would be the > portal for submitting new terms. > > Shameless plug for PSI: all of the working groups have a CV > development component, so if an area is important to you, please > review the CV's and send additions / amendments to the group for review! > > By my reply, I only meant that parsers written for data loading into > a repository ( e.g. theGPM / CPAS / SBEAMS) have a different set of > requirements than other tools. New terms (e.g. not in the repositories > catalog yet) should not be show-stoppers for those types of parsers. > > -angel > On 10/4/07, *Matthew Chambers* < mat...@va... > <mailto:mat...@va...>> wrote: > > I'm not sure what you're saying here. Users can programmatically > (via a > web service, I presume) add terms to the CV without going through a > community approval process? If it's something else, please elaborate. > > -Matt > > Angel Pizarro wrote: > > > > WRT to my point about operational vs. repository data formats. > For a > > repository, it is completely valid (and desirable) for the > software to > > parse this new value and add it to the list of possible values > for the > > ontology category. > > > > -angel > > > > > > |
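A sketch of the repository-style pass described above: capture every cvParam as plain text and defer interpretation. The element and attribute names follow the draft schema excerpts quoted in this thread; everything else (function name, storage as tuples) is illustrative only.

    import xml.etree.ElementTree as ET

    def collect_cv_params(path):
        """Stream a (possibly huge) mzML-style file and keep cvParams as text triples."""
        rows = []
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag.endswith("cvParam"):
                rows.append((elem.get("accession"), elem.get("name"), elem.get("value", "")))
                elem.clear()  # discard the element once its attributes are captured
        return rows

    # rows can then be bulk-loaded into a key-value table; terms the repository has
    # never seen simply appear as new rows instead of breaking the loader.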