From: Brian P. <bri...@in...> - 2007-10-08 21:43:11

Hi Matt,

>> CV is organized by accession numbers, which are unique, whereas the schema is organized by element names, which are usually unique but not always.

Right, but I propose to use XSD value restriction syntax to associate each element with a unique accession number. The schema can declare things like "a 'foo' element will always have an attribute named 'accession' which has exactly one legal value 'MS:12345', and a 'bar' element will always have an attribute named 'accession' which has exactly one legal value 'MS:54321'". Throw in the inheritance mechanisms of W3C schema and even if you wound up with two elements with the same name (in different branches of the inheritance tree, of course) they'd still be instantly uniquely identifiable by the value given in the accession attribute, and a validating parser could automagically intercept bogus accession numbers.

Let's imagine an element "foo" with a subelement "crunchyCoating", and another element "bar" that also, as it happens, has a subelement named "crunchyCoating". Because we have assigned each element a unique accession number and used XSD restriction to enforce it, we can take an element instance completely out of context:

<crunchyCoating accession="MS:1000321" 12.5/>

and still understand it even though it looks unlike this other one taken out of context:

<crunchyCoating accession="MS:1000777" "My cat's breath smells like cat food."/>

Moving over to our CV (or schema) we can learn that MS:1000321 is defined as "Snell hardness of candy bar outer layer" and MS:1000777 is defined as "Ralph Wiggum quote".

In practice, I'd probably want to declare the accession attribute optional since for most applications it's just a waste of bytes (it can be derived from context+schema), but for the deeply paranoid it can be there explicitly.

>> I'd rather extend the OBO format with the features we need unless such an extension would be prohibitively difficult to implement.

I'm sorry, that just seems insane when the whole W3C ecosystem already exists to deal with these sorts of mundane data typing and validation issues. There must be better uses of our time.

>> As for changes in the parser code, are you referring to the semantic validator or to an applied user of the format?

I was thinking of the applied use of the format, but it's true in either case. From what I can tell, most anticipated changes to the ontology are really just additions to attribute value restriction lists (again with the adding a new mass spec model example), which really ought not to force changes to reader code for the most part. Returning to our favored example, it's just a different string to put in the "mass spec type" record in your database or what have you (OK, maybe the new model represents a whole new technology and your app actually needs a big rewrite, but you take my meaning, I hope).

- Brian

_____

From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Monday, October 08, 2007 2:00 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of errors?)

Brian Pratt wrote:

>> Hi Matt,
>>
>> Sorry, I had meant to explicitly point out how the XSD orientation addresses your synonym concerns, although in the end I think I misunderstood them. Each element has associated with it precisely one correct accession attribute value, and you can use that to determine whether or not the element is actually the thing you suspect it is, since all true synonyms point back to the same accession number. The element names remain stable, as one would hope. I don't understand, though, why you're interested in making it easy to *introduce* synonyms - I was assuming that the purpose of this standardization effort was to *do away* with synonyms so as to reduce ambiguity.

I agree that doing away with synonyms is pretty much the whole purpose of a controlled vocabulary, both for values and for categories, but others have apparently supported it (I think Lennart was the last one to remind me of supporting synonyms which was the reason to give controlled values accession numbers). If we had a CV format capable of representing the controlled values, then that would be a simpler way to maintain the synonyms than with the schema. This is because the CV is organized by accession numbers, which are unique, whereas the schema is organized by element names, which are usually unique but not always.

>> I assume semantic validation is a goal, or we wouldn't have the business with reflectron state going on. In any case, a spec that doesn't lead to semantic validation is a poor sort of spec.

Agreed.

>> IS_A and PART_OF already have well defined meanings (see http://obofoundry.org/ro/), so we really can't redefine them for our own purposes. The mechanism for enumerating a value range just isn't there, so the authors have tried to hack it with the inheritance techniques available, which leads to all the gyrations over how to add a new instrument type. This is just a sign that we're trying to drive a screw with a hammer, or whatever metaphor you prefer for a "not even wrong" scenario.

It seems that OBO is currently incapable of providing a way for us to control the values for our categories and that it's not really intended for that. So we either must extend it with support for that relationship (as well as specifying types and ranges for uncontrolled value categories), or forget it entirely and stick with the schema. Personally I prefer the accession numbers for the categories and for the values, so I'd rather extend the OBO format with the features we need unless such an extension would be prohibitively difficult to implement.

>> I don't understand the assertion that pushing the maintenance load into the CV brings greater flexibility (nor the use of the term "flat" in describing the CV, which is just an obfuscated inheritance tree). Maintaining the CV directly has now been demonstrated as providing plenty of flexibility to screw up the inheritance hierarchy of the terms, but that's not a good thing, and doesn't seem inherently more flexible than doing the maintenance in an XSD. In either case, the vast majority of changes one is likely to make are along the lines of adding a new instrument type, which would not engender a change to the parser code. No, wait, it WOULD engender a change to the parser code in a CV-centric world, because the only way to express restriction lists is through inheritance instead of simple value restrictions. So, it's actually less flexible to maintain the CV.

I was only referring to the support for synonyms. If synonyms are rejected, then as far as I can tell, maintaining the CV with an auto-generated schema would be worse than maintaining a hand-rolled schema by itself.

As for changes in the parser code, are you referring to the semantic validator or to an applied user of the format? In the former case, the schema would either be edited by hand or be auto-generated (with updated schema restrictions) after a CV update, neither of which would require an update to the validator. In the latter case, we come back to the A, B, and C options for cvParams (not to mention D, E, and F ;) ). If the cvParams stop at the category level, then parsers needn't be updated to understand new values. If cvParams can refer to a value by itself, then the parser is a pain in the ass to write and it would need to be updated whenever the CV/schema was updated.

-Matt
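
The accession scheme proposed at the top of this thread (one legal accession per element declaration, optional in the instance, rejected when wrong) can be sketched roughly as follows. This is only an illustration in Python, not XSD and not anything from the PSI tooling: the two lookup tables are hypothetical stand-ins for the schema and the CV, the `value` attribute is an assumption added to make the sample well-formed XML, and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the schema: each (parent, child) element
# declaration has exactly one legal accession value.
LEGAL_ACCESSION = {
    ("foo", "crunchyCoating"): "MS:1000321",
    ("bar", "crunchyCoating"): "MS:1000777",
}

# Hypothetical stand-in for the CV: accession -> definition.
CV_DEFINITIONS = {
    "MS:1000321": "Snell hardness of candy bar outer layer",
    "MS:1000777": "Ralph Wiggum quote",
}


def resolve_accessions(xml_text):
    """Return (child_tag, accession, definition) for each known child element.

    A missing accession attribute is derived from context (the parent
    element); an explicit one that contradicts the table raises ValueError.
    """
    root = ET.fromstring(xml_text)
    resolved = []
    for parent in root.iter():
        for child in parent:
            expected = LEGAL_ACCESSION.get((parent.tag, child.tag))
            if expected is None:
                continue  # element not covered by this toy schema
            given = child.get("accession", expected)
            if given != expected:
                raise ValueError(
                    f"<{child.tag}> in <{parent.tag}>: bogus accession "
                    f"{given!r}, expected {expected!r}"
                )
            resolved.append((child.tag, given, CV_DEFINITIONS[given]))
    return resolved


def define(element_text):
    """Interpret a single element taken out of context via its accession."""
    element = ET.fromstring(element_text)
    return CV_DEFINITIONS[element.attrib["accession"]]


if __name__ == "__main__":
    sample = """
    <root>
      <foo><crunchyCoating accession="MS:1000321" value="12.5"/></foo>
      <bar><crunchyCoating value="My cat's breath smells like cat food."/></bar>
    </root>
    """
    for row in resolve_accessions(sample):
        print(row)
```

Adding a new controlled value in this sketch means adding a table entry rather than changing the functions, which is the point made above about new instrument types not forcing reader changes.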