Re: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of errors?)

Hi Matt,

>> CV is organized by accession numbers, which are unique, whereas the
schema is organized by element names, which are usually unique but not
always.

Right, but I propose to use XSD value restriction syntax to associate each
element with a unique accession number.  The schema can declare things like
"a 'foo' element will always have an attribute named 'accession' which has
exactly one legal value 'MS:12345', and a 'bar' element will always have an
attribute named 'accession' which has exactly one legal value 'MS:54321'".
Throw in the inheritance mechanisms of W3C schema and even if you wound up
with two elements with the same name (in different branches of the
inheritance tree, of course) they'd still be instantly uniquely identifiable
by the value given in the accession attribute, and a validating parser could
automagically intercept bogus accession numbers.  Let's imagine an element
"foo" with a subelement "crunchyCoating", and another element "bar" that
also, as it happens, has a subelement named "crunchyCoating".  Because we
have assigned each element a unique accession number and used XSD
restriction to enforce it, we can take an element instance completely out of
context: 

<crunchyCoating accession="MS:1000321" value="12.5"/>

and still understand it even though it looks a lot like this other one taken
out of context:

<crunchyCoating accession="MS:1000777"
    value="My cat's breath smells like cat food."/>

Moving over to our CV (or schema) we can learn that MS:1000321 is defined as
"Snell hardness of candy bar outer layer" and MS:1000777 is defined as
"Ralph Wiggum quote".

In practice, I'd probably want to declare the accession attribute optional
since for most applications it's just a waste of bytes (can be derived from
context+schema), but for the deeply paranoid it can be there explicitly.
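
To make the restriction idea concrete, here is a rough sketch of how such a
schema might look. It is only an illustration, not a proposal for the real
schema: the element names come from the joke example above, the "value"
attribute name is an assumption of mine, and the accession attribute is
declared optional as just described.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- "foo" has a local crunchyCoating whose accession, if present,
       can only be MS:1000321. -->
  <xs:element name="foo">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="crunchyCoating">
          <xs:complexType>
            <!-- "value" is an assumed attribute name for the measurement. -->
            <xs:attribute name="value" type="xs:decimal" use="required"/>
            <xs:attribute name="accession" use="optional">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="MS:1000321"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- "bar" also has a local crunchyCoating, with a different type and a
       different single legal accession value, MS:1000777. -->
  <xs:element name="bar">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="crunchyCoating">
          <xs:complexType>
            <xs:attribute name="value" type="xs:string" use="required"/>
            <xs:attribute name="accession" use="optional">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="MS:1000777"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

A validating parser working from a schema along these lines would reject a
crunchyCoating inside foo that carried any accession other than MS:1000321.
Declaring the attribute with fixed="MS:1000321" instead of a one-value
enumeration should have much the same effect.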

>> I'd rather extend the OBO format with the features we need unless such an
extension would be prohibitively difficult to implement.

I'm sorry, that just seems insane when the whole W3C ecosystem already
exists to deal with these sorts of mundane data typing and validation
issues.  There must be better uses of our time.

>> As for changes in the parser code, are you referring to the semantic
validator or to an applied user of the format? 

I was thinking of the applied use of the format, but it's true in either
case.   From what I can tell, most anticipated changes to the ontology are
really just additions to attribute value restriction lists (again with the
adding a new mass spec model example), which really ought not to force
changes to reader code for the most part.  Returning to our favored example,
it's just a different string to put in the "mass spec type" record in your
database or what have you (OK, maybe the new model represents a whole new
technology and your app actually needs a big rewrite, but you take my
meaning, I hope).
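
As a hedged illustration of what I mean by a restriction list, a sketch like
the following would do. The type name, element name, attribute name, and
model strings are all made up for the example and are not taken from the
actual schema or CV.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical restriction list of known mass spec models. -->
  <xs:simpleType name="InstrumentModelType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Example Model A"/>
      <xs:enumeration value="Example Model B"/>
      <!-- Supporting a newly released model means adding one line here. -->
      <xs:enumeration value="Example Model C"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical element whose "model" attribute is limited to the list. -->
  <xs:element name="instrument">
    <xs:complexType>
      <xs:attribute name="model" type="InstrumentModelType" use="required"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Reader code that simply stores the model string never needs to know the list
grew; only the schema file changes.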

- Brian

  _____  

From: psi...@li...
[mailto:psi...@li...] On Behalf Of Matthew
Chambers
Sent: Monday, October 08, 2007 2:00 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of
errors?)

Brian Pratt wrote: 

Hi Matt,

Sorry, I had meant to explicitly point out how the XSD orientation addresses
your synonym concerns, although in the end I think I misunderstood them.
Each element has associated with it precisely one correct accession
attribute value, and you can use that to determine whether or not the element
is actually the thing you suspect it is since all true synonyms point back to
the same accession number.  The element names remain stable, as one would
hope.  I don't understand, though, why you're interested in making it easy
to *introduce* synonyms - I was assuming that the purpose of this
standardization effort was to *do away* with synonyms so as to reduce
ambiguity.

I agree that doing away with synonyms is pretty much the whole purpose of a
controlled vocabulary, both for values and for categories, but others have
apparently supported it (I think Lennart was the last one to remind me of
supporting synonyms which was the reason to give controlled values accession
numbers).  If we had a CV format capable of representing the controlled
values, then that would be a simpler way to maintain the synonyms than with
the schema.  This is because the CV is organized by accession numbers, which
are unique, whereas the schema is organized by element names, which are
usually unique but not always.

I assume semantic validation is a goal, or we wouldn't have the business
with reflectron state going on.  In any case, a spec that doesn't lead to
semantic validation is a poor sort of spec.

Agreed.

IS_A and PART_OF already have well defined meanings (see
http://obofoundry.org/ro/), so we really can't redefine them for our own
purposes.  The mechanism for enumerating a value range just isn't there, so
the authors have tried to hack it with the inheritance techniques available,
which leads to all the gyrations over how to add a new instrument type.
This is just a sign that we're trying to drive a screw with a hammer, or
whatever metaphor you prefer for a "not even wrong" scenario.

It seems that OBO is currently incapable of providing a way for us to
control the values for our categories and that it's not really intended for
that.  So we either must extend it with support for that relationship (as
well as specifying types and ranges for uncontrolled value categories), or
forget it entirely and stick with the schema.  Personally I prefer the
accession numbers for the categories and for the values, so I'd rather
extend the OBO format with the features we need unless such an extension
would be prohibitively difficult to implement.

I don't understand the assertion that pushing the maintenance load into the
CV brings greater flexibility (nor the use of the term "flat" in describing
the CV, which is just an obfuscated inheritance tree).  Maintaining the CV
directly has now been demonstrated as providing plenty of flexibility to
screw up the inheritance hierarchy of the terms, but that's not a good
thing, and doesn't seem inherently more flexible than doing the maintenance
in an XSD.  In either case, the vast majority of changes one is likely to
make are along the lines of adding a new instrument type, which would not
engender a change to the parser code.  No, wait, it WOULD engender a change
to the parser code in a CV-centric world, because the only way to express
restriction lists is through inheritance instead of simple value
restrictions.  So, it's actually less flexible to maintain the CV.

I was only referring to the support for synonyms.  If synonyms are rejected,
then as far as I can tell, maintaining the CV with an auto-generated schema
would be worse than maintaining a hand-rolled schema by itself.  As for
changes in the parser code, are you referring to the semantic validator or
to an applied user of the format?  In the former case, the schema would
either be edited by hand or be auto-generated (with updated schema
restrictions) after a CV update, neither of which would require an update to
the validator.  In the latter case, we come back to the A, B, and C options
for cvParams (not to mention D, E, and F ;) ). If the cvParams stop at the
category level, then parsers needn't be updated to understand new values.
If cvParams can refer to a value by itself, then the parser is a pain in the
ass to write and it would need to be updated whenever the CV/schema was
updated.

-Matt