Re: [Psidev-ms-dev] attributes vs cvParams


On 10/4/07, Lennart Martens <len...@gm...> wrote:
> This is no use. It immediately breaks down in the face of synonyms.
> Accession numbers are the way to go. Everybody in the life sciences
> knows and understands this principle ('9606' is 'human' or 'Homo
> sapiens' or 'man' or ...)

Hmm.  I think what you are saying is that end users are not always
able to properly distinguish between canonical *identifiers* (e.g.,
'9606' or 'human') and descriptive text unless the former happens to
look like a meaningless string, such as a string of digits.

That may be, but strings of digits have their own problems.  It's a
lot easier to see that 'humaZ' is probably an invalid identifier than
that '9607' is, when looking for (the inevitable) problems.

I think that biologists understand the value of having semi-meaningful
identifiers.  They don't use digit strings for gene identifiers, for
example.

> That would make for very poor mzML documents then, as we semantically
> validate these files now (see the semantic validator in the beefier mzML
> kit). Your CV-less files would surely not validate, and would NOT be
> mzML files.

Hmm.  How complex is a minimal valid mzML file?  If such files are not
fairly easy to generate, without knowing much about the CV, this seems
like a problem.

> Sorry, but you are erroneously jumping to conclusions. The CV allows
> children to be added dynamically, correct usage of these can be
> validated and the list of children can be updated on-the-fly from web
> resources like the OLS (which auto-update every night).

I'm not sure what this means.  A nightly update of terms from the web
cannot be on our critical path for processing of spectra.  We need to
be able to proceed even if the OLS disappears forever.
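
To make the concern concrete, here is a minimal sketch of checking a
cvParam against a local snapshot of the CV, with no network access at
all.  The LOCAL_CV table and the check_cv_param function are
hypothetical names for illustration, the two entries are only sample
data, and none of this is taken from the mzML kit or the OLS.

```python
# Hypothetical local snapshot of the CV: accession -> term name.
# The entries are illustrative sample data, not an authoritative list.
LOCAL_CV = {
    "MS:1000511": "ms level",
    "MS:1000127": "centroid spectrum",
}


def check_cv_param(accession, name, cv=LOCAL_CV):
    """Return a list of problems for one cvParam; never touches the network."""
    problems = []
    if accession not in cv:
        problems.append("unknown accession %r" % accession)
    elif cv[accession] != name:
        problems.append(
            "name %r does not match %r for %s" % (name, cv[accession], accession)
        )
    return problems


if __name__ == "__main__":
    print(check_cv_param("MS:1000511", "ms level"))   # no problems
    print(check_cv_param("MS:1000512", "ms level"))   # mistyped accession
```

A snapshot like this could be shipped with the software and refreshed
whenever convenient, so processing would not depend on a web service
being up.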

> Again, you fail to see the point. The correct usage of CV terms can be
> validated. So if you mistype a number or its prefix, this will be
> considered an error. We need numbers because we want to be able to deal
> with synonyms (or even outright changes in the term names; it has
> happened before). Numbers are robust, numbers are convenient, numbers
> are strong. Text is not.

Actually, it's the other way around.  Character strings are robust
and convenient, numbers are not.  The string 'human' is clearly not
equal to 'humaZ'.  The string '123' is clearly not equal to the string
'0123'.  Is the number 123 the same as or different from 0123?  How about
0 and -0, not to mention 123.4 and 123.40 or 0.999999999999 and 1.0?
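
A short sketch of the difference, using plain Python.  The helper names
same_as_strings, same_as_numbers, and to_float32 are made up for this
example; the point is only that string comparison has one obvious
answer, while numeric comparison depends on how the token is parsed and
at what precision.

```python
import struct


def same_as_strings(a, b):
    """Compare two identifiers character by character."""
    return a == b


def same_as_numbers(a, b):
    """Compare two identifiers after parsing them as floating-point numbers."""
    return float(a) == float(b)


def to_float32(text):
    """Parse text and round it to single precision, as a 32-bit reader would."""
    return struct.unpack("f", struct.pack("f", float(text)))[0]


if __name__ == "__main__":
    pairs = [("123", "0123"), ("0", "-0"), ("123.4", "123.40")]
    for a, b in pairs:
        # Different as strings, equal once parsed as numbers.
        print(a, b, same_as_strings(a, b), same_as_numbers(a, b))

    # Distinct in double precision, identical in single precision.
    print(same_as_numbers("0.999999999999", "1.0"))
    print(to_float32("0.999999999999") == to_float32("1.0"))
```

Every pair in the loop differs as a string but compares equal as a
number, and the last two lines give opposite answers for the same pair
of tokens depending on the precision used.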

The use of numbers in a context like this seems to be mostly due to
history.  They may be a little more convenient for programmers, but
that's negligible.

> Remember that powerful and extremely user-friendly tools like the OLS
> take care of updating new terms for you fully automatically.

This phrase "powerful and extremely user-friendly tools" is a little
scary.  This implies having to learn, debug, etc., another piece of
software--one not necessarily under our control.  To be truly useful,
the spec really has to stand on its own (possibly referencing other
specs and data).

> I seem to read in your comments so far that there is a certain
> reluctance to the use of CV terms because this is new, and doesn't fit
> well with what you are good at right now. I would ask that you have a
> look at CV's on OLS (http://www.ebi.ac.uk/ols), and read the developer
> documentation on how to access the OLS web services using your favourite
> programming language. After playing with it a bit, you'll notice that
> incorporating CV's into the parsing is not that much work, yet yields
> very clear benefits.

I don't even have time to keep up with this list, and the benefits of
OLS are far from clear.

Mike