From: Brian P. <bri...@in...> - 2007-10-18 18:01:26
Hey All,

It's true that in practice most day-to-day consumers of mzML files will not bother with validation. The value of the detailed validation capability of a fully realized xsd is largely seen during the *development* of the readers and writers, not in their day-to-day operation. (Of course it's also seen in their day-to-day operation, because they work properly, having been written properly.) Ideally we would test every conceivable combination of writer and reader, but since we can't expect to do that (we can't start until everybody finishes, and imagine the back and forth!), we instead have to make it possible for the writers to readily check their work in syntactic and semantic detail, and for the readers to not have to make a lot of guesses about what they're likely to see. The fully realized xsd helps on both counts: ready validation for the writers, and a clear spec for the readers. It also opens the possibility of automatically generated code as a jumping-off point for the programmers of both readers and writers, which can reduce defect rates.

Matt asks if I envision one schema or two. We need to go out of the gate with one schema that expresses everything we know we want to say today (including any intelligence in the current mapping file, plus more detail). The anticipated need for vendors to extend the schema independently of the official schema release cycle (our "stability" goal) is then handled by schemas the vendors create, which inherit from and extend the standard schema (see the sketch below).

The proposed idea of a second schema from the get-go, just to layer on the CV mappings, is unwarranted complexity. These belong in the core xsd as (optional) attributes of the various elements; when that one-time OBI event comes, we'll just update the core xsd to add attributes that indicate relationships from elements to the new CV as well. It's far enough away not to threaten the appearance of stability in the spec, and in any case it won't break backward compatibility.
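For illustration, a minimal sketch of the vendor-extension idea, assuming hypothetical file names, namespaces, and type names (none of these are from the actual mzML schema): the vendor schema imports the official one unchanged and derives its own types from it.

<!-- acme-mzML-ext.xsd: a hypothetical vendor schema layered on the
     official mzML schema; all names and namespaces are illustrative. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:mz="http://psi.hupo.org/schema/mzML"
           targetNamespace="http://acme.example.com/schema/mzML-ext"
           elementFormDefault="qualified">

  <!-- Pull in the official schema unchanged, so the vendor can follow
       its own release cycle without touching the standard. -->
  <xs:import namespace="http://psi.hupo.org/schema/mzML"
             schemaLocation="mzML.xsd"/>

  <!-- Derive from a core type by extension, adding vendor-only detail. -->
  <xs:complexType name="AcmeSpectrumType">
    <xs:complexContent>
      <xs:extension base="mz:SpectrumType">
        <xs:attribute name="acmeDetectorGain" type="xs:double"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>

Because the vendor's additions live in the vendor's own namespace and derive by extension, the standard schema never has to change on a vendor's timetable, which is the stability property described above.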
The important point about hard-coding rules vs. expressing relationships and constraints in the xsd is one of economies of scale. It was asked whether hard coding was any more work than getting the schema right: the answer is yes, as it has to be done repeatedly, once per validating reader implementation (not everyone uses Java, or is even allowed to use open-source code in their product). Why make everyone reinvent the wheel, and probably get it wrong, when we have a nice, standard, language-independent means of expressing those constraints?

It just comes down to KISS: Keep It Simple, Stupid! (not calling names here, that's just the acronym as I learned it). We're here to deal with MS raw data transfer, not to design new data format description languages. More than once on this list I've seen snarky asides about coders who aren't up to muscling through these proposed convolutions, but a truly competent coder is professionally lazy (managers prefer "elegant"). Moreover, a standards effort is supposed to consolidate the efforts of the community so its individuals can get on with their real work -- we shouldn't be blithely proposing things that create more individual work than they absolutely need to.

- Brian

-----Original Message-----
From: psi...@li... [mailto:psi...@li...] On Behalf Of Chris Taylor
Sent: Thursday, October 18, 2007 9:37 AM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments

Hiya.

Matthew Chambers wrote:
> I'm glad we're getting good participation and discussion of this issue now! Chris, your characterization is a reasonable one for the two-schema approach I described.
>
> To respond to your qualification of the current state of affairs, I'll quote something you said the other day:
>> Clearly we need the basic (and rilly rilly easy to do) syntactic validation provided by a fairly rich XML schema.
> This is not clear to me. I do not see a clear advantage to validating syntax and not validating semantics. In my experience, reading a file with invalid semantics is as likely to result in a parser error as reading a file with invalid syntax (although I admit that implementing error handling for semantic errors tends to be more intuitive).

The only thing I'd say here is that there is a minimum-effort option available for implementers who cannot, or choose not to, validate content -- i.e. the 'core' schema is there to allow syntactic validation only, and the extended schema you suggested would then allow the Brians and yourselves of this world to do more. Seems a neat solution. That said, I don't contest your assertion that the more thorough the validation, the more likely one is to catch the subtle errors as well as the gross ones.

>> But supporting the kinds of functionality discussed (which would mean the CV rapidly becoming a 'proper' ontology, which we don't have the person-hours to do right btw) is really just a nice-to-have at the moment. True semantic validation is just about feasible but _isn't_ practical imho.
> I think you misunderstood the functionality I was suggesting be added to the CV. I was not suggesting significant logic changes in the CV, only a simple instance_of relationship added to every controlled value to link it to its parent category: "LTQ" is a controlled value, and it should be an 'instance_of' an "instrument model", which is a controlled category. In my view, the distinction between controlled values and categories in the CV is crucial, and it doesn't come close to making the CV any more of a 'proper' ontology (i.e. one that machines can use to gain knowledge about the domain without human intervention). It would, however, mean that a machine could auto-generate a schema from the CV, which is what I was aiming for. :) I don't really agree with the idea that the PSI MS CV should be a filler which gets replaced by the OBI CV whenever it comes about, but if that's the consensus view then that would be reason enough to give up the idea of using the CV to auto-generate the schema.
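To make that concrete, here is roughly what such a generator might emit for one category, assuming a hypothetical type name and illustrative accessions: every controlled value carrying an instance_of link to "instrument model" becomes one enumeration value.

<!-- Hypothetical output of a CV-to-schema generator: one xs:enumeration
     per CV term that is an instance_of "instrument model"
     (the type name and accessions are illustrative). -->
<xs:simpleType name="InstrumentModelAccession">
  <xs:restriction base="xs:string">
    <xs:enumeration value="MS:1000447"/> <!-- LTQ -->
    <xs:enumeration value="MS:1000448"/> <!-- LTQ FT -->
    <!-- ... one value per instance of "instrument model" ... -->
  </xs:restriction>
</xs:simpleType>

Regenerating such a type whenever the CV changes is purely mechanical, which is the point of the instance_of links: the enumeration never has to be maintained by hand.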
Thing here is that I heard several people assert (not on here) that defining terminating endpoints is storing up trouble, and instances are therefore hostages to fortune; you'll just end up making a new class and deprecating the instance. Obviously there are clear endpoints (is there only one variant of an LTQ btw? is an LTQ-FT a child or a sib?) but there are also going to be mistakes made -- rope to hang ourselves (overly dramatic phrase, but nonetheless). Then there is the case where people _want_ to use a more generic parent (not sure how many there are in the CV tbh, as it is quite flat iirc, but still there are many ontologies in the world where the nodes are used as much as the leaves). A (simple-ish) example off the top of my head (not necessarily directly applicable, just for the principle) would be where someone has a machine not yet described and just wants to say something about it.

>> Certainly for all but the most dedicated coders it is a pipe dream. All that can realistically be hoped for at the moment is correct usage (i.e. checking in an application of some sort that the term is appropriate given its usage), for which this wattage of CV is just fine. This is what the MIers have done -- a Java app uses hard-coded rules to check usage (and in that simple scenario the intelligent use of class-superclass stuff can bring benefits).
> It seems here you DO suggest validating semantics, but instead of doing it with the CV/schema it must be implemented manually by hard-coding the rules into a user application. Right now, there is no way (short of parsing the ms-mapping file and adopting that format) to get that kind of validation without the hard-coding you mention. Brian and I both think that a proper specification should include a way to get this kind of validation without hard-coding the rules, even if applications choose not to use it.

I think in the absence of an ontology to afford this sort of functionality (and with one expected), hard coding is not an awful solution (the workload for your suggestion wouldn't be orders of magnitude different, would it, bearing in mind this is a temporary state of affairs and so not subject to years of maintenance?). The MI group certainly went this route straight off the bat... At the risk of becoming dull, I'd restate that this is why I like the separable schemata you suggested, as we get the best of both worlds, no?
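To sketch how the separable schemata could fit together (file and type names are illustrative, and the core cvParam type is assumed to carry only attributes): the semantic layer redefines a core type, narrowing its accession attribute to a CV-generated enumeration like the one sketched earlier, while the core schema itself stays untouched.

<!-- mzML-semantic.xsd: a hypothetical second schema for the two-schema
     idea; every name here is illustrative. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://psi.hupo.org/schema/mzML"
           targetNamespace="http://psi.hupo.org/schema/mzML">

  <!-- The enumerations auto-generated from the CV (see earlier sketch). -->
  <xs:include schemaLocation="cv-generated-types.xsd"/>

  <xs:redefine schemaLocation="mzML-core.xsd">
    <!-- The core schema leaves accession as a plain xs:string; this
         layer narrows it, per element context, to the terms the CV
         currently allows. -->
    <xs:complexType name="InstrumentCVParamType">
      <xs:complexContent>
        <xs:restriction base="InstrumentCVParamType">
          <xs:attribute name="accession" type="InstrumentModelAccession"
                        use="required"/>
        </xs:restriction>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>

Writers wanting "syntactic and semantic detail" would validate against mzML-semantic.xsd; readers content with well-formed structure would validate against mzML-core.xsd, which never changes when the CV does.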
>> But what they're not doing is something like (for MS now) "I have a Voyager, so why on earth do I have ion trap data -- sound the klaxon"; this can only come from something of the sophistication of OBI (or a _LOT_ of bespoke coding), which is in a flavour of OWL (a cruise liner to OBO's dinghy).
> It's true, AFAIK, that validating (for example) the value of the "mass analyzer" category based on the value provided for the "instrument model" category is not possible with the current CV/schema. It is not even possible after the extensions proposed by Brian or me. Such functionality would require a much more interconnected CV (and the XSD schema would be so confusing to maintain that it would almost certainly have to be auto-generated from the CV). I don't think anybody particularly expects this functionality either, so we needn't worry about it. :)

Well, I'm kind of hoping we will ultimately be able to get this from OBI, which is being built in a very thorough and extensible (in terms of the richness of relations between classes) manner.

Cheers, Chris.

> -Matt
>
> Chris Taylor wrote:
>> Hiya.
>>
>> So your solution can, if I understand correctly, be characterised as formalising the mapping-file info in an XSD that happens (for obvious reasons) to inherit from the main schema? If so, then as long as everyone likes it, I see that as a nice, neat, robust solution.
>>
>> Funnily enough, I was chatting to a fellow PSIer yesterday about the mapping file(s) (this is cross-WG policy stuff, you see) and enquired as to the current nature of the thing. I think if there is a clamour to formalise the map then hopefully there will be a response. To qualify the current state of affairs though, this was not meant to be a formal part of the standard -- more something akin to documentation (it didn't exist at all at one point -- bridging the gap was something done in the CV, which is not a great method for a number of reasons).
>>
>> Cheers, Chris.
>>
>> Matthew Chambers wrote:
>>> If the consensus is that the CV should be left simple like it is now, then I must agree with Brian. The current schema is incapable of doing real validation, and the ms-mapping file is worse than a fleshed-out CV or XSD (it's more confusing, it takes longer to maintain, and it's non-standard).
>>>
>>> I still want Brian to clarify if he wants a one-schema spec or a two-schema spec. I support the latter approach, where one schema is a stable, syntactical version and the other inherits from the first one and defines all the semantic restrictions as well. It would be up to implementors which schema to use for validation, and of course only the syntactical schema would be "stable", because the semantic restrictions in the second schema would change to match the CV whenever it was updated.
>>>
>>> -Matt

--
~~~~~~~~~~~~~~~~~~~~~~~~
chr...@eb...
http://mibbi.sf.net/
~~~~~~~~~~~~~~~~~~~~~~~~