Re: [Psidev-ms-dev] mzML 0.99.0 comments

Hi Matt,

I can only speculate on the history of PSI CV as a subset of OBO; my guess
is they just wanted to keep it simple, as it was never intended to provide
the kind of granularity we need for fully automated semantic validation.

So, I disagree with your point that the CV is nearly there as an XSD
replacement.  It doesn't seem to have, for example, any means of saying
whether an element or attribute is required or not, or how many times it can
occur, and so on.
That's why that whole crazy xsd-like infrastructure that the java validator
uses was built up (the ms-mapping.xml schema file is attached, for those who
don't want to dig for it), and even that I have already shown to be
inadequate.  I don't want to see us follow previous groups down that rabbit
hole.

I also think that in practice nobody is going to be all that interested in
messing with the CV beyond adding the occasional machine model etc.  I think
a one time determination of the XSD will prove quite durable, and it's
already been largely done between the existing xsd and ms-mapping.xml.

You're right, for the applications I'm personally looking at right now I
think the CV isn't very important.  But your use case of vendor DLLs using
CV to disambiguate their APIs is a perfect example of how CV can improve
things.  I support its development and I think mzML should play well with
it.  Even though the existence of a system that would actually do anything
with the CV info in an mzML file is currently theoretical, it's the right
direction to be heading in and it's worth caring about and doing it right.

- Brian

-----Original Message-----
From: psi...@li...
[mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Tuesday, October 16, 2007 12:20 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments

Brian Pratt wrote:
> (First of all, thanks to Frank for shedding more light on the topic -
> heat, we have already!)
>
>   
Heat and light are just different wavelengths on the same spectrum. ;)

> Matt,
>
> You're right about OBO not limiting itself to is_a and part_of, but it
> appears that PSI has explicitly chosen to do so.  I doubt we have the
> political heft to change that now, or that we should want to do so.
> Further contortions to turn CV into something to rival the readily
> available power of XSD are misguided, in my opinion.
>   
If what you say is true, I at least want to see some rationale of why 
PSI would explicitly limit their CVs to 'is_a' and 'part_of' 
relationships.  I agree that contorting a CV to make it work as an XSD 
is misguided, but it's already been done to a great extent and I just 
want to go that little bit further to finish it.  I was suggesting that 
we should leverage the validation power of XSD by autogenerating an XSD 
from a properly done (contorted!) CV, where maintaining the CV is 
preferable to the XSD primarily because OBO CVs are ubiquitous in the 
life sciences while XSDs are not (AFAIK).  Also, it means only having to 
maintain the CV instead of maintaining both the CV and the XSD 
(autogenerating the CV from the XSD is conceivable, but pointless 
because by then you are putting new accession numbers straight into the 
XSD along with all the baggage that needs to get passed to the CV but 
isn't really important to the XSD).
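
As a rough illustration of the autogeneration idea above, here is a minimal
Python sketch that turns the values of one CV category into an XSD
enumeration.  The in-memory CV_TERMS table, the function names, and the
generated type name are assumptions made for this example; only the two
accessions and their names come from this thread, and no real OBO parser or
mzML tooling is implied.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical in-memory CV: accession -> (name, category it is a value of).
# A real tool would read these from the OBO file instead.
CV_TERMS = {
    "MS:1000031": ("model by vendor", None),
    "MS:1000173": ("MAT900XP", "MS:1000031"),
}


def enumeration_for(category, terms):
    """Build an xs:simpleType that enumerates the accessions in `category`."""
    ET.register_namespace("xs", XS)
    type_name = terms[category][0].replace(" ", "_") + "_accession"
    simple = ET.Element(f"{{{XS}}}simpleType", {"name": type_name})
    restriction = ET.SubElement(
        simple, f"{{{XS}}}restriction", {"base": "xs:string"}
    )
    for accession, (_name, parent) in sorted(terms.items()):
        if parent == category:
            ET.SubElement(
                restriction, f"{{{XS}}}enumeration", {"value": accession}
            )
    return simple


def render(category, terms=CV_TERMS):
    """Serialize the generated type as an XML string."""
    return ET.tostring(enumeration_for(category, terms), encoding="unicode")


if __name__ == "__main__":
    print(render("MS:1000031"))
```

Only the CV table is edited by hand in this arrangement; the restriction is
regenerated whenever a term is added.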

> Frankly it seems to me that the CV doesn't really need to be all that
> logically consistent: in its current bogus state it doesn't seem to have
> bothered anyone, including the official validator.  PSI clearly never
> meant for CV to do things like datatyping and range limiting so we should stop
> pushing on that rope and just allow CV to play its proper role in
> disambiguating the terms we use in the XSD, by use of accession numbers in
> the XSD.  
>   
I think you say this because, as things currently are, you don't plan to 
care much about the CV and frankly neither do I.  And there is a 
legitimate reason to not care about a CV if it doesn't specify enough 
semantics of the format to truly and unambiguously define the 
terms.  The data type of a term is as much a part of its definition as 
the English description of it!  Imagine different users of the CV trying 
to pass around instances of terms using different data types for the 
different instances!  I don't think that constitutes an unambiguous 
controlled vocabulary. :)

> The thing to do now is to transfer most of the intelligence in the
> ms-mapping.xml schema file (for it is indeed a schema, albeit written in a
> nonstandard format) to the XSD file then add the proper datatyping and
> range checking.  I was happy to see that this second schema contains the
> work I thought we were going to have to generate from the CV itself,
> although I was also somewhat surprised to learn of the existence of such a
> key artifact this late in the discussion.  Or maybe I just missed it
> somehow.
>
> As I've said before we should be braver than we have been so far.  The
> refusal to put useful content in the XSD file simply for fear of being
> wrong about it is just deplorable and doesn't serve the purposes of the
> community.  And I'm appalled at the disingenuousness of claiming a "stable
> schema" when many key parts of the spec are in fact expressed in a schema
> (ms-mapping.xml) which is explicitly unstable.
>   
I agree wholeheartedly.  We only disagree about maintaining the fully 
specified XSD.  I think it should be autogenerated from a fixed CV and a 
stable template schema, whereas you think it should be hand rolled.  Let 
me get you to clear something up though: do you want there to be a 
single, ever-changing schema, or would you also accept a basic stable 
schema (without CV-related restrictions) which can be derived from in 
order to create the fully specified schema with the ever-changing 
restrictions?  In the latter case, we can have a schema that is stable 
but serves only for syntactic validation, and also a schema that can be 
used for full semantic validation; which schema a program uses is up to 
the program.

> The charge has been leveled on this list that (paraphrasing here) some old
> dogs are resisting learning new tricks when it comes to the use of CV.
> That's always something to be mindful of, but after careful consideration
> I really just don't see the advantage of a CV-centric approach, when all
> the added complexity and reinvention still leaves us well short of where
> proper use of XSD would get us.  Fully realized XSD that references CV to
> define its terms seems like the obvious choice for a system that wants to
> gain widespread and rapid adoption.
>   
Speaking of learning new tricks, when will the vendors' raw file reading 
libraries return CV accession numbers to describe terms instead of 
ambiguous strings?  That would be nice.  But if that never happens, each 
conversion program has to maintain its own vendor-to-CV mapping.  And if 
a program wants to read both vendor-proprietary formats and the XML 
formats, your mapping problems have become nightmares.
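
To make the mapping burden concrete, here is a minimal Python sketch of the
kind of vendor-to-CV table each conversion program ends up carrying.  The
vendor spellings, the normalization rule, and the function names are
hypothetical; only the accession MS:1000173 and the name "MAT900XP" come
from this thread.

```python
# Hypothetical table keyed by a normalized vendor string.
# Value is (CV accession, CV term name).
VENDOR_TO_CV = {
    "mat900xp": ("MS:1000173", "MAT900XP"),
}


def normalize(vendor_string):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in vendor_string.lower() if ch.isalnum())


def to_cv(vendor_string):
    """Return (accession, name) for a vendor string, or raise LookupError."""
    try:
        return VENDOR_TO_CV[normalize(vendor_string)]
    except KeyError:
        raise LookupError(
            f"no CV mapping for vendor string {vendor_string!r}"
        ) from None


if __name__ == "__main__":
    print(to_cv("MAT 900 XP"))
```

Every program that reads vendor files needs its own copy of such a table,
which is exactly the duplication that accession-returning vendor libraries
would remove.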

-Matt

> - Brian
>
> -----Original Message-----
> From: psi...@li...
> [mailto:psi...@li...] On Behalf Of Matthew Chambers
> Sent: Tuesday, October 16, 2007 8:27 AM
> To: Mass spectrometry standard development
> Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments
>
> Hi Frank, I read the Guidelines you linked to and also the paper 
> describing the Relation Ontology (http://genomebiology.com/2005/6/5/R46) 
> which is referenced from the Guidelines. The Relation Ontology does not 
> in any way suggest that reliable OBO CVs should be limited to IS_A and 
> PART_OF relationships! Rather, it does a good job of defining when IS_A 
> and PART_OF should be used and what they really mean. I think if we 
> looked closely we could find quite a few cases in the CV where the use 
> of IS_A and PART_OF is bogus according to the Relation Ontology 
> definition, especially with regard to values being indistinct from 
> categories.
>
> Therefore, I take issue with the following text from the Guidelines 
> which has no corresponding rationale and which is currently biting us in 
> the arse:
>
> 11. Relations between RU's
> As the PSI CV will be developed under the OBO umbrella [3], the 
> relations created between terms MUST ascribe to the definitions and 
> formal requirements provided in the OBO Relations Ontology (RO) paper 
> [7], as the relations 'is_a' and 'part_of'.
>
> It is not clear whether the Relation Ontology recommends or discourages 
> using OBO to typedef new relationship types into existence (my proposed 
> 'value of'), but that won't be necessary. I think we can accomplish the 
> same effect with the existing relationship, 'instance_of', which IS part 
> of the Relation Ontology. In fact, 'instance_of' is a primitive relation 
> in the Relation Ontology, whereas 'is_a' is not. Here is the Relation 
> Ontology definition for 'instance_of':
>
> p instance_of P - a primitive relation between a process instance and a 
> class which it instantiates holding independently of time
>
> That sounds like a pretty good way to distinguish between values 
> (instances) and categories (classes) to me! Further, the instance_of 
> relationship can be used in addition to the current part_of and is_a 
> relationships and it will serve to disambiguate a branch of the CV where 
> the actual category that a value belongs to is an ancestor instead of a 
> direct parent. For instance:
> MS:1000173 "MAT900XP"
> is a MS:1000493 "Finnigan MAT"
> part of MS:1000483 "Thermo Fisher Scientific"
> is a MS:1000031 "model by vendor"
> part of MS:1000463 "instrument description"
> part of MS:0000000 "MZ controlled vocabularies"
> What category does the controlled value "MAT900XP" belong to, i.e. if we 
> used cvParam method B, would it look like:
> <cvParam cvLabel="MS" categoryName="Finnigan MAT" 
> categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/>
> Or would it look like:
> <cvParam cvLabel="MS" categoryName="model by vendor" 
> categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/>
>
> Of course I think it should be the latter, but how would you derive that 
> from the CV? You can't, unless you add a new relationship or convention, 
> so I suggest:
> MS:1000173 "MAT900XP"
> instance of MS:1000031 "model by vendor"
> is a MS:1000493 "Finnigan MAT"
> part of MS:1000483 "Thermo Fisher Scientific"
> is a MS:1000031 "model by vendor"
> part of MS:1000463 "instrument description"
> part of MS:0000000 "MZ controlled vocabularies"
> It would also be good to get rid of the MS:1000483->MS:1000031 
> relationship at that point because "Thermo Fisher Scientific" is NOT an 
> instrument model.
>
> I have to disagree with your assertion that OBO does not allow a CV to 
> model datatypes and cardinality. I think the trailing modifiers (which 
> may have been added since you last looked at the OBO language spec) 
> would serve to model those properties quite nicely.
>
> -Matt
>   
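
The category question in the quoted message above can be sketched in a few
lines of Python.  This is only an illustration under stated assumptions: the
accessions, names, and parent links are copied from the quoted hierarchy, the
'instance_of' link is the one proposed there (it is not claimed to be in the
released CV), and the table and function names are made up for the example.

```python
NAMES = {
    "MS:1000173": "MAT900XP",
    "MS:1000493": "Finnigan MAT",
    "MS:1000483": "Thermo Fisher Scientific",
    "MS:1000031": "model by vendor",
    "MS:1000463": "instrument description",
    "MS:0000000": "MZ controlled vocabularies",
}

# Direct is_a / part_of parent of each term, as listed in the quoted message.
PARENT = {
    "MS:1000173": "MS:1000493",
    "MS:1000493": "MS:1000483",
    "MS:1000483": "MS:1000031",
    "MS:1000031": "MS:1000463",
    "MS:1000463": "MS:0000000",
}

# The proposed extra relationship: value -> category it is an instance of.
INSTANCE_OF = {"MS:1000173": "MS:1000031"}


def category_of(accession, instance_of=INSTANCE_OF):
    """Prefer an explicit instance_of link; otherwise fall back to the parent."""
    category = instance_of.get(accession, PARENT.get(accession))
    if category is None:
        raise LookupError(f"no category known for {accession}")
    return category, NAMES[category]


if __name__ == "__main__":
    print(category_of("MS:1000173"))                  # with instance_of
    print(category_of("MS:1000173", instance_of={}))  # direct parent only
```

Without the extra link the only mechanical answer is the direct parent,
"Finnigan MAT"; with it, the lookup yields "model by vendor" without any
guessing about how far up the tree to walk.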
