From: Chris T. <chr...@eb...> - 2007-10-17 09:27:09
Hiya. Just a few points.

The CV is deliberately as simple as possible -- just the bare bones, enough to find the term you need. In part this is a pragmatic consequence of the lack of person-hours, but not entirely; it also avoids the complications of using the more complex relationships that are available (roles, for example, whose benefit in this setting is unclear) and some of the less standard (= weird) ones.

The CV and the schema should be separable entities, imho. Mostly this is to allow the use of other CVs/ontologies as they become available. If either of these products depends too heavily on the other, removing that other would be crippling; this is 'bad' bundling, basically. Because they are separate, a mapping file for using that particular CV with the schema is provided. This is a convenience for developers, basically -- something they could figure out for themselves given a week -- and is no part of any standard. If you recall, a while ago the MGED 'ontology' (MO, which is really a CV, hence the quotes) got a good kicking in the literature for being directly structured around a model/schema (MAGE); many criticisms were voiced there (not all valid, especially the ones about process, but nonetheless -- who critiques the critics, eh).

On 'other' term sources, consider OBI (the successor to MO, inter alia), which is destined ultimately to replace the CVs generated by PSI and MGED with a proper ontology supporting all sorts of nice things. The OBI dev calls, especially the instrument track, would be a _great_ place to redirect this enthusiasm to ensure that all is well. Really, the PSI CVs as they stand are fillers to use while that big job gets done. Please, I implore you: if you really do have major issues/needs, go to a few of the OBI calls. For instruments the guy to mail is Daniel Schober at EBI (CCed here); incidentally he also handles the needs of the metabolomics community, who have huge overlaps with PSI (on MS, for example) and who will most likely use mzML for their MS work too (I co-chair their formats WG and have been heavily promoting PSI products to them with an eye on cross-domain integration). Ah, synergy.

Clearly we need the basic (and really easy to do) syntactic validation provided by a fairly rich XML schema. But supporting the kinds of functionality discussed -- which would mean the CV rapidly becoming a 'proper' ontology, which we don't have the person-hours to do right, btw -- is really just a nice-to-have at the moment. True semantic validation is just about feasible but _isn't_ practical, imho; certainly for all but the most dedicated coders it is a pipe dream. All that can realistically be hoped for right now is correct usage (i.e. checking, in an application of some sort, that a term is appropriate where it is used), for which this wattage of CV is just fine. This is what the MIers have done: a Java app uses hard-coded rules to check usage (and in that simple scenario the intelligent use of class-superclass structure can bring benefits). What they are not doing is something like (for MS now) "I have a Voyager, so why on earth do I have ion trap data -- sound the klaxon"; that can only come from something of the sophistication of OBI (or a _LOT_ of bespoke coding), which is in a flavour of OWL (a cruise liner to OBO's dinghy). A rough sketch of the distinction I'm drawing follows below.
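To make that concrete, here is a minimal sketch of the two levels in Python with lxml: plain XSD validation, plus a hard-coded usage check in the spirit of the MI validator. The file names, namespace, 'mapping rule' and accession below are illustrative stand-ins, not the real mzML mapping file or the real PSI-MS CV; the point is only how little machinery the 'dumb' end of the spectrum actually needs.

# Sketch only: "dumb" syntactic validation plus a hard-coded CV-usage check.
# File names, the namespace, the rule and the accession are illustrative --
# this is NOT the real mzML mapping file or the real PSI-MS CV.
from lxml import etree

# 1. Syntactic validation: well-formedness plus validity against the XSD.
schema = etree.XMLSchema(etree.parse("mzML.xsd"))   # assumed local copy of the schema
doc = etree.parse("example.mzML")                    # parsing fails loudly if not well-formed
if not schema.validate(doc):
    for error in schema.error_log:
        print("XSD:", error.message)

# 2. The 'mortar': a hand-rolled stand-in for the mapping file. Each rule says
#    which CV terms (by accession) are expected at a given document location.
NS = {"mz": "http://psi.hupo.org/ms/mzml"}           # assumed namespace
MAPPING_RULES = {
    ".//mz:instrumentConfiguration/mz:cvParam": {"MS:1000031"},  # 'instrument model', illustrative
}

# 3. Usage check, MI-validator style: hard-coded rules, no reasoning. A smarter
#    version would accept any child term of MS:1000031 rather than the literal
#    accession -- which is where the class-superclass structure starts to pay off.
for xpath, allowed in MAPPING_RULES.items():
    for cv_param in doc.findall(xpath, namespaces=NS):
        accession = cv_param.get("accession")
        if accession not in allowed:
            print(f"CV usage: {accession} not expected at {xpath}")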
Finally, again on where to draw the separating line: the more detail in the schema, the more labile that schema. So the schema should be as stable as possible (tend towards simpler). It should also remain as simple to dumb-validate as possible, so that someone with barely the ability to run a validation check can wheel out a standard XSD tool and be done -- again, tend towards simpler. The rest of the needed detail then has to live elsewhere: in the CV (which also has its limits, as discussed above) and in the mapping file (the mortar between the bricks).

The point is that although this makes work for those who really want to go for it on validation (to the point of reasoning, in some sense), those developing simpler implementations can keep things simple. For example, person X uses a simple library to check for well-formedness and validity against the XSD, cares not a whole hell of a lot about the CV terms used -- knowing that most came straight from the instrument with no user intervention -- and just wants a coherent file with some metadata around the data to put in a database, which is where the CV matters most: for retrieval.

To truly go up a level on validation (beyond the halfway house of stating which terms, from a _particular_ source, can go where) is unrealistic, and currently the benefits are minimal, I would say. Compare the effort of implementing it to the benefit of the 0.1% of files in which you catch an error by that route; or compare the frequency of searches based on proteins/peptides, or on atomic terms (possibly AND/OR-ed), to that of searches truly exploiting the power of ontologies. Not that I'm against powerful ontology-based queries backed by systems that reason like a herd of ancient g(r)eeks; it'll truly rock when it comes, and it will be key to the provision of good integrated (i.e. cross-domain) resources down the line. But the time is not now -- we need OBI first. Forcibly maturing the MS CV to support such functionality is a waste of effort better spent making OBI all it can be.

WHY can I not write a short email (that was rhetorical...)

Cheers, Chris.

~~~~~~~~~~~~~~~~~~~~~~~~
chr...@eb...
http://mibbi.sf.net/
~~~~~~~~~~~~~~~~~~~~~~~~