From: Matthew C. <mat...@va...> - 2007-10-18 14:36:25
If the consensus is that the CV should be left simple like it is now, then I must agree with Brian. The current schema is incapable of doing real validation, and the ms-mapping file is worse than a fleshed-out CV or XSD (it's more confusing, it takes longer to maintain, and it's non-standard). I still want Brian to clarify whether he wants a one-schema spec or a two-schema spec. I support the latter approach, where one schema is a stable, syntactical version and the other inherits from the first and defines all the semantic restrictions as well (a rough sketch of what that could look like appears below). It would be up to implementors which schema to use for validation, and of course only the syntactical schema would be "stable", because the semantic restrictions in the second schema would change to match the CV whenever the CV was updated.

-Matt

Brian Pratt wrote:
> Hi Chris,
>
> Most helpful to have some more background, thanks. Especially in light of the idea that the PSI CVs as they stand are fillers to use while OBI gets done, your term "bad bundling" is appropriate.
>
> If we go with a fully realized xsd wherein each element definition has a CV reference, then when OBI comes to fruition we just tweak the xsd. It's a small change to the "foo" element definition, which is already declared to have the meaning found at "MS:12345", to declare it as also having the meaning found at "OB:54321". The point is that it's still a foo element, so all existing mzML files remain valid and all those mzML parsers out there don't have to be changed. In the currently contemplated mzML you'd have to go through all parsers in existence and update them to understand that <cvParam accession="OB:54321"/> is the same as <cvParam accession="MS:12345"/>, and of course older systems just won't understand it at all. Bad bundling indeed! The xsd approach is in fact the more stable one.
>
> It's odd, to say the least, to have the "mortar" of this project (the mapping file) not be part of the official standard. It's the only artifact we have at the moment, as far as I can see, that attempts to define the detailed structure of an mzML file. It's the de facto standard, and "de facto" has been identified as a Bad Thing on this list.
>
> So, to recap this and previous posts, the current proposal employs an unnecessarily elaborate, nonstandard, inflexible, sneaky, and inadequate way to couple mzML to the CV. This is readily corrected by moving the mapping file content into the xsd which actually forms the standard, then adding detail so that, for example, it is clear that a scan window must have both a low m/z and a high m/z but that dwell time is optional.
>
> Using the CV to define terms is important, but mostly what both vendors and users really want from a data format standard is to not be forever tweaking readers and writers to adjust to "valid" but unexpected usages. This is only achieved by the standard being extremely clear on what "valid" means, something the current proposal largely flinches from doing. As currently proposed, mzML feels like a big step backwards.
>
> Brian
>
> -----Original Message-----
> From: psi...@li... [mailto:psi...@li...] On Behalf Of Chris Taylor
> Sent: Wednesday, October 17, 2007 2:27 AM
> To: Mass spectrometry standard development
> Cc: Daniel Schober
> Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments
>
> Hiya.
>
> Just a few points:
>
> The CV is deliberately as simple as possible -- just the barebones -- enough to find the term you need.
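A rough sketch of the two-schema idea proposed above, with all file names, type names, and content invented purely for illustration (the accessions are just the placeholders already used in this thread). The first schema is the stable syntactic layer; the second uses xs:redefine, one way XSD 1.0 lets a schema "inherit" and restrict another, so that only the accession list churns when the CV changes:

  <!-- mzML-syntax.xsd (illustrative name): stable, syntax-only layer -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:complexType name="CVParamType">
      <xs:attribute name="accession" type="xs:string" use="required"/>
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="value" type="xs:string"/>
    </xs:complexType>
  </xs:schema>

  <!-- mzML-semantic.xsd (illustrative name): redefines the syntactic layer,
       narrowing the accession attribute to the terms the CV currently allows;
       this is the file that would be regenerated whenever the CV is updated -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:redefine schemaLocation="mzML-syntax.xsd">
      <xs:complexType name="CVParamType">
        <xs:complexContent>
          <xs:restriction base="CVParamType">
            <xs:attribute name="accession" use="required">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="MS:12345"/>
                  <xs:enumeration value="OB:54321"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
          </xs:restriction>
        </xs:complexContent>
      </xs:complexType>
    </xs:redefine>
  </xs:schema>

Implementors could then point a stock XSD validator at either file; only the second one would change when the CV does.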
> In part this is a pragmatic outcome from the lack of person-hours, but not completely; it is also to avoid the complications of using the more complex relationships that are available (roles, for example, the benefit of which in this setting is unclear) and some of the less standard (=weird) ones.
>
> The CV and the schema should be separable entities imho. Mostly this is to allow the use of other CVs/ontologies as they become available. If either of these products depends too much on the other, the result of removing that other would be crippling; this is 'bad' bundling, basically. Because they are separate, the mapping file for the use of that particular CV with the schema is provided. This is a convenience thing for developers, basically, which they would be able to figure out for themselves given a week, and is no part of any standard. If you recall a while ago, the MGED 'ontology' (MO, which is really a CV, hence the quotes) got a good kicking in the literature for being directly structured around a model/schema (MAGE); there were many criticisms voiced there (not all valid, especially the ones about process, but nonetheless -- who critiques the critics, eh).
>
> On 'other' term sources, consider OBI (the successor to MO, inter alia), which is destined ultimately to replace the CVs generated by PSI and MGED with a proper ontology supporting all sorts of nice things. The OBI dev calls, especially the instrument track, would be a _great_ place to redirect this enthusiasm to ensure that all is well. Really the PSI CVs as they stand are fillers to use while that big job gets done. Please, I implore you, if you really do have major issues/needs, go to a few of the OBI calls. For instruments the guy to mail is Daniel Schober at EBI (CCed on here); incidentally he also handles the needs of the metabolomics community, who have heee-uge overlaps with PSI (on MS for example) and who will most likely use mzML for their MS work also (I co-chair their formats WG and have been heavily promoting PSI products to them with an eye on the cross-domain integrative thing). Ah, synergy.
>
> Clearly we need the basic (and rilly rilly easy to do) syntactic validation provided by a fairly rich XML schema. But supporting the kinds of functionality discussed (which would mean the CV rapidly becoming a 'proper' ontology, which we don't have the person-hours to do right btw) is really just a nice-to-have at the moment. True semantic validation is just about feasible but _isn't_ practical imho. Certainly for all but the most dedicated coders it is a pipe dream. All that can realistically be hoped for at the moment is correct usage (i.e. checking in an application of some sort that the term is appropriate given its usage), for which this wattage of CV is just fine. This is what the MIers have done -- a Java app uses hard-coded rules to check usage (and in that simple scenario the intelligent use of class-superclass stuff can bring benefits). But what they're not doing is something like (for MS now) "I have a Voyager, so why on earth do I have ion trap data -- sound the klaxon"; this can only come from something of the sophistication of OBI (or a _LOT_ of bespoke coding), which is in a flavour of OWL (a cruise liner to OBO's dinghy).
>
> Finally, again on where to draw the separating line: the more detail in the schema, the more labile that schema. So the schema should be as stable as possible (tend towards simpler).
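For concreteness, the sort of detail being weighed here -- Brian's scan window example from above -- would look something like this if it were expressed directly in XSD (element and attribute names are illustrative, not taken from the draft schema):

  <xs:element name="scanWindow">
    <xs:complexType>
      <!-- both m/z limits are mandatory; dwell time may be omitted -->
      <xs:attribute name="lowMz"     type="xs:double" use="required"/>
      <xs:attribute name="highMz"    type="xs:double" use="required"/>
      <xs:attribute name="dwellTime" type="xs:double" use="optional"/>
    </xs:complexType>
  </xs:element>

The open question in this thread is whether that detail belongs in the schema itself (Brian's position) or outside it, in the mapping file, so the schema can stay small and stable (Chris's position).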
> That schema should also remain as simple to dumb-validate as possible (so someone with barely the ability to run a simple validation check can wheel out a standard XSD tool and be done -- again tend towards simpler). The rest of the ~needed detail has then to be elsewhere in that scenario; in the CV (but that also has limits as discussed above) and the mapping file (the mortar between the bricks). The point is that although that makes work for those who really want to go for it on validation (to the point of reasoning in some sense), those developing simpler implementations will be able to keep things simple (e.g. person X uses a simple library to check for well-formedness and validity against the XSD, cares not-a-whole-hell-of-a-lot about the CV terms used as they know that most came direct from the instrument somehow with no user intervention, and just wants a coherent file with some metadata around the data to put in a database, which is where the CV matters most -- for retrieval). To truly go up a level on validation (excepting the halfway house of stating which terms [from a _particular_ source] can go where) is unrealistic and currently the benefits are minimal I would say (compare the effort of implementing to the benefit of the 0.1% of files in which you catch an error by that route, or the frequency of searches based on proteins/peptides, or on atomic terms (possibly AND/OR-ed), to that of searches truly exploiting the power of ontologies).
>
> Not that I'm against powerful ontology-based queries supported by systems that reason like a herd of ancient g(r)eeks; it'll truly rock when it comes and will be key to the provision of good integrated (i.e. cross-domain) resources down the line. But the time is not now -- we need OBI first. To forcibly mature the MS CV to support such functionality is a waste of effort better spent in making OBI all it can be.
>
> WHY can I not write a short email (that was rhetorical...)
>
> Cheers, Chris.
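(For reference, the dumb validation Chris describes is just a stock schema check; with libxml2's xmllint, for example, it is a one-liner, with file names here purely illustrative:

  xmllint --noout --schema mzML.xsd run1.mzML

Anything beyond that -- checking which CV terms may appear where -- is the part that currently lives in the mapping file rather than in the XSD.)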