From: Eric D. <ede...@sy...> - 2007-10-11 08:36:29

Hi everyone,

I've taken some time to think carefully about what Brian says and here is my attempt at focusing the discussion:

- First: yes, there are several problems in the CV is_a and part_of. We agreed at the CV meeting that we will tackle this to try to make it uniform.

- Here are two rules within the CV that may hold true and are worth documenting:
  - if a term's direct parent is a "xxxx attribute", then it must furnish a value within the cvParam element, else it cannot
  - if a term has children, then it cannot be specified as a cvParam (except as a category/parent in option C)
  Is this correct? Counterexamples?

- Regarding the reflectron example, I think the CV should look like this, even though it does not quite now:
  - "reflectron on" is_a "reflectron state" is_a "analyzer attribute"
  - "reflectron off" is_a "reflectron state" is_a "analyzer attribute"

- Thus cvParams would be used like this:
  Option A:
  <cvParam cvLabel="MS" accession="MS:1000105" name="reflectron off" value="" />
  Option C+:
  <cvParam name="reflectron off" cvLabel="MS" accession="MS:1000105" parentAccession="MS:1000021"/>

- Brian proposed:
  <reflectronState accession="MS:1000021" off/>
  This does not seem like well-formed XML to me. Or is it?? I assume he meant this:
  <reflectronState accession="MS:1000105" name="reflectron off"/>

- If so, the real dilemma is between:
  1) <cvParam name="reflectron off" cvLabel="MS" accession="MS:1000105" parentAccession="MS:1000021"/>
  2) <reflectronState accession="MS:1000105" name="reflectron off"/>
  Brian, would you agree that these are the two sides? They both seem fully complete to me. If I've got it wrong, then the rest would seem premature, but I'll press on believing I've got it right.
  Because by creating an element in the schema <reflectronState>, this automatically takes the place of { cvLabel="MS" parentAccession="MS:1000021" }

- So for option 1, we're essentially at that right now (we would need to adjust option A to option 1, but it's close)

- For option 2, we would need to find all the CV terms that we think deserve to be promoted to element status and add them to schema. I don't know how many there are, but there would be lots. The schema would increase in size many fold.

- A further complication is where does this element go? Does it go in the instrument description section? Or could the reflectron be turned on and off for different spectra and thus go in the scan element? I have no idea. If we put it in the schema, we've got to get it right now. If we don't, then the schema will have to be updated to fix it.

- The current state is a flexible (some might say lazy or dangerous) way. We acknowledge that we don't have all the CV terms and we're not exactly sure where some will be used, so we leave it open. No example instance document yet has reflectron state information in it. I'd be delighted if someone could provide one.

- So what we can do today is provide a term "reflectron off" that almost no one really cares much about and let someone out there who does care write some mzML with this annotation in it. When this document is checked against the semantic validator, the validator will complain that you've used a child term of "reflectron state" in a place where it's not allowed. But the writer insists that it should be allowed there. The PSI-MS WG is persuaded it should be. So we update the semantic validator and the CV perhaps and these new documents are written out with reflectron state information and validate. Most software doesn't care a hoot about the reflectron state and that cvParam can be safely ignored or dumbly displayed to the user in case the user cares.
  All the above can happen without a rev of the schema.

- But that's the same thing as updating the schema except in name, you say. Perhaps.

- So, I hope I have helped this discussion rather than confused it. Clearly the current schema has a big element of flexibility/power/danger in it. Some would believe that this will allow us to improve the format in minor ways without schema revision and provide a way for producers to express their data with annotations that make sense to them. The only thing standing between flexibility and utter mayhem is the semantic validator. Perhaps in some sense, this is half XML schema and half pseudo RDF. Can we pull it off or are we lunatics for trying it?

- I am clearly biased here, but I try to keep an open mind.

- To my mind, the most important unconsidered problem that Brian brings up is the data type problem. Consider the example:
  <cvParam cvLabel="MS" accession="MS:1000285" name="total ion current" value="1.66755e+007" parentAccession="MS:1000499"/>
  Brian's proposed alternative is (I hope I'm right):
  <spectrumAttribute accession="MS:1000285" name="total ion current" value="1.66755e+007">
  In principle, this second way would allow me to specify a data type and let XML validators enforce it. However, this may not quite work either, because what if I want:
  <spectrumAttribute accession="MS:1009999" name="spectrum subjective quality" value="10">
  to be allowed? All spectrumAttributes would have to have the same data type for that to work. The example is pretty contrived. Unless every single attribute got its own element like:
  <totalIonCurrent value="1.66755e+007">

- The latter here is fully specified and concrete. But if we get anything wrong or want to add anything, then we have to release a new version of the schema. One possible option is to fully specify in schema everything we can think of now, and then for new or later things use cvParam.
  If we do that, then we're still needing to apply semantic validation so we've only half-solved the problem. Finally, a dangerous door may be opening. If we want to expand this duality, we have a possible "more than one way to do it" problem. Some might choose to use the cvParam, and some the schema element. The only thing that could prevent that is the semantic validator again.

- I wonder whether we can add a nice method of datatype validation to option 1 above? Any ideas?

I had hoped to focus the discussion, but rereading it, all I did was shake the already-opened can of worms.

Let the commentary ensue.

Regards,
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt
Sent: Monday, October 08, 2007 11:38 AM
To: 'Mass spectrometry standard development'
Subject: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of errors?)

Eh, it's even more broken than I thought. I've amended my amendments inline below, new changes in double parentheses.

After a day or so of messing with this, it is now:

MANIFESTO TIME!

RESOLVED: The mzML specification process should be schema-centric, and the CV should be generated from the schema (should be a fairly simple matter of XSLT, since XSD is itself XML).

REASON 1: THE CV-CENTRIC APPROACH IS ERROR PRONE.
The kinds of inheritance errors shown below are, if not actually impossible, much harder to make in the context of a W3C schema when using readily available software tools to create and maintain the schema.

REASON 2: OBO/CV IS AN INSUFFICIENT TOOL FOR THE JOB OF PRODUCING A READILY AND THOROUGHLY VALIDATABLE DATA FORMAT.
CV apparently provides no means for specifying range or formatting of instance values. An "isolation width" (MS:1000023) could happily have a value of "-2", "2", "two", or "extra sprinkles, please".
You could (and should) certainly put some text in the description along the lines of "this is a non-negative floating point value" but that's no help to a validating parser. XSD on the other hand has standardized syntax for enforcing precisely these kinds of restrictions, meaning that validating parsers and code generators (for both read and write) don't need any special-purpose logic added.

There are a handful of places where value range restrictions have been attempted in the MS CV, but these are awkward because of the tools. The reflectron_state, for example, has two children "on" and "off", but this only confuses things, since these are not *values* of reflectron state but rather *are* reflectron states, a distinction which may be meaningless in English but significant when attempting to create a data structure. Picture how this looks in an instance doc:
<cvParam cvLabel="MS" accession="MS:1000105" name="off" value="" />
I can't think of anything nice to say about that. Better it should read:
<reflectronState accession="MS:1000021" off/>

CONCLUSION: THE CV WORK TO DATE IS IMPORTANT AND USEFUL, BUT SHOULD BE RECAST AS SCHEMA WORK
The CV should not attempt to be a replacement for the schema - it just hasn't got the requisite mechanisms to do the job. The information CV can convey is only a subset of the information that is needed to fully specify a data format. The information in the CV as it stands should be folded into the mzML schema, and maintained therein moving forward. An actual OBO/CV file can be generated as needed.

- Brian

________________________________
From: Brian Pratt [mailto:bri...@in...]
Sent: Friday, October 05, 2007 11:52 PM
To: 'Mass spectrometry standard development'
Subject: more is_a vs. part_of errors?

There are a handful of other cases where it appears that the authors have gotten "is a" and "part_of" confused.
My proposed corrections (IN CAPS) inline:

MS:1000025 "magnetic field strength"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000024 "final MS exponent"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000022 "TOF Total Path Length"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000014 "accuracy"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

((note, these next two are just ugly, see notes at top of message))

MS:1000106 "on"
  is a MS:1000021 "reflectron state"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000105 "off"
  is a MS:1000021 "reflectron state"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

The following changes would make the Thermo and ABI stuff look like all the other vendors:

MS:1000495 "Applied Biosystems"
  part of (IS_A) MS:1000121 "ABI / SCIEX"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000176 "MAT95XP Trap"
  is a (IS_A) MS:1000493 "Finnigan MAT"
  part of MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000175 "MAT95XP"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000174 "MAT900XP Trap"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000173 "MAT900XP"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000172 "MAT253"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

I still think there's a schema in there, albeit jammed in slightly sideways at the moment. (( I don't think that anymore. I think there's a subset of a schema in there. ))

- Brian
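
For anyone who wants to hunt for links like the ones corrected above mechanically, here is a rough sketch in Python. It is not an existing PSI tool. The accessions, names and relationships come from the first chain in the listing; the dictionary layout, the function name `suspicious_links` and the rule itself (anything hanging off an "... attribute" term should be is_a, and an "... attribute" term should be part_of its parent) are assumptions made only for illustration.

```python
# Hypothetical in-memory slice of the CV as it stood before the corrections:
# accession -> (name, relationship to parent, parent accession).
CV = {
    "MS:1000025": ("magnetic field strength", "part_of", "MS:1000480"),
    "MS:1000480": ("analyzer attribute", "is_a", "MS:1000451"),
    "MS:1000451": ("analyzer description", "part_of", "MS:1000463"),
    "MS:1000463": ("instrument description", "part_of", "MS:0000000"),
    "MS:0000000": ("MZ controlled vocabularies", None, None),
}


def suspicious_links(cv):
    """Return (accession, suggested fix) pairs under the assumed rule."""
    flagged = []
    for accession, (name, relation, parent) in cv.items():
        if parent is None:
            continue  # root term, nothing to check
        parent_name = cv[parent][0]
        if relation == "part_of" and parent_name.endswith("attribute"):
            # A child of an "... attribute" term should be a kind of attribute.
            flagged.append((accession, "part_of -> is_a"))
        elif relation == "is_a" and name.endswith("attribute"):
            # An "... attribute" term belongs to its parent description.
            flagged.append((accession, "is_a -> part_of"))
    return flagged


for accession, fix in suspicious_links(CV):
    print(accession, fix)
```

On this slice it flags MS:1000025 and MS:1000480, which matches the IS_A and PART_OF corrections proposed for that chain. A real check would read the OBO file rather than a hand-typed dictionary, and the name-suffix test is only a heuristic.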