From: Brian P. <bri...@in...> - 2007-10-16 00:54:59
Hi Eric,

Sorry if I missed anything obvious on the open source nature of the code. Glad to hear it, obviously! It allows me to answer a lot of questions for myself.

The existence of the mapping-ms.xml file was lost on me before now, sorry. I see where it gets us a good deal of the way to where pure xsd would, but not actually all the way. For example, the validator accepts the addition of a dwell time to a selectionWindow:

<cvParam cvLabel="MS" accession="MS:1000502" name="dwell time" value="1800.000000"/>

although I think it's probably nonsensical since it lacks units etc. The validator also happily accepts two copies of that line, in place of the 1000500 and 1000501 lines - all it cares about is seeing two cvParams of the proper inheritance type.

The semantic constraints which can be expressed by the combination of the CV and mappings-ms.xml files with the custom java validation code are pretty crude compared to the capabilities of perfectly standard and language independent XSD. This all seems terribly convoluted, approximate, and error prone... such are the wages of reinventing the wheel.

Brian

_____

From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch
Sent: Monday, October 15, 2007 4:37 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML validator experiences

Hi Brian, thank you for your continued input and effort. I'm sorry I've been slow to respond on many of your posts, I have a bunch of other pots boiling over here. However, I think I can answer your questions here and promote further testing.

1) Regarding 2min.mzML, we'll fix it, thanks.

2) Regarding how does the validator know that MS:1000528 is invalid, please download: http://tools.proteomecenter.org/software/mzMLKit/mzML_0.99.0_large.zip (this is hyperlinked from the main development page http://www.psidev.info/index.php?q=node/257) In it, you will find the semantic validator software. One of the files in the distro is ms-mapping.xml.
It is this file that encodes these rules and is what is used by the semantic validator. This file should be more prominently posted and will be.

3) The semantic validator is FOSS, please see the PSI SVN repository and contribute! https://psidev.svn.sourceforge.net/svnroot/psidev/psi/mzml/ (this is hyperlinked from the main development page http://www.psidev.info/index.php?q=node/257)

4) So, it turns out that the semantic validator is using an XML file to enforce the semantic rules, it is NOT reading the doc. It should be noted that this software and the mapping mechanism were developed originally for the PSI molecular interactions schema. That format uses the same built-in flexibility with semantic validation. We are borrowing that mechanism and software for mzML.

5) Further, in the doc, the cvParams section for each element is meant to represent "Some examples of allowed cvParams (not necessarily complete)". I will clarify that in the doc. Further, one of the things I realized that we need to do is include in the doc the rules set forth in the ms-mapping.xml file. These rules are NOT currently in the doc, but they should be and will be. The doc is actually autogenerated from the other files, so I just need to include some code that parses this ms-mapping file and includes that information in the doc. This will be done for 0.99.1. Thanks!

6) Regarding your Observation Two: It is true that the standard relies on the maintenance of three artifacts: xsd, cv, mapping-ms.xml (not the doc as you had inferred; the doc is essentially autogenerated from the former) (and behind the scenes, the example instance documents also need to be maintained). This translates to the desired-stable schema, the evolving controlled vocabulary, and the evolving ruleset on how you may use the CV within the xsd. This is where we are led by the requirement that the schema be stable with provisions for flexibility in annotating many kinds of mass spec data.

Thanks!
Eric

_____

From: psi...@li...
[mailto:psi...@li...] On Behalf Of Brian Pratt
Sent: Monday, October 15, 2007 3:19 PM
To: 'Mass spectrometry standard development'
Subject: [Psidev-ms-dev] mzML validator experiences

Hello All,

I decided to fool around with the validator at http://eddie.thep.lu.se/prodac_validator/validator.pl to see how well that can be done in the presence of an inadequately specified file format. My plan was to take a valid file, mess with it, and see if the validator would notice.

A little hiccup at first - I gave it the automatically generated file http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/2min.mzML - it doesn't actually validate, claiming a missing index element. Somebody might want to check that out. Then I gave it the handrolled http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/tiny4_LTQ-FT.mzML0.99.0.mzML - this validates fine.

So, let the mayhem begin. I tried removing the selectionWindow element surrounding the cvParams declaring the upper and lower bounds of the selection window, but the validator is XSD aware so it caught that easily.

Then I tried changing the accession numbers in the selection window for others that might be honestly conceptually mistaken by an incautious output module author:

accession="MS:1000501" name="scan m/z lower limit"

changed to

accession="MS:1000528" name="lowest m/z value"

The validator caught this as well, flagging the use of accession numbers that were incorrect for that context. But the knowledge behind this doesn't seem to come from the XSD or the CV file. So, how does the validator know?

Observation one: the validator doesn't appear to be open source (or if it is, a prominent link to the source should be provided). The use of a closed source tool like this in a standards effort isn't a good idea, since it's hard to answer questions like the one above.
Apparently the author of the validator made excellent use of the documentation at http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/document/mzML0.99.0_specificationDocument.doc which stipulates in English that the only valid cvParams in that context are:

<cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/>
<cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/>

Ignore for the moment that this appears to be an example rather than a spec. Do note though that there's nothing to say that one of each has to be present. Of course a reasonable human would probably infer this, but words like "reasonable human" and "infer" are not really what you want to hear when discussing a machine readable data format standard.

Observation two: I'm not at all keen on the idea of a data format that relies on the understanding and simultaneous maintenance of three different artifacts (xsd, cv, doc), one of which (.doc) is not really machine readable. I think (but I can't be 100% sure without seeing the code) that the author has done a very good job under the circumstances, but probably had a harder time than was necessary given the bizarre construction of the spec. He or she probably would have appreciated more xsd content to do the heavy lifting, and certainly had to make a few fairly safe guesses along the way like the "must have one of each of MS:1000501 and MS:1000500" thing.

- Brian
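
The "must have one of each of MS:1000501 and MS:1000500" rule discussed in this thread can be stated mechanically. What follows is only a minimal sketch in Python, assuming namespace-free selectionWindow fragments like the ones quoted above. The function name check_selection_window, the sample strings, and the choice to reject every other accession are illustrative assumptions; this is not the PSI validator's code and does not reflect the actual contents of the mapping file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Accessions quoted in the thread: scan m/z lower limit and upper limit.
REQUIRED = ("MS:1000501", "MS:1000500")


def check_selection_window(xml_text):
    """Return a list of problems found in one selectionWindow fragment.

    Hypothetical rule (an assumption for illustration): exactly one cvParam
    for each accession in REQUIRED, and no other accessions.
    """
    window = ET.fromstring(xml_text)
    # Count how often each accession occurs; a missing attribute counts as "".
    counts = Counter(p.get("accession", "") for p in window.iter("cvParam"))
    problems = []
    for acc in REQUIRED:
        if counts[acc] != 1:
            problems.append(f"expected exactly one {acc}, found {counts[acc]}")
    for acc in sorted(set(counts) - set(REQUIRED)):
        problems.append(f"unexpected cvParam {acc}")
    return problems


# The pair of cvParams quoted from the specification document.
GOOD = (
    "<selectionWindow>"
    '<cvParam cvLabel="MS" accession="MS:1000501" '
    'name="scan m/z lower limit" value="400.000000"/>'
    '<cvParam cvLabel="MS" accession="MS:1000500" '
    'name="scan m/z upper limit" value="1800.000000"/>'
    "</selectionWindow>"
)

# Two copies of the dwell time line, the case reported as accepted.
DUPLICATED = (
    "<selectionWindow>"
    '<cvParam cvLabel="MS" accession="MS:1000502" '
    'name="dwell time" value="1800.000000"/>'
    '<cvParam cvLabel="MS" accession="MS:1000502" '
    'name="dwell time" value="1800.000000"/>'
    "</selectionWindow>"
)

if __name__ == "__main__":
    print(check_selection_window(GOOD))
    for problem in check_selection_window(DUPLICATED):
        print(problem)
```

Under these assumptions the quoted pair passes with no problems, while the doubled dwell time fragment is reported both for the missing limits and for the unexpected accession. A real mzML instance document declares an XML namespace, so the tag lookup would need to be adjusted accordingly.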