From: Brian P. <bri...@in...> - 2007-10-15 22:20:10
Hello All,

I decided to fool around with the validator at http://eddie.thep.lu.se/prodac_validator/validator.pl to see how well validation can be done in the presence of an inadequately specified file format. My plan was to take a valid file, mess with it, and see if the validator would notice.

A little hiccup at first: I gave it the automatically generated file http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/2min.mzML and it doesn't actually validate, claiming a missing index element. Somebody might want to check that out. Then I gave it the hand-rolled http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/tiny4_LTQ-FT.mzML0.99.0.mzML and this validates fine.

So, let the mayhem begin. I tried removing the selectionWindow element surrounding the cvParams declaring the upper and lower bounds of the selection window, but the validator is XSD aware so it caught that easily.

Then I tried changing the accession numbers in the selection window for others that an incautious output module author might honestly confuse them with:

    accession="MS:1000501" name="scan m/z lower limit"

changed to

    accession="MS:1000528" name="lowest m/z value"

The validator caught this as well, flagging the use of accession numbers that were incorrect for that context. But the knowledge behind this doesn't seem to come from the XSD or the CV file. So, how does the validator know?

Observation one: the validator doesn't appear to be open source (or if it is, a prominent link to the source should be provided). The use of a closed source tool like this in a standards effort isn't a good idea, since it's hard to answer questions like the one above.

Apparently the author of the validator made excellent use of the documentation at http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/document/mzML0.99.0_specificationDocument.doc which stipulates in English that the only valid cvParams in that context are:

    <cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/>
    <cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/>

Ignore for the moment that this appears to be an example rather than a spec. Do note though that there's nothing to say that one of each has to be present. Of course a reasonable human would probably infer this, but words like "reasonable human" and "infer" are not really what you want to hear when discussing a machine readable data format standard.

Observation two: I'm not at all keen on the idea of a data format that relies on the understanding and simultaneous maintenance of three different artifacts (xsd, cv, doc), one of which (.doc) is not really machine readable.

I think (but I can't be 100% sure without seeing the code) that the author has done a very good job under the circumstances, but probably had a harder time than was necessary given the bizarre construction of the spec. He or she probably would have appreciated more xsd content to do the heavy lifting, and certainly had to make a few fairly safe guesses along the way, like the "must have one of each of MS:1000501 and MS:1000500" thing.

- Brian
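For illustration only, here is a minimal Python sketch of the kind of context rule discussed above. It is not the validator's actual code, which was not available for inspection. The function name check_selection_windows and the rule table are hypothetical, and the "exactly one of each" requirement is the inferred guess described in the post, not something the XSD, the CV, or the specification document states.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the two accessions the specification document shows for
# selectionWindow, each required exactly once (an inferred rule, not a
# stated one).
REQUIRED_ACCESSIONS = {
    "MS:1000501": "scan m/z lower limit",
    "MS:1000500": "scan m/z upper limit",
}


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}name' to 'name'."""
    return tag.rsplit("}", 1)[-1]


def check_selection_windows(xml_text):
    """Return a list of problems found in selectionWindow elements."""
    problems = []
    root = ET.fromstring(xml_text)
    windows = [el for el in root.iter() if local_name(el.tag) == "selectionWindow"]
    for index, window in enumerate(windows, start=1):
        # Count how often each accession appears among the direct cvParam children.
        counts = Counter(
            child.get("accession")
            for child in window
            if local_name(child.tag) == "cvParam"
        )
        for accession in counts:
            if accession not in REQUIRED_ACCESSIONS:
                problems.append(
                    f"selectionWindow {index}: unexpected accession {accession}"
                )
        for accession, name in REQUIRED_ACCESSIONS.items():
            if counts[accession] != 1:
                problems.append(
                    f"selectionWindow {index}: expected exactly one {accession} "
                    f"({name}), found {counts[accession]}"
                )
    return problems


GOOD = """<selectionWindow>
  <cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/>
  <cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/>
</selectionWindow>"""

# The same substitution tried in the post: a plausible but wrong accession.
BAD = GOOD.replace(
    'accession="MS:1000501" name="scan m/z lower limit"',
    'accession="MS:1000528" name="lowest m/z value"',
)

print(check_selection_windows(GOOD))
print(check_selection_windows(BAD))
```

The point of the sketch is that the rule table lives in code, outside the xsd, the cv, and the .doc, so it becomes a fourth artifact to keep in sync unless the schema itself can express the constraint.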