From: Brian P. <bri...@in...> - 2007-10-16 00:54:59
Hi Eric,
Sorry if I missed anything obvious on the open source nature of the code.
Glad to hear it, obviously! It allows me to answer a lot of questions for
myself.
The existence of the mapping-ms.xml file was lost on me before now, sorry.
I see where it gets us a good deal of the way to where pure xsd would, but
not actually all the way.
For example, the validator accepts the addition of a dwell time to a
selectionWindow:
<cvParam cvLabel="MS" accession="MS:1000502" name="dwell time"
value="1800.000000"/>
although I think it's probably nonsensical since it lacks units etc.
The validator also happily accepts two copies of that line, in place of the
1000500 and 1000501 lines - all it cares about is seeing two cvParams of the
proper inheritance type.
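To make the gap concrete, here is a minimal Python sketch of the two kinds of check. It is not the validator's actual code. The set of "same parent" accessions is an assumption inferred from the behavior described above, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the validator treats these three terms as children of the same
# parent term, which is why the dwell time line is accepted in this context.
SAME_PARENT = {"MS:1000500", "MS:1000501", "MS:1000502"}
# What a human reader would expect: one upper limit and one lower limit.
REQUIRED_ONCE = ("MS:1000500", "MS:1000501")

VALID = """<selectionWindow>
  <cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/>
  <cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/>
</selectionWindow>"""

DOUBLED = """<selectionWindow>
  <cvParam cvLabel="MS" accession="MS:1000502" name="dwell time" value="1800.000000"/>
  <cvParam cvLabel="MS" accession="MS:1000502" name="dwell time" value="1800.000000"/>
</selectionWindow>"""


def accessions(fragment):
    """Return the accession of every cvParam in the fragment, in order."""
    root = ET.fromstring(fragment)
    return [p.get("accession") for p in root.iter("cvParam")]


def lenient_check(fragment):
    """Pass when there are two cvParams of the accepted type, whatever they are."""
    accs = accessions(fragment)
    return len(accs) == 2 and all(a in SAME_PARENT for a in accs)


def strict_check(fragment):
    """Pass only when each required term appears exactly once and nothing else does."""
    counts = Counter(accessions(fragment))
    return set(counts) == set(REQUIRED_ONCE) and all(c == 1 for c in counts.values())
```

With these definitions, DOUBLED passes lenient_check but fails strict_check, while VALID passes both.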
The semantic constraints which can be expressed by the combination of the CV
and mapping-ms.xml files with the custom Java validation code are pretty
crude compared to the capabilities of perfectly standard and language
independent XSD.
This all seems terribly convoluted, approximate, and error prone... such are
the wages of reinventing the wheel.
Brian
_____
From: psi...@li...
[mailto:psi...@li...] On Behalf Of Eric
Deutsch
Sent: Monday, October 15, 2007 4:37 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] mzML validator experiences
Hi Brian, thank you for your continued input and effort. I'm sorry I've been
slow to respond on many of your posts; I have a bunch of other pots boiling
over here. However, I think I can answer your questions here and promote
further testing.
1) Regarding 2min.mzML, we'll fix it, thanks.
2) Regarding how does the validator know that MS:1000528 is invalid, please
download:
http://tools.proteomecenter.org/software/mzMLKit/mzML_0.99.0_large.zip
(this is hyperlinked from the main development page
http://www.psidev.info/index.php?q=node/257)
In it, you will find the semantic validator software. One of the files in
the distro is ms-mapping.xml. It is this file that encodes these rules and
is what is used by the semantic validator. This file should be more
prominently posted and will be.
3) The semantic validator is FOSS, please see the PSI SVN repository and
contribute!
https://psidev.svn.sourceforge.net/svnroot/psidev/psi/mzml/
(this is hyperlinked from the main development page
http://www.psidev.info/index.php?q=node/257)
4) So, it turns out that the semantic validator is using an XML file to
enforce the semantic rules; it is NOT reading the doc. It should be noted
that this software and the mapping mechanism were developed originally for
the PSI molecular interactions schema. That format uses the same built-in
flexibility with semantic validation. We are borrowing that mechanism and
software for mzML.
5) Further, in the doc, the cvParams section for each element is meant to
represent "Some examples of allowed cvParams (not necessarily complete)". I
will clarify that in the doc. Further, one of the things I realized we
need to do is include in the doc the rules set forth in the ms-mapping.xml
file. These rules are NOT currently in the doc, but they should be and will
be. The doc is actually autogenerated from the other files, so I just need
to include some code that parses this ms-mapping file and includes that
information in the doc. This will be done for 0.99.1. Thanks!
6) Regarding your Observation Two: It is true that the standard relies on
the maintenance of three artifacts: xsd, cv, mapping-ms.xml (not the doc as
you had inferred; the doc is essentially autogenerated from the former) (and
behind the scenes, the example instance documents also need to be
maintained). This translates to the desired-stable schema, the evolving
controlled vocabulary, and the evolving ruleset on how you may use the CV
within the xsd. This is where we are led by the requirement that the schema
be stable with provisions for flexibility in annotating many kinds of mass
spec data.
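As a rough illustration of point 5, the step that turns mapping rules into doc text could look something like the Python sketch below. The layout of the real mapping file is not shown in this thread, so the element and attribute names (mapping, rule, term, requirement) are hypothetical, as is the function name.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping layout, invented for this sketch only.
SAMPLE_MAPPING = """
<mapping>
  <rule element="selectionWindow" requirement="MUST">
    <term accession="MS:1000500" name="scan m/z upper limit"/>
    <term accession="MS:1000501" name="scan m/z lower limit"/>
  </rule>
</mapping>
"""


def rules_as_text(mapping_xml):
    """Render each rule as a heading line followed by its indented terms."""
    root = ET.fromstring(mapping_xml)
    lines = []
    for rule in root.findall("rule"):
        lines.append(f"{rule.get('element')} ({rule.get('requirement')}):")
        for term in rule.findall("term"):
            lines.append(f"  {term.get('accession')}  {term.get('name')}")
    return lines


if __name__ == "__main__":
    print("\n".join(rules_as_text(SAMPLE_MAPPING)))
```

Generating the text straight from the rule file would keep the doc from drifting away from what the validator enforces.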
Thanks!
Eric
_____
From: psi...@li...
[mailto:psi...@li...] On Behalf Of Brian
Pratt
Sent: Monday, October 15, 2007 3:19 PM
To: 'Mass spectrometry standard development'
Subject: [Psidev-ms-dev] mzML validator experiences
Hello All,
I decided to fool around with the validator at
http://eddie.thep.lu.se/prodac_validator/validator.pl to see how well that
can be done in the presence of an inadequately specified file format. My
plan was to take a valid file, mess with it, and see if the validator would
notice.
A little hiccup at first - I gave it the automatically generated file
http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/2min.mzML
- it doesn't actually validate, claiming a missing index element. Somebody
might want to check that out.
Then I gave it the handrolled
http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/tiny4_LTQ-FT.mzML0.99.0.mzML
- this validates fine. So, let the mayhem
begin.
I tried removing the selectionWindow element surrounding the cvParams
declaring the upper and lower bounds of the selection window, but the
validator is XSD aware so it caught that easily.
Then I tried changing the accession numbers in the selection window for
others that might be honestly conceptually mistaken by an incautious output
module author:
accession="MS:1000501" name="scan m/z lower limit"
changed to
accession="MS:1000528" name="lowest m/z value"
the validator caught this as well, flagging the use of accession numbers
that were incorrect for that context. But the knowledge behind this doesn't
seem to come from the XSD or the CV file. So, how does the validator know?
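For illustration, a context-sensitive check of this kind needs two inputs beyond the XSD: the is_a hierarchy from the CV and a rule saying which branch of that hierarchy is allowed under which element. The Python sketch below shows the idea only. The parent identifiers are placeholders and not real CV accessions, and the rule table is hypothetical.

```python
# Placeholder is_a edges; the PARENT:... identifiers are invented for this sketch.
IS_A = {
    "MS:1000500": "PARENT:selection-window-attribute",
    "MS:1000501": "PARENT:selection-window-attribute",
    "MS:1000528": "PARENT:spectrum-attribute",
}

# Hypothetical rule table: element name -> ancestor term its cvParams must descend from.
RULES = {"selectionWindow": "PARENT:selection-window-attribute"}


def is_descendant(accession, ancestor):
    """Walk the is_a chain upward from accession, looking for ancestor."""
    seen = set()
    while accession in IS_A and accession not in seen:
        seen.add(accession)  # guard against cycles in a malformed hierarchy
        accession = IS_A[accession]
        if accession == ancestor:
            return True
    return False


def allowed(element, accession):
    """True when the accession falls under the branch the rule names for element."""
    ancestor = RULES.get(element)
    return ancestor is not None and is_descendant(accession, ancestor)
```

Under these assumptions, allowed("selectionWindow", "MS:1000501") holds while allowed("selectionWindow", "MS:1000528") does not, which matches the behavior observed above.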
Observation one: the validator doesn't appear to be open source (or if it
is, a prominent link to the source should be provided). The use of a closed
source tool like this in a standards effort isn't a good idea, since it's
hard to answer questions like the one above.
Apparently the author of the validator made excellent use of the
documentation at
http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/document/mzML0.99.0_specificationDocument.doc
which stipulates in English that the only valid cvParams in that context
are:
<cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit"
value="400.000000"/>
<cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit"
value="1800.000000"/>
Ignore for the moment that this appears to be an example rather than a spec.
Do note though that there's nothing to say that one of each has to be
present. Of course a reasonable human would probably infer this, but words
like "reasonable human" and "infer" are not really what you want to hear
when discussing a machine readable data format standard.
Observation two: I'm not at all keen on the idea of a data format that
relies on the understanding and simultaneous maintenance of three different
artifacts (xsd, cv, doc), one of which (.doc) is not really machine
readable.
I think (but I can't be 100% sure without seeing the code) that the author
has done a very good job under the circumstances, but probably had a harder
time than was necessary given the bizarre construction of the spec. He or
she probably would have appreciated more xsd content to do the heavy
lifting, and certainly had to make a few fairly safe guesses along the way
like the "must have one of each of MS:1000501 and MS:1000500" thing.
- Brian