From: Fredrik L. <Fre...@im...> - 2007-11-23 14:08:14
Hi Henk, Randy and others,

I think that normally you will produce two separate mzML files from that workflow. The first one will represent all the MS spectra collected in the first run, and the second one will contain a mixture of MS scans and MS/MS scans from the run that is performed with an inclusion list (pick list). The second file would look similar to the file: http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/1min.mzML where some spectra are MS (level one) and some MS/MS (level two). The picked masses are found under 'precursor' for the MS/MS spectra. In addition, probably the complete inclusion list should be given as cvParams or userParams in a 'referencableParamGroup' to specify which peaks the instrument was programmed to look for.

One could imagine that you construct a third mzML file which is assembled from the first two files, but I'm not sure if that is allowed within the standard, since only one 'run' can be specified. What would be the preferred way to accomplish this? analysisXML or mzML? Has anyone created an mzML file from multiple runs?

Regards
Fredrik

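As a purely illustrative sketch of the arrangement Fredrik describes - the element names, ids, param names, and m/z values here are hypothetical and may not match the draft schema exactly - the second run's file might carry the inclusion list in a referencableParamGroup and record each picked mass under the precursor of the corresponding MS/MS spectrum:

    <!-- Hypothetical sketch only; names, ids and values are illustrative -->
    <referencableParamGroup id="inclusionList">
      <userParam name="inclusion list m/z" value="445.12"/>
      <userParam name="inclusion list m/z" value="512.30"/>
    </referencableParamGroup>

    <spectrum id="S102">
      <userParam name="ms level" value="2"/>
      <precursorList count="1">
        <precursor spectrumRef="S101">
          <!-- the picked mass that triggered this fragmentation scan -->
          <userParam name="selected ion m/z" value="445.12"/>
        </precursor>
      </precursorList>
      <!-- the m/z and intensity binary arrays of the fragment spectrum follow here -->
    </spectrum>

The first run's file would contain only the MS (level one) spectra and would not need the inclusion-list group.
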
From: Randy J. <rkj...@in...> - 2007-11-23 13:37:28
Hi Henk,

mzML was designed for the application you described. Take a look at the specification document: http://www.psidev.info/index.php?q=node/303

In this public comment release, the spectrum element allows multiple binary arrays to be stored. The main ones would be m/z and intensity. The thought was there could be others - like picked peaks. We have wrestled with allowing human-readable arrays and I think the group concluded they would be too confusing. There are many ways to do human-readable arrays, and that violates the goal of minimizing 'ways to represent the same thing' in the standard - a very good goal.

This means that you will either have to encode the peak list in binary, or use the cvParam or userParam elements. I would recommend that we adopt a standard nomenclature for picked peaks and represent this in cvParams for situations where there are not too many.

The fragmentation spectra can be stored directly and are best represented in the binaryDataArray - this is what it was meant for. If you have a large number of picked peaks, this binary array is also the best way to store this type of data.

As for 'fragments' of mzML, the spectrum element does have an ID attribute. In theory, this means that each spectrum is uniquely identified in the file and could be returned as part of a query (I'm thinking XQuery-style extraction from the document). While the spectrum element is not self-contained, it is 'identifiable', so it is a candidate for a return value from an XQuery or an LSID request - I don't think we have gotten that far yet - any suggestions?

Read through the specification and let us know if you think it's unclear how the standard could do what you want. We are at the point where external readers are needed.

Thanks,
Randy Julian

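To make the two options above concrete - this is a hypothetical fragment, with the array labels, id value, and base64 placeholders invented for illustration - a spectrum could carry the usual m/z and intensity arrays plus a third binaryDataArray holding the picked peaks, and its id attribute is what an XQuery-style request would match on:

    <!-- Hypothetical sketch; labels, id and base64 content are placeholders -->
    <spectrum id="S0057">
      <binaryDataArrayList count="3">
        <binaryDataArray>
          <userParam name="array name" value="m/z"/>
          <binary>AAAAAAAAAAA=</binary>
        </binaryDataArray>
        <binaryDataArray>
          <userParam name="array name" value="intensity"/>
          <binary>AAAAAAAAAAA=</binary>
        </binaryDataArray>
        <binaryDataArray>
          <userParam name="array name" value="picked peaks"/>
          <binary>AAAAAAAAAAA=</binary>
        </binaryDataArray>
      </binaryDataArrayList>
    </spectrum>

For a short pick list, the same information could instead be written as a handful of cvParam or userParam elements on the spectrum, as suggested above.
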
From: Toorn, H.W.P. v. d. (Henk) <H.W...@uu...> - 2007-11-22 15:23:45
Dear developers,

I have some questions concerning the mzML format. We have some collaborators who are forced to use MS-peak pick files in order to target peaks for MS-MS in a later run. To be more clear, the workflow would be: do an MS run, pick the peaks you are interested in, rerun the MS, use the list of picked peaks to do further fragmentation.

My questions are: would it be possible to store such picked peaks in a part of the mzML file, together with the original MS spectra and the resulting MS-MS fragmentations? Are there any obvious ways that fragments of the mzML files could be used as an intermediary file format?

Thanks in advance,
Henk van den Toorn

From: Brian P. <bri...@in...> - 2007-10-18 20:08:37
Hi Chris,

Quite right, with namespaces and all a combination of related schemas operates as one; it's just a bit more complex to deal with mentally. I just don't want any unwarranted complexity. If there are aspects of the MS raw data specification that are known to be volatile, then moving them to a child schema might make sense, but so far I'm not sure what these are. I do know I see things in the mapping file that are clearly not volatile, like scan window. This picture will no doubt become clearer as things get moved into the xsd.

I don't think anyone means to tie the xsd to a single CV; it's just that at the moment there's only one CV we're aware of that's useful in this context. The idea is that each element points at a CV entry, which doesn't have to be in the MS CV necessarily. When OBI is ready, we can update the xsd to point there as well. It won't destabilize any systems already using mzML, so I think it still meets your test of a proper standard. Third parties aren't really so interested in the assurance of a schema with an unchanging version number as they are in the assurance of not having to revisit their code every other week.

It's extremely useful to hear what you've heard elsewhere - thanks for sticking your neck out!

Brian

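As a rough sketch of the 'each element points at a CV entry' idea above - purely hypothetical, with the attribute names and the type name invented for illustration - the core xsd could give an element's type optional attributes naming the CV and the term it maps to, and a later revision could add a second pair pointing at OBI without breaking existing documents:

    <!-- Hypothetical sketch: optional CV-pointer attributes on a core-schema type;
         attribute and type names are illustrative only -->
    <xs:complexType name="ScanWindowType">
      <xs:sequence>
        <!-- existing content model left unchanged -->
      </xs:sequence>
      <xs:attribute name="cvRef" type="xs:string" use="optional" default="MS"/>
      <xs:attribute name="cvAccession" type="xs:string" use="optional"/>
    </xs:complexType>
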
From: Chris T. <chr...@eb...> - 2007-10-18 19:19:13
But with namespaces and all surely what is in physical reality two schemata can be operated as one anyway? So does it really make a huge difference? I'm actually asking rather than being rhetorical...

Despite all the arguments I perceive 'proper' standards as being completely static apart from through (infrequent) versions (XML Schema itself, for example). Maybe I have a biased notion of standards but should we not be making a core thing that is static and keeping the volatile stuff in the second one? And I do still see a tie to one CV as bundling for no reason -- it's a short term gain (a year or so, which means that just at the point that we have good implementations, it'll be change-o time).

I dunno. I'm as I said just throwing in opinions I've heard elsewhere mostly. On balance it really comes down to pragmatism versus kind/strength of assurance (to third parties). I'm gonna pull my head in now anyway :)

Cheers, Chris.

From: Brian P. <bri...@in...> - 2007-10-18 18:01:26
Hey All,

It's true that in practice most day-to-day consumers of mzML files will not bother with validation. The value of the detailed validation capability of a fully realized xsd is largely seen during the *development* of the readers and writers, not in their day-to-day operation. (Of course it's also seen in their day-to-day operation, because they work properly, having been written properly.)

Ideally we would test every conceivable combination of writer and reader, but since we can't expect to do that (we can't start until everybody finishes, and imagine the back and forth!) we instead have to make it possible for the writers to readily check their work in syntactic and semantic detail, and for the readers to not have to make a lot of guesses about what they're likely to see. The fully realized xsd helps on both counts - ready validation for the writers, and a clear spec for the readers. It also gives the possibility of automatically generated code as a jumping-off point for the programmers of both readers and writers, which can reduce defect rates.

Matt asks if I envision one schema or two. We need to go out the gate with one schema that expresses everything we know we want to say today (including any intelligence in the current mapping file, plus more detail). The anticipated need for vendors to extend the schema independent of the official schema release cycle (our "stability" goal) is then handled by schemas the vendors create, which inherit from and extend the standard schema. The proposed idea of a second schema from the get-go just to layer on the CV mappings is unwarranted complexity. These mappings belong in the core xsd as (optional) attributes of the various elements; when that one-time OBI event comes, we'll just update the core xsd to add attributes that indicate relationships from elements to the new CV as well. It's far enough away not to threaten the appearance of stability in the spec, and in any case it won't break backward compatibility.

The important point about hard-coding rules vs. expressing relationships and constraints in the xsd is one of economies of scale. It was asked whether hard coding was any more work than getting the schema right: the answer is yes, as it has to be done repeatedly, once per validating reader implementation (not everyone uses Java, or is even allowed to use open source code in their product). Why make everyone reinvent the wheel and probably get it wrong, when we have a nice, standard, language-independent means of expressing those constraints?

It just comes down to KISS: Keep It Simple, Stupid! (not calling names here, that's just the acronym as I learned it). We're here to deal with MS raw data transfer, not to design new data format description languages. More than once on this list I've seen snarky asides about coders who aren't up to muscling through these proposed convolutions, but a truly competent coder is professionally lazy (managers prefer "elegant"). Moreover, a standards effort is supposed to consolidate the efforts of the community so its individuals can get on with their real work - we shouldn't be blithely proposing things that create more individual work than they absolutely need to.

- Brian

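A minimal sketch of the vendor-extension route Brian describes, assuming the core schema exposes a complex type for spectrum (the namespaces, file name, and type names below are hypothetical, not taken from any released schema): a vendor schema imports the standard namespace unchanged and derives its own type by extension, so vendor additions live outside the official release cycle.

    <!-- Hypothetical vendor extension schema; all names are illustrative only -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:mz="http://psidev.example.org/mzML"
               targetNamespace="http://vendor.example.com/mzML-extensions"
               elementFormDefault="qualified">

      <!-- pull in the official mzML schema unchanged -->
      <xs:import namespace="http://psidev.example.org/mzML"
                 schemaLocation="mzML.xsd"/>

      <!-- derive from the standard spectrum type and append vendor-specific content -->
      <xs:complexType name="VendorSpectrumType">
        <xs:complexContent>
          <xs:extension base="mz:SpectrumType">
            <xs:sequence>
              <xs:element name="vendorDiagnostics" type="xs:string" minOccurs="0"/>
            </xs:sequence>
          </xs:extension>
        </xs:complexContent>
      </xs:complexType>
    </xs:schema>

Because the derived type only adds to the base, the core schema itself is untouched and documents written against it remain valid.
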
From: Chris T. <chr...@eb...> - 2007-10-18 16:37:26
Hiya. Matthew Chambers wrote: > I'm glad we're getting good participation and discussion of this issue > now! Chris, your characterization is a reasonable one for the > two-schema approach I described. > > To respond to qualification of the current state of affairs, I'll quote > something you said the other day: >> Clearly we need the basic (and rilly rilly easy to do) syntactic >> validation provided by a fairly rich XML schema. > This is not clear to me. I do not see a clear advantage to validating > syntax and not validating semantics. In my experience, reading a file > with invalid semantics is as likely to result in a parser error as > reading a file with invalid syntax (although I admit that implementing > error handling for semantic errors tends to be more intuitive). The only thing I'd say here is that there is a minimum effort option available for implementers who cannot or choose not to validate content -- i.e. the 'core' schema is there to allow syntactic validation only, the extended schema you suggested would then allow the Brians and yourselves of this world to do more. Seems a neat solution. That said I don't contest your assertion that the more thorough the validation, the more likely one is to catch the subtle errors as well as the gross ones. >> But supporting >> the kinds of functionality discussed (which would mean the CV >> rapidly becoming a 'proper' ontology, which we don't have the >> person-hours to do right btw) is really just a nice to have at >> the moment. True semantic validation is just about feasible but >> _isn't_ practical imho. > I think you misunderstood the functionality I was suggesting to be added > to the CV. I was not suggesting significant logic changes in the CV, > only a simple instance_of relationship added to every controlled value > to link it to its parent category: "LTQ" is a controlled value, and it > should be an 'instance_of' an "instrument model", which is a controlled > category. In my view, the distinction between controlled values and > categories in the CV is crucial and it doesn't come close to making the > CV any more of a 'proper' ontology (i.e. that machines can use to gain > knowledge about the domain without human intervention). It would, > however, mean that a machine could auto-generate a schema from the CV, > which is what I was aiming for. :) I don't really agree with the idea > that the PSI MS CV should be a filler which gets replaced by the OBI CV > whenever it comes about, but if that's the consensus view then that > would be reason enough to give up the idea of using the CV to > auto-generate the schema. Thing here is that I heard several people assert (not on here) that defining terminating endpoints is storing up trouble and instances are therefore hostages to fortune; you'll just end up making a new class and deprecating the instance. Obviously there are clear endpoints (is there only one variant of an LTQ btw? is it a child or a sib to have an LTQ-FT?) but there are also going to be mistakes made -- rope to hang ourselves (overly dramatic phrase but nonetheless). Then there is the case where people _want_ to use a more generic parent (not sure how many there are in the CV tbh as it is quite flat iirc but still there are many ontologies in the world where the nodes are used as much as the leaves). A (simple-ish) example off the top of my head (not necessarily directly applicable, just for the principle) would be where someone has a machine not yet described and just wants to say something about it. 
>> Certainly for all but the most dedicated >> coders it is a pipe dream. All that can realistically be hoped >> for at the moment is correct usage (i.e. checking in an >> application of some sort that the term is appropriate given its >> usage), for which this wattage of CV is just fine.This is what >> the MIers have done -- a java app uses hard-coded rules to check >> usage (and in that simple scenario the intelligent use of >> class-superclass stuff can bring benefits). > It seems here you DO suggest validating semantics, but instead of doing > it with the CV/schema it must be implemented manually by hard-coding the > rules into a user application. Right now, there is no way (short of > parsing the ms-mapping file and adopting that format) to get that kind > of validation without the hard-coding you mention. Brian and I both > think that a proper specification should include a way to get this kind > of validation without hard-coding the rules, even if applications choose > not to use it. I think in the absence of an ontology to afford this sort of functionality (and with one expected), hard coding is not an awful solution (the workload for your suggestion wouldn't be orders of magnitude different would it, bearing in mind this is a temporary state of affairs so not subject to years of maintenance?). The MI group certainly went this route straight off the bat... At the risk of becoming dull, I'd restate that this is why I like the separable schemata you suggested, as we get the best of both worlds no? >> But what they're not >> doing is something like (for MS now) I have a Voyager so why on >> earth do I have ion trap data -- sound the klaxon; this can only >> come from something of the sophistication of OBI (or a _LOT_ of >> bespoke coding), which is in a flavour of OWL (a cruise liner to >> OBO's dinghy). > It's true, AFAIK, that validating (for example) the value of the "mass > analyzer" category based on the value provided for the "instrument > model" category is not possible with the current CV/schema. It is not > even possible after the extensions proposed by Brian or me. Such > functionality would require a much more interconnected CV (and the XSD > schema would be so confusing to maintain that it would almost certainly > have to be auto-generated from the CV). I don't think anybody > particularly expects this functionality either, so we needn't worry > about it. :) Well I'm kind of hoping we will ultimately be able to get this from OBI, which is being built in a very thorough and extensible (in terms of the richness of relations between classes) manner. Cheers, Chris. > -Matt > > > Chris Taylor wrote: >> Hiya. >> >> So your solution can, if I understand correctly, be >> characterised as formalising the mapping file info in an XSD >> that happens (for obvious reasons) to inherit from the main >> schema? If so, then as long as everyone likes it, I see that as >> a nice, neat, robust solution. >> >> Funnily enough I was chatting to a fellow PSIer yesterday about >> the mapping file(s) (this is cross-WG policy stuff you see) and >> enquired as to the current nature of the thing. I think if there >> is a clamour to formalise the map then hopefully there will be a >> response. To qualify the current state of affairs though, this >> was not meant to be a formal part of the standard -- more >> something akin to documentation (it didn't exist at all at one >> point -- bridging the gap was something done in the CV, which is >> not a great method for a number of reasons). 
>> >> Cheers, Chris. >> >> >> Matthew Chambers wrote: >> >>> If the consensus is that the CV should be left simple like it is now, >>> then I must agree with Brian. The current schema is incapable of doing >>> real validation, and the ms-mapping file is worse than a fleshed-out CV >>> or XSD (it's more confusing, it takes longer to maintain, and it's >>> non-standard). >>> >>> I still want Brian to clarify if he wants a one-schema spec or a >>> two-schema spec. I support the latter approach, where one schema is a >>> stable, syntactical version and the other inherits from the first one >>> and defines all the semantic restrictions as well. It would be up to >>> implementors which schema to use for validation, and of course only the >>> syntactical schema would be "stable" because the semantic restrictions >>> in the second schema would change to match the CV whenever it was updated. >>> >>> -Matt >>> >>> > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > -- ~~~~~~~~~~~~~~~~~~~~~~~~ chr...@eb... http://mibbi.sf.net/ ~~~~~~~~~~~~~~~~~~~~~~~~ |
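A concrete (and purely illustrative) sketch of the two-schema layering discussed above, assuming hypothetical file names, an abridged cvParam attribute list, and xs:redefine as the inheritance mechanism so that instance documents need not change; a real generated layer would tighten each context with enumerations pulled from the CV rather than the simple accession pattern shown here:

    <!-- mzML-core.xsd (hypothetical name): stable, syntax-only layer -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="cvParamType">
        <xs:attribute name="cvLabel"   type="xs:string" use="required"/>
        <xs:attribute name="accession" type="xs:string" use="required"/>
        <xs:attribute name="name"      type="xs:string" use="required"/>
        <xs:attribute name="value"     type="xs:string" use="optional"/>
      </xs:complexType>
    </xs:schema>

    <!-- mzML-semantic.xsd (hypothetical name): regenerated whenever the CV changes -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:redefine schemaLocation="mzML-core.xsd">
        <xs:complexType name="cvParamType">
          <xs:complexContent>
            <xs:restriction base="cvParamType">
              <xs:attribute name="accession" use="required">
                <xs:simpleType>
                  <xs:restriction base="xs:string">
                    <xs:pattern value="MS:[0-9]{7}"/>
                  </xs:restriction>
                </xs:simpleType>
              </xs:attribute>
            </xs:restriction>
          </xs:complexContent>
        </xs:complexType>
      </xs:redefine>
    </xs:schema>

Validating against the core file gives the minimum-effort syntactic check; validating against the generated file gives the stricter check, and no parser has to change either way.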
From: Matthew C. <mat...@va...> - 2007-10-18 16:14:41
|
I'm glad we're getting good participation and discussion of this issue now! Chris, your characterization is a reasonable one for the two-schema approach I described. To respond to qualification of the current state of affairs, I'll quote something you said the other day: > Clearly we need the basic (and rilly rilly easy to do) syntactic > validation provided by a fairly rich XML schema. This is not clear to me. I do not see a clear advantage to validating syntax and not validating semantics. In my experience, reading a file with invalid semantics is as likely to result in a parser error as reading a file with invalid syntax (although I admit that implementing error handling for semantic errors tends to be more intuitive). > But supporting > the kinds of functionality discussed (which would mean the CV > rapidly becoming a 'proper' ontology, which we don't have the > person-hours to do right btw) is really just a nice to have at > the moment. True semantic validation is just about feasible but > _isn't_ practical imho. I think you misunderstood the functionality I was suggesting to be added to the CV. I was not suggesting significant logic changes in the CV, only a simple instance_of relationship added to every controlled value to link it to its parent category: "LTQ" is a controlled value, and it should be an 'instance_of' an "instrument model", which is a controlled category. In my view, the distinction between controlled values and categories in the CV is crucial and it doesn't come close to making the CV any more of a 'proper' ontology (i.e. that machines can use to gain knowledge about the domain without human intervention). It would, however, mean that a machine could auto-generate a schema from the CV, which is what I was aiming for. :) I don't really agree with the idea that the PSI MS CV should be a filler which gets replaced by the OBI CV whenever it comes about, but if that's the consensus view then that would be reason enough to give up the idea of using the CV to auto-generate the schema. > Certainly for all but the most dedicated > coders it is a pipe dream. All that can realistically be hoped > for at the moment is correct usage (i.e. checking in an > application of some sort that the term is appropriate given its > usage), for which this wattage of CV is just fine.This is what > the MIers have done -- a java app uses hard-coded rules to check > usage (and in that simple scenario the intelligent use of > class-superclass stuff can bring benefits). It seems here you DO suggest validating semantics, but instead of doing it with the CV/schema it must be implemented manually by hard-coding the rules into a user application. Right now, there is no way (short of parsing the ms-mapping file and adopting that format) to get that kind of validation without the hard-coding you mention. Brian and I both think that a proper specification should include a way to get this kind of validation without hard-coding the rules, even if applications choose not to use it. > But what they're not > doing is something like (for MS now) I have a Voyager so why on > earth do I have ion trap data -- sound the klaxon; this can only > come from something of the sophistication of OBI (or a _LOT_ of > bespoke coding), which is in a flavour of OWL (a cruise liner to > OBO's dinghy). It's true, AFAIK, that validating (for example) the value of the "mass analyzer" category based on the value provided for the "instrument model" category is not possible with the current CV/schema. 
It is not even possible after the extensions proposed by Brian or me. Such functionality would require a much more interconnected CV (and the XSD schema would be so confusing to maintain that it would almost certainly have to be auto-generated from the CV). I don't think anybody particularly expects this functionality either, so we needn't worry about it. :) -Matt Chris Taylor wrote: > Hiya. > > So your solution can, if I understand correctly, be > characterised as formalising the mapping file info in an XSD > that happens (for obvious reasons) to inherit from the main > schema? If so, then as long as everyone likes it, I see that as > a nice, neat, robust solution. > > Funnily enough I was chatting to a fellow PSIer yesterday about > the mapping file(s) (this is cross-WG policy stuff you see) and > enquired as to the current nature of the thing. I think if there > is a clamour to formalise the map then hopefully there will be a > response. To qualify the current state of affairs though, this > was not meant to be a formal part of the standard -- more > something akin to documentation (it didn't exist at all at one > point -- bridging the gap was something done in the CV, which is > not a great method for a number of reasons). > > Cheers, Chris. > > > Matthew Chambers wrote: > >> If the consensus is that the CV should be left simple like it is now, >> then I must agree with Brian. The current schema is incapable of doing >> real validation, and the ms-mapping file is worse than a fleshed-out CV >> or XSD (it's more confusing, it takes longer to maintain, and it's >> non-standard). >> >> I still want Brian to clarify if he wants a one-schema spec or a >> two-schema spec. I support the latter approach, where one schema is a >> stable, syntactical version and the other inherits from the first one >> and defines all the semantic restrictions as well. It would be up to >> implementors which schema to use for validation, and of course only the >> syntactical schema would be "stable" because the semantic restrictions >> in the second schema would change to match the CV whenever it was updated. >> >> -Matt >> >> > |
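To make the syntax-versus-semantics point concrete: a fragment along the lines of the one below is well-formed and passes validation against a purely syntactic schema, because that schema only asks for a cvParam with some accession and name; nothing in it knows that MS:1000173 names an instrument model rather than a mass analyzer. (The enclosing element name is illustrative, not taken from the draft schema.)

    <analyzer>
      <cvParam cvLabel="MS" accession="MS:1000173" name="MAT900XP"/>
    </analyzer>

Catching this today means the ms-mapping.xml validator or hard-coded rules; with instance_of links in the CV, the same check could fall out of a schema generated from the CV instead.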
From: Chris T. <chr...@eb...> - 2007-10-18 15:39:46
|
Hiya. So your solution can, if I understand correctly, be characterised as formalising the mapping file info in an XSD that happens (for obvious reasons) to inherit from the main schema? If so, then as long as everyone likes it, I see that as a nice, neat, robust solution. Funnily enough I was chatting to a fellow PSIer yesterday about the mapping file(s) (this is cross-WG policy stuff you see) and enquired as to the current nature of the thing. I think if there is a clamour to formalise the map then hopefully there will be a response. To qualify the current state of affairs though, this was not meant to be a formal part of the standard -- more something akin to documentation (it didn't exist at all at one point -- bridging the gap was something done in the CV, which is not a great method for a number of reasons). Cheers, Chris. Matthew Chambers wrote: > If the consensus is that the CV should be left simple like it is now, > then I must agree with Brian. The current schema is incapable of doing > real validation, and the ms-mapping file is worse than a fleshed-out CV > or XSD (it's more confusing, it takes longer to maintain, and it's > non-standard). > > I still want Brian to clarify if he wants a one-schema spec or a > two-schema spec. I support the latter approach, where one schema is a > stable, syntactical version and the other inherits from the first one > and defines all the semantic restrictions as well. It would be up to > implementors which schema to use for validation, and of course only the > syntactical schema would be "stable" because the semantic restrictions > in the second schema would change to match the CV whenever it was updated. > > -Matt > > > Brian Pratt wrote: >> Hi Chris, >> >> Most helpful to have some more background, thanks. Especially in light of >> the idea that the PSI CVs as they stand are fillers to use while OBI gets >> done, your term "bad bundling" is appropriate. >> >> If we go with a fully realized xsd wherein each element definition has a CV >> reference, when OBI comes to fruition we just tweak the xsd. It's a small >> change to the "foo" element definition, which is already declared to have >> the meaning found at "MS:12345", to declare it as also having the meaning >> found at "OB:54321". The point is that it's still a foo element so all >> existing mzML files remain valid, and all those mzML parsers out there don't >> have to be changed. In the currently contemplated mzML you'd have to go >> through all parsers in existence and update them to understand that <cvParam >> accession="OB:54321"/> is the same as <cvParam accession="MS:12345"/>, and >> of course older systems just won't understand it at all. Bad bundling >> indeed! The xsd approach is in fact the more stable one. >> >> It's odd, to say the least, to have the "mortar" of this project (the >> mapping file) not be part of the official standard. It's the only artifact >> we have at the moment, as far as I can see, that attempts to define the >> detailed structure of an mzML file. It's the de facto standard, and "de >> facto" has been identified as a Bad Thing on this list. >> >> So, to recap this and previous posts, the current proposal employs an >> unnecessarily elaborate, nonstandard, inflexible, sneaky, and inadequate way >> to couple mzML to the CV. 
This is readily corrected by moving the mapping >> file content to the xsd which actually forms the standard, then adding >> detail so that, for example, it is clear that a scan window must have both a >> low mz and high mz but dwell time is optional. >> >> Using the CV to define terms is important, but mostly what both vendors and >> users really want from a data format standard is to not be forever tweaking >> readers and writers to adjust to "valid" but unexpected usages. This is >> only achieved by the standard being extremely clear on what "valid" means, >> something the current proposal largely flinches from doing. As currently >> proposed, mzML feels like a big step backwards. >> >> Brian >> >> >> -----Original Message----- >> From: psi...@li... >> [mailto:psi...@li...] On Behalf Of Chris >> Taylor >> Sent: Wednesday, October 17, 2007 2:27 AM >> To: Mass spectrometry standard development >> Cc: Daniel Schober >> Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments >> >> Hiya. >> >> Just a few points: >> >> The CV is deliberately as simple as possible -- just the >> barebones -- enough to find the term you need. In part this is a >> pragmatic outcome from the lack of person-hours, but not >> completely; it is also to avoid the complications of using the >> more complex relationships that are available (roles, for >> example, the benefit of which in this setting is unclear) and >> some of the less standard (=weird) ones. >> >> The CV and the schema should be separable entities imho. Mostly >> this is to allow the use of other CVs/ontologies as they become >> available. If either of these products depends too much on the >> other the result of removing that other would be crippling; this >> is 'bad' bundling, basically. Because they are separate, the >> mapping file for the use of that particular CV with the schema >> is provided. This is a convenience thing for developers, >> basically, which they would be able to figure out for themselves >> given a week, and is no part of any standard. If you recall a >> while ago, the MGED 'ontology' (MO, which is really a CV, hence >> the quotes) got a good kicking in the literature for being >> directly structured around a model/schema (MAGE); there were >> many criticisms voiced there (not all valid, especially the ones >> about process, but nonetheless -- who critiques the critics eh). >> >> On 'other' term sources, consider OBI (the successor to MO, >> inter alia), which is destined ultimately to replace the CVs >> generated by PSI and MGED with a proper ontology supporting all >> sorts of nice things. The OBI dev calls, especially the >> instrument track, would be a _great_ place to redirect this >> enthusiasm to ensure that all is well. Really the PSI CVs as >> they stand are fillers to use while that big job gets done. >> Please I implore you if you really do have major issues/needs, >> go to a few of the OBI calls. For instruments the guy to mail is >> Daniel Schober at EBI (CCed on here); incidentally he also >> handles the needs of the metabolomics community who have >> heee-uge overlaps with PSI (on MS for example) and who will most >> likely use mzML for their MS work also (I co-chair their formats >> WG and have been heavily promoting PSI products to them with an >> eye on the cross-domain integrative thing). Ah synergy. >> >> Clearly we need the basic (and rilly rilly easy to do) syntactic >> validation provided by a fairly rich XML schema. 
But supporting >> the kinds of functionality discussed (which would mean the CV >> rapidly becoming a 'proper' ontology, which we don't have the >> person-hours to do right btw) is really just a nice to have at >> the moment. True semantic validation is just about feasible but >> _isn't_ practical imho. Certainly for all but the most dedicated >> coders it is a pipe dream. All that can realistically be hoped >> for at the moment is correct usage (i.e. checking in an >> application of some sort that the term is appropriate given its >> usage), for which this wattage of CV is just fine. This is what >> the MIers have done -- a java app uses hard-coded rules to check >> usage (and in that simple scenario the intelligent use of >> class-superclass stuff can bring benefits). But what they're not >> doing is something like (for MS now) I have a Voyager so why on >> earth do I have ion trap data -- sound the klaxon; this can only >> come from something of the sophistication of OBI (or a _LOT_ of >> bespoke coding), which is in a flavour of OWL (a cruise liner to >> OBO's dinghy). >> >> Finally, again on where to draw the separating line; the more >> detail in the schema, the more labile that schema. So the schema >> should be as stable as possible (tend towards simpler). That >> schema should also remain as simple to dumb-validate as possible >> (so someone with barely the ability to run a simple validation >> check can wheel out a standard XSD tool and be done -- again >> tend towards simpler). The rest of the ~needed detail has then >> to be elsewhere in that scenario; in the CV (but that also has >> limits as discussed above) and the mapping file (the mortar >> between the bricks). The point is that although that makes work >> for those who really want to go for it on validation (to the >> point of reasoning in some sense), those developing simpler >> implementations will be able to keep things simple (e.g. person >> X uses a simple library to check for well-formedness and >> validity against the XSD, cares not-a-whole-hell-of-a-lot about >> the CV terms used as they know that most came direct from the >> instrument somehow with no user intervention, and just wants a >> coherent file with some metadata around the data to put in a >> database, which is where the CV matters most -- for retrieval). >> To truly go up a level on validation (excepting the halfway >> house of stating which terms [from a _particular_ source] can go >> where) is unrealistic and currently the benefits are minimal I >> would say (compare the effort of implementing to the benefit of >> the 0.1% of files in which you catch an error by that route, or >> the frequency of searches based on proteins/peptides, or on >> atomic terms (possibly AND/OR-ed), to that of searches truly >> exploiting the power of ontologies). >> >> Not that I'm against powerful ontology-based queries supported >> by systems that reason like a herd of ancient g(r)eeks; it'll >> truly rock when it comes and will be key to the provision of >> good integrated (i.e. cross-domain) resources down the line. But >> the time is not now -- we need OBI first. To forcibly mature the >> MS CV to support such functionality is a waste of effort better >> spent in making OBI all it can be. >> >> WHY can I not write a short email (that was rhetorical...) >> >> Cheers, Chris. >> >> > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. 
> Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > -- ~~~~~~~~~~~~~~~~~~~~~~~~ chr...@eb... http://mibbi.sf.net/ ~~~~~~~~~~~~~~~~~~~~~~~~ |
From: Matthew C. <mat...@va...> - 2007-10-18 14:36:25
|
If the consensus is that the CV should be left simple like it is now, then I must agree with Brian. The current schema is incapable of doing real validation, and the ms-mapping file is worse than a fleshed-out CV or XSD (it's more confusing, it takes longer to maintain, and it's non-standard). I still want Brian to clarify if he wants a one-schema spec or a two-schema spec. I support the latter approach, where one schema is a stable, syntactical version and the other inherits from the first one and defines all the semantic restrictions as well. It would be up to implementors which schema to use for validation, and of course only the syntactical schema would be "stable" because the semantic restrictions in the second schema would change to match the CV whenever it was updated. -Matt Brian Pratt wrote: > Hi Chris, > > Most helpful to have some more background, thanks. Especially in light of > the idea that the PSI CVs as they stand are fillers to use while OBI gets > done, your term "bad bundling" is appropriate. > > If we go with a fully realized xsd wherein each element definition has a CV > reference, when OBI comes to fruition we just tweak the xsd. It's a small > change to the "foo" element definition, which is already declared to have > the meaning found at "MS:12345", to declare it as also having the meaning > found at "OB:54321". The point is that it's still a foo element so all > existing mzML files remain valid, and all those mzML parsers out there don't > have to be changed. In the currently contemplated mzML you'd have to go > through all parsers in existence and update them to understand that <cvParam > accession="OB:54321"/> is the same as <cvParam accession="MS:12345"/>, and > of course older systems just won't understand it at all. Bad bundling > indeed! The xsd approach is in fact the more stable one. > > It's odd, to say the least, to have the "mortar" of this project (the > mapping file) not be part of the official standard. It's the only artifact > we have at the moment, as far as I can see, that attempts to define the > detailed structure of an mzML file. It's the de facto standard, and "de > facto" has been identified as a Bad Thing on this list. > > So, to recap this and previous posts, the current proposal employs an > unnecessarily elaborate, nonstandard, inflexible, sneaky, and inadequate way > to couple mzML to the CV. This is readily corrected by moving the mapping > file content to the xsd which actually forms the standard, then adding > detail so that, for example, it is clear that a scan window must have both a > low mz and high mz but dwell time is optional. > > Using the CV to define terms is important, but mostly what both vendors and > users really want from a data format standard is to not be forever tweaking > readers and writers to adjust to "valid" but unexpected usages. This is > only achieved by the standard being extremely clear on what "valid" means, > something the current proposal largely flinches from doing. As currently > proposed, mzML feels like a big step backwards. > > Brian > > > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On Behalf Of Chris > Taylor > Sent: Wednesday, October 17, 2007 2:27 AM > To: Mass spectrometry standard development > Cc: Daniel Schober > Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments > > Hiya. > > Just a few points: > > The CV is deliberately as simple as possible -- just the > barebones -- enough to find the term you need. 
In part this is a > pragmatic outcome from the lack of person-hours, but not > completely; it is also to avoid the complications of using the > more complex relationships that are available (roles, for > example, the benefit of which in this setting is unclear) and > some of the less standard (=weird) ones. > > The CV and the schema should be separable entities imho. Mostly > this is to allow the use of other CVs/ontologies as they become > available. If either of these products depends too much on the > other the result of removing that other would be crippling; this > is 'bad' bundling, basically. Because they are separate, the > mapping file for the use of that particular CV with the schema > is provided. This is a convenience thing for developers, > basically, which they would be able to figure out for themselves > given a week, and is no part of any standard. If you recall a > while ago, the MGED 'ontology' (MO, which is really a CV, hence > the quotes) got a good kicking in the literature for being > directly structured around a model/schema (MAGE); there were > many criticisms voiced there (not all valid, especially the ones > about process, but nonetheless -- who critiques the critics eh). > > On 'other' term sources, consider OBI (the successor to MO, > inter alia), which is destined ultimately to replace the CVs > generated by PSI and MGED with a proper ontology supporting all > sorts of nice things. The OBI dev calls, especially the > instrument track, would be a _great_ place to redirect this > enthusiasm to ensure that all is well. Really the PSI CVs as > they stand are fillers to use while that big job gets done. > Please I implore you if you really do have major issues/needs, > go to a few of the OBI calls. For instruments the guy to mail is > Daniel Schober at EBI (CCed on here); incidentally he also > handles the needs of the metabolomics community who have > heee-uge overlaps with PSI (on MS for example) and who will most > likely use mzML for their MS work also (I co-chair their formats > WG and have been heavily promoting PSI products to them with an > eye on the cross-domain integrative thing). Ah synergy. > > Clearly we need the basic (and rilly rilly easy to do) syntactic > validation provided by a fairly rich XML schema. But supporting > the kinds of functionality discussed (which would mean the CV > rapidly becoming a 'proper' ontology, which we don't have the > person-hours to do right btw) is really just a nice to have at > the moment. True semantic validation is just about feasible but > _isn't_ practical imho. Certainly for all but the most dedicated > coders it is a pipe dream. All that can realistically be hoped > for at the moment is correct usage (i.e. checking in an > application of some sort that the term is appropriate given its > usage), for which this wattage of CV is just fine. This is what > the MIers have done -- a java app uses hard-coded rules to check > usage (and in that simple scenario the intelligent use of > class-superclass stuff can bring benefits). But what they're not > doing is something like (for MS now) I have a Voyager so why on > earth do I have ion trap data -- sound the klaxon; this can only > come from something of the sophistication of OBI (or a _LOT_ of > bespoke coding), which is in a flavour of OWL (a cruise liner to > OBO's dinghy). > > Finally, again on where to draw the separating line; the more > detail in the schema, the more labile that schema. So the schema > should be as stable as possible (tend towards simpler). 
That > schema should also remain as simple to dumb-validate as possible > (so someone with barely the ability to run a simple validation > check can wheel out a standard XSD tool and be done -- again > tend towards simpler). The rest of the ~needed detail has then > to be elsewhere in that scenario; in the CV (but that also has > limits as discussed above) and the mapping file (the mortar > between the bricks). The point is that although that makes work > for those who really want to go for it on validation (to the > point of reasoning in some sense), those developing simpler > implementations will be able to keep things simple (e.g. person > X uses a simple library to check for well-formedness and > validity against the XSD, cares not-a-whole-hell-of-a-lot about > the CV terms used as they know that most came direct from the > instrument somehow with no user intervention, and just wants a > coherent file with some metadata around the data to put in a > database, which is where the CV matters most -- for retrieval). > To truly go up a level on validation (excepting the halfway > house of stating which terms [from a _particular_ source] can go > where) is unrealistic and currently the benefits are minimal I > would say (compare the effort of implementing to the benefit of > the 0.1% of files in which you catch an error by that route, or > the frequency of searches based on proteins/peptides, or on > atomic terms (possibly AND/OR-ed), to that of searches truly > exploiting the power of ontologies). > > Not that I'm against powerful ontology-based queries supported > by systems that reason like a herd of ancient g(r)eeks; it'll > truly rock when it comes and will be key to the provision of > good integrated (i.e. cross-domain) resources down the line. But > the time is not now -- we need OBI first. To forcibly mature the > MS CV to support such functionality is a waste of effort better > spent in making OBI all it can be. > > WHY can I not write a short email (that was rhetorical...) > > Cheers, Chris. > > |
From: sneumann <sne...@ip...> - 2007-10-18 11:31:38
|
On Wed, 2007-10-17 at 12:49 -0700, Brian Pratt wrote: ... > something the current proposal largely flinches from doing. As currently > proposed, mzML feels like a big step backwards. Hi, greetings from one of the "lurkers" on this list. We are operating a number of different MS. Currently, we have used Eclipse EMF to auto-generate Java classes from the mzData.xsd, and from there we connect to a database, using an auto-generated schema through an Object-Relational Mapping (ORM). The raw data is read by the RAMP parser inside the Bioconductor XCMS package. I have the feeling that a data model with very little structure and a well-structured ontology would put a lot of burden on tool and database developers. I expected mzML to be mainly a merger of mzXML and mzData, keeping the best of both worlds, and improving vendor and tools support for a merged standard. In that light I followed the Index, Binary and Wrapper Schema discussion, not responding because I saw that whatever way mzML settled, I'd be able to adopt it by ignoring those features or modifying our tools. At the beginning of the mzML (when it was called dataXML) discussion I also remembered the idea of having a place to store the chromatograms; I am not sure what happened to this. Starting with the CV discussion I felt that mzML is drifting away from its mz[Data|XML] parents. The rationale behind this discussion is to keep up with ever-changing requirements. But hey, mzData started in 2005, and will likely be applicable to the majority of use cases for another (at least?) 1-2 years. I am not sure whether those use cases not covered by mzData can easily be covered with mzML+complexCV, but for speedy adoption by vendors please keep simplicity in mind. Remember people will be writing mzML readers in Java, C++, C# and Mono, perl, Bioconductor, Python, ... and it might turn into a bad reputation for mzML if these implementations are buggy and/or incomplete merely because mzML tries to do too much and people end up hacking the parsers just for their own machine and use case. Yours, Steffen -- IPB Halle AG Massenspektrometrie & Bioinformatik Dr. Steffen Neumann http://www.IPB-Halle.DE Weinberg 3 http://msbi.bic-gh.de 06120 Halle Tel. +49 (0) 345 5582 - 1470 +49 (0) 345 5582 - 0 sneumann(at)IPB-Halle.DE Fax. +49 (0) 345 5582 - 1409 |
From: Brian P. <bri...@in...> - 2007-10-17 19:50:34
|
Hi Chris, Most helpful to have some more background, thanks. Especially in light of the idea that the PSI CVs as they stand are fillers to use while OBI gets done, your term "bad bundling" is appropriate. If we go with a fully realized xsd wherein each element definition has a CV reference, when OBI comes to fruition we just tweak the xsd. It's a small change to the "foo" element definition, which is already declared to have the meaning found at "MS:12345", to declare it as also having the meaning found at "OB:54321". The point is that it's still a foo element so all existing mzML files remain valid, and all those mzML parsers out there don't have to be changed. In the currently contemplated mzML you'd have to go through all parsers in existence and update them to understand that <cvParam accession="OB:54321"/> is the same as <cvParam accession="MS:12345"/>, and of course older systems just won't understand it at all. Bad bundling indeed! The xsd approach is in fact the more stable one. It's odd, to say the least, to have the "mortar" of this project (the mapping file) not be part of the official standard. It's the only artifact we have at the moment, as far as I can see, that attempts to define the detailed structure of an mzML file. It's the de facto standard, and "de facto" has been identified as a Bad Thing on this list. So, to recap this and previous posts, the current proposal employs an unnecessarily elaborate, nonstandard, inflexible, sneaky, and inadequate way to couple mzML to the CV. This is readily corrected by moving the mapping file content to the xsd which actually forms the standard, then adding detail so that, for example, it is clear that a scan window must have both a low mz and high mz but dwell time is optional. Using the CV to define terms is important, but mostly what both vendors and users really want from a data format standard is to not be forever tweaking readers and writers to adjust to "valid" but unexpected usages. This is only achieved by the standard being extremely clear on what "valid" means, something the current proposal largely flinches from doing. As currently proposed, mzML feels like a big step backwards. Brian -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Chris Taylor Sent: Wednesday, October 17, 2007 2:27 AM To: Mass spectrometry standard development Cc: Daniel Schober Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments Hiya. Just a few points: The CV is deliberately as simple as possible -- just the barebones -- enough to find the term you need. In part this is a pragmatic outcome from the lack of person-hours, but not completely; it is also to avoid the complications of using the more complex relationships that are available (roles, for example, the benefit of which in this setting is unclear) and some of the less standard (=weird) ones. The CV and the schema should be separable entities imho. Mostly this is to allow the use of other CVs/ontologies as they become available. If either of these products depends too much on the other the result of removing that other would be crippling; this is 'bad' bundling, basically. Because they are separate, the mapping file for the use of that particular CV with the schema is provided. This is a convenience thing for developers, basically, which they would be able to figure out for themselves given a week, and is no part of any standard. 
If you recall a while ago, the MGED 'ontology' (MO, which is really a CV, hence the quotes) got a good kicking in the literature for being directly structured around a model/schema (MAGE); there were many criticisms voiced there (not all valid, especially the ones about process, but nonetheless -- who critiques the critics eh). On 'other' term sources, consider OBI (the successor to MO, inter alia), which is destined ultimately to replace the CVs generated by PSI and MGED with a proper ontology supporting all sorts of nice things. The OBI dev calls, especially the instrument track, would be a _great_ place to redirect this enthusiasm to ensure that all is well. Really the PSI CVs as they stand are fillers to use while that big job gets done. Please I implore you if you really do have major issues/needs, go to a few of the OBI calls. For instruments the guy to mail is Daniel Schober at EBI (CCed on here); incidentally he also handles the needs of the metabolomics community who have heee-uge overlaps with PSI (on MS for example) and who will most likely use mzML for their MS work also (I co-chair their formats WG and have been heavily promoting PSI products to them with an eye on the cross-domain integrative thing). Ah synergy. Clearly we need the basic (and rilly rilly easy to do) syntactic validation provided by a fairly rich XML schema. But supporting the kinds of functionality discussed (which would mean the CV rapidly becoming a 'proper' ontology, which we don't have the person-hours to do right btw) is really just a nice to have at the moment. True semantic validation is just about feasible but _isn't_ practical imho. Certainly for all but the most dedicated coders it is a pipe dream. All that can realistically be hoped for at the moment is correct usage (i.e. checking in an application of some sort that the term is appropriate given its usage), for which this wattage of CV is just fine. This is what the MIers have done -- a java app uses hard-coded rules to check usage (and in that simple scenario the intelligent use of class-superclass stuff can bring benefits). But what they're not doing is something like (for MS now) I have a Voyager so why on earth do I have ion trap data -- sound the klaxon; this can only come from something of the sophistication of OBI (or a _LOT_ of bespoke coding), which is in a flavour of OWL (a cruise liner to OBO's dinghy). Finally, again on where to draw the separating line; the more detail in the schema, the more labile that schema. So the schema should be as stable as possible (tend towards simpler). That schema should also remain as simple to dumb-validate as possible (so someone with barely the ability to run a simple validation check can wheel out a standard XSD tool and be done -- again tend towards simpler). The rest of the ~needed detail has then to be elsewhere in that scenario; in the CV (but that also has limits as discussed above) and the mapping file (the mortar between the bricks). The point is that although that makes work for those who really want to go for it on validation (to the point of reasoning in some sense), those developing simpler implementations will be able to keep things simple (e.g. 
person X uses a simple library to check for well-formedness and validity against the XSD, cares not-a-whole-hell-of-a-lot about the CV terms used as they know that most came direct from the instrument somehow with no user intervention, and just wants a coherent file with some metadata around the data to put in a database, which is where the CV matters most -- for retrieval). To truly go up a level on validation (excepting the halfway house of stating which terms [from a _particular_ source] can go where) is unrealistic and currently the benefits are minimal I would say (compare the effort of implementing to the benefit of the 0.1% of files in which you catch an error by that route, or the frequency of searches based on proteins/peptides, or on atomic terms (possibly AND/OR-ed), to that of searches truly exploiting the power of ontologies). Not that I'm against powerful ontology-based queries supported by systems that reason like a herd of ancient g(r)eeks; it'll truly rock when it comes and will be key to the provision of good integrated (i.e. cross-domain) resources down the line. But the time is not now -- we need OBI first. To forcibly mature the MS CV to support such functionality is a waste of effort better spent in making OBI all it can be. WHY can I not write a short email (that was rhetorical...) Cheers, Chris. ~~~~~~~~~~~~~~~~~~~~~~~~ chr...@eb... http://mibbi.sf.net/ ~~~~~~~~~~~~~~~~~~~~~~~~ ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ Psidev-ms-dev mailing list Psi...@li... https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
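For the scan-window case Brian names, a "fully realized" element definition might look roughly like this; the element and attribute names are illustrative rather than copied from the draft schema, and the MS:12345-style accession is a placeholder in the same spirit as Brian's own example:

    <xs:element name="scanWindow">
      <xs:annotation>
        <xs:appinfo>defined-by: MS:12345 (placeholder accession)</xs:appinfo>
      </xs:annotation>
      <xs:complexType>
        <xs:attribute name="lowMz"     type="xs:double" use="required"/>
        <xs:attribute name="highMz"    type="xs:double" use="required"/>
        <xs:attribute name="dwellTime" type="xs:double" use="optional"/>
      </xs:complexType>
    </xs:element>

The structural rules (what is required, what type it carries) sit in the schema where any off-the-shelf validator enforces them, while the appinfo points at the CV term that defines what the element means; remapping MS:12345 to an OBI accession later is then a one-line schema edit rather than a change to every parser.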
From: Chris T. <chr...@eb...> - 2007-10-17 09:27:09
|
Hiya. Just a few points: The CV is deliberately as simple as possible -- just the barebones -- enough to find the term you need. In part this is a pragmatic outcome from the lack of person-hours, but not completely; it is also to avoid the complications of using the more complex relationships that are available (roles, for example, the benefit of which in this setting is unclear) and some of the less standard (=weird) ones. The CV and the schema should be separable entities imho. Mostly this is to allow the use of other CVs/ontologies as they become available. If either of these products depends too much on the other the result of removing that other would be crippling; this is 'bad' bundling, basically. Because they are separate, the mapping file for the use of that particular CV with the schema is provided. This is a convenience thing for developers, basically, which they would be able to figure out for themselves given a week, and is no part of any standard. If you recall a while ago, the MGED 'ontology' (MO, which is really a CV, hence the quotes) got a good kicking in the literature for being directly structured around a model/schema (MAGE); there were many criticisms voiced there (not all valid, especially the ones about process, but nonetheless -- who critiques the critics eh). On 'other' term sources, consider OBI (the successor to MO, inter alia), which is destined ultimately to replace the CVs generated by PSI and MGED with a proper ontology supporting all sorts of nice things. The OBI dev calls, especially the instrument track, would be a _great_ place to redirect this enthusiasm to ensure that all is well. Really the PSI CVs as they stand are fillers to use while that big job gets done. Please I implore you if you really do have major issues/needs, go to a few of the OBI calls. For instruments the guy to mail is Daniel Schober at EBI (CCed on here); incidentally he also handles the needs of the metabolomics community who have heee-uge overlaps with PSI (on MS for example) and who will most likely use mzML for their MS work also (I co-chair their formats WG and have been heavily promoting PSI products to them with an eye on the cross-domain integrative thing). Ah synergy. Clearly we need the basic (and rilly rilly easy to do) syntactic validation provided by a fairly rich XML schema. But supporting the kinds of functionality discussed (which would mean the CV rapidly becoming a 'proper' ontology, which we don't have the person-hours to do right btw) is really just a nice to have at the moment. True semantic validation is just about feasible but _isn't_ practical imho. Certainly for all but the most dedicated coders it is a pipe dream. All that can realistically be hoped for at the moment is correct usage (i.e. checking in an application of some sort that the term is appropriate given its usage), for which this wattage of CV is just fine. This is what the MIers have done -- a java app uses hard-coded rules to check usage (and in that simple scenario the intelligent use of class-superclass stuff can bring benefits). But what they're not doing is something like (for MS now) I have a Voyager so why on earth do I have ion trap data -- sound the klaxon; this can only come from something of the sophistication of OBI (or a _LOT_ of bespoke coding), which is in a flavour of OWL (a cruise liner to OBO's dinghy). Finally, again on where to draw the separating line; the more detail in the schema, the more labile that schema. 
So the schema should be as stable as possible (tend towards simpler). That schema should also remain as simple to dumb-validate as possible (so someone with barely the ability to run a simple validation check can wheel out a standard XSD tool and be done -- again tend towards simpler). The rest of the ~needed detail has then to be elsewhere in that scenario; in the CV (but that also has limits as discussed above) and the mapping file (the mortar between the bricks). The point is that although that makes work for those who really want to go for it on validation (to the point of reasoning in some sense), those developing simpler implementations will be able to keep things simple (e.g. person X uses a simple library to check for well-formedness and validity against the XSD, cares not-a-whole-hell-of-a-lot about the CV terms used as they know that most came direct from the instrument somehow with no user intervention, and just wants a coherent file with some metadata around the data to put in a database, which is where the CV matters most -- for retrieval). To truly go up a level on validation (excepting the halfway house of stating which terms [from a _particular_ source] can go where) is unrealistic and currently the benefits are minimal I would say (compare the effort of implementing to the benefit of the 0.1% of files in which you catch an error by that route, or the frequency of searches based on proteins/peptides, or on atomic terms (possibly AND/OR-ed), to that of searches truly exploiting the power of ontologies). Not that I'm against powerful ontology-based queries supported by systems that reason like a herd of ancient g(r)eeks; it'll truly rock when it comes and will be key to the provision of good integrated (i.e. cross-domain) resources down the line. But the time is not now -- we need OBI first. To forcibly mature the MS CV to support such functionality is a waste of effort better spent in making OBI all it can be. WHY can I not write a short email (that was rhetorical...) Cheers, Chris. ~~~~~~~~~~~~~~~~~~~~~~~~ chr...@eb... http://mibbi.sf.net/ ~~~~~~~~~~~~~~~~~~~~~~~~ |
From: Matthew C. <mat...@va...> - 2007-10-16 20:47:23
|
Fair points, Brian. But the XSD attributes for minOccurs, maxOccurs, and required can easily be added to the relevant terms in the CV via trailing modifiers. Whatever is necessary to autogenerate the XSD can be added to the CV without over complicating it (indeed, doing so would only serve to further disambiguate it). However, I'll grant that the more XSD functionality that the CV supports, the less difference there is between autogenerating the XSD from the hand-maintained CV and hand-maintaining the XSD itself. Half a dozen of one and 6 of another... -Matt Brian Pratt wrote: > Hi Matt, > > I can only speculate on the history of PSI CV as a subset of OBO, my guess > is they just wanted to keep it simple as it was never intended to provide > the kind of granularity we need for fully automated semantic validation. > > So, I disagree on your point of CV being nearly there as an XSD replacement. > It doesn't seem to have, for example, any means of saying whether an element > or attribute is required or not, or how many times it can occur, etc etc. > That's why that whole crazy xsd-like infrastructure that the java validator > uses was built up (the ms-mapping.xml schema file is attached, for those who > don't want to dig for it), and even that I have already shown to be > inadequate. I don't want to see us follow previous groups down that rabbit > hole. > > |
From: Brian P. <bri...@in...> - 2007-10-16 20:21:57
|
Hi Matt, Matt, I can only speculate on the history of PSI CV as a subset of OBO, my guess is they just wanted to keep it simple as it was never intended to provide the kind of granularity we need for fully automated semantic validation. So, I disagree on your point of CV being nearly there as an XSD replacement. It doesn't seem to have, for example, any means of saying whether an element or attribute is required or not, or how many times it can occur, etc etc. That's why that whole crazy xsd-like infrastructure that the java validator uses was built up (the ms-mapping.xml schema file is attached, for those who don't want to dig for it), and even that I have already shown to be inadequate. I don't want to see us follow previous groups down that rabbit hole. I also think that in practice nobody is going to be all that interested in messing with the CV beyond adding the occasional machine model etc. I think a one time determination of the XSD will prove quite durable, and it's already been largely done between the existing xsd and ms-mapping.xml. You're right, for the applications I'm personally looking at right now I think the CV isn't very important. But your use case of vendor DLLs using CV to disambiguate their APIs is a perfect example of how CV can improve things. I support its development and I think mzML should play well with it. Even though the existence of a system that would actually do anything with the CV info in an mzML file is currently theoretical, it's the right direction to be heading in and it's worth caring about and doing it right. - Brian -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers Sent: Tuesday, October 16, 2007 12:20 PM To: Mass spectrometry standard development Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments Brian Pratt wrote: > (First of all, thanks to Frank for shedding more light on the topic - heat, > we have already!) > > Heat and light are just different wavelengths on the same spectrum. ;) > Matt, > > You're right about OBO not limiting itself to is_a and part_of, but it > appears that PSI has explicitly chosen to do so. I doubt we have the > political heft to change that now, or that we should want to do so. Further > contortions to turn CV into something to rival the readily available power > of XSD are misguided, in my opinion. > If what you say is true, I at least want to see some rationale of why PSI would explicitly limit their CVs to 'is_a' and 'part_of' relationships. I agree that contorting a CV to make it work as an XSD is misguided, but it's already been done to a great extent and I just want to go that little bit further to finish it. I was suggesting that we should leverage the validation power of XSD by autogenerating an XSD from a properly done (contorted!) CV, where maintaining the CV is preferable to the XSD primarily because OBO CVs are ubiquitous in the life sciences while XSDs are not (AFAIK). Also, it means only having to maintain the CV instead of maintaining both the CV and the XSD (autogenerating the CV from the XSD is conceivable, but pointless because by then you are putting new accession number straight into the XSD along with all the baggage that needs to get passed to the CV but isn't really important to the XSD). > Frankly it seems to me that the CV doesn't really need to be all that > logically consistent: in its current bogus state it doesn't seem to have > bothered anyone, including the official validator. 
PSI clearly never meant > for CV to do things like datatyping and range limiting so we should stop > pushing on that rope and just allow CV to play its proper role in > disambiguating the terms we use in the XSD, by use of accession numbers in > the XSD. > I think you say this because, as things currently are, you don't plan to care much about the CV and frankly neither do I. And there is a legitimate reason to not care about a CV if it doesn't specify enough semantics of the format to truly and unambiguously define the the terms. The data type of a term is as much a part of its definition as the English description of it! Imagine different users of the CV trying to pass around instances of terms using different data types for the different instances! I don't think that constitutes an unambiguous controlled vocabulary. :) > The thing to do now is to transfer most of the intelligence in the > ms-mapping.xml schema file (for it is indeed a schema, albeit written in a > nonstandard format) to the XSD file then add the proper datatyping and range > checking. I was happy to see that this second schema contains the work I > thought we were going to have to generate from the CV itself, although I was > also somewhat surprised to learn of the existence of such a key artifact > this late in the discussion. Or maybe I just missed it somehow. > > As I've said before we should be braver than we have been so far. The > refusal to put useful content in the XSD file simply for fear of being wrong > about it is just deplorable and doesn't serve the purposes of the community. > And I'm appalled at the disingenuousness of claiming a "stable schema" when > many key parts of the spec are in fact expressed in a schema > (ms-mapping.xml) which is explicitly unstable. > I agree wholeheartedly. We only disagree about maintaining the fully specified XSD. I think it should be autogenerated from a fixed CV and a stable template schema, whereas you think it should be hand rolled. Let me get you to clear something up though: do you want there to be a single, ever-changing schema, or would you also accept a basic stable schema (without CV-related restrictions) which can be derived from in order to create the fully specified schema with the ever-changing restrictions? In the latter case, we can have a schema that is stable but doesn't serve for anything more than syntactical validation, and also a schema that can be used for full semantic validation, and which schema that a program uses is up to the program. > The charge has been leveled on this list that (paraphrasing here) some old > dogs are resisting learning new tricks when it comes to the use of CV. > That's always something to be mindful of, but after careful consideration I > really just don't see the advantage of a CV-centric approach, when all the > added complexity and reinvention still leaves us well short of where proper > use of XSD would get us. Fully realized XSD that references CV to define > its terms seems like the obvious choice for a system that wants to gain > widespread and rapid adoption. > Speaking of learning new tricks, when will the vendors' raw file reading libraries return CV accession numbers to describe terms instead of ambiguous strings? That would be nice. But if that never happens, each conversion program has to maintain its own vendor-to-CV mapping. And if a program wants to read both vendor-proprietary formats and the XML formats, your mapping problems have become nightmares. 
-Matt > - Brian > > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On Behalf Of Matthew > Chambers > Sent: Tuesday, October 16, 2007 8:27 AM > To: Mass spectrometry standard development > Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments > > Hi Frank, I read the Guidelines you linked to and also the paper > describing the Relation Ontology (http://genomebiology.com/2005/6/5/R46) > which is referenced from the Guidelines. The Relation Ontology does not > in any way suggest that reliable OBO CVs should be limited to IS_A and > PART_OF relationships! Rather, it does a good job of defining when IS_A > and PART_OF should be used and what they really mean. I think if we > looked closely we could find quite a few cases in the CV where the use > of IS_A and PART_OF is bogus according to the Relation Ontology > definition, especially with regard to values being indistinct from > categories. > > Therefore, I take issue with the following text from the Guidelines > which has no corresponding rationale and which is currently biting us in > the arse: > > 11. Relations between RU's > As the PSI CV will be developed under the OBO umbrella [3], the > relations created between terms MUST ascribe to the definitions and > formal requirements provided in the OBO Relations Ontology (RO) paper > [7], as the relations 'is_a' and 'part_of'. > > It is not clear whether the Relation Ontology recommends or discourages > using OBO to typedef new relationship types into existence (my proposed > 'value of'), but that won't be necessary. I think we can accomplish the > same effect with the existing relationship, 'instance_of', which IS part > of the Relation Ontology. In fact, 'instance_of' is a primitive relation > in the Relation Ontology, whereas 'is_a' is not. Here is the Relation > Ontology definition for 'instance_of': > > p instance_of P - a primitive relation between a process instance and a > class which it instantiates holding independently of time > > That sounds like a pretty good way to distinguish between values > (instances) and categories (classes) to me! Further, the instance_of > relationship can be used in addition to the current part_of and is_a > relationships and it will serve to disambiguate a branch of the CV where > the actual category that a value belongs to is an ancestor instead of a > direct parent. For instance: > MS:1000173 "MAT900XP" > is a MS:1000493 "Finnigan MAT" > part of MS:1000483 "Thermo Fisher Scientific" > is a MS:1000031 "model by vendor" > part of MS:1000463 "instrument description" > part of MS:0000000 "MZ controlled vocabularies" > What category does the controlled value "MAT900XP" belong to, i.e. if we > used cvParam method B, would it look like: > <cvParam cvLabel="MS" categoryName="Finnigan MAT" > categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/> > Or would it look like: > <cvParam cvLabel="MS" categoryName="model by vendor" > categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/> > > Of course I think it should be the latter, but how would you derive that > from the CV? 
You can't, unless you add a new relationship or convention, > so I suggest: > MS:1000173 "MAT900XP" > instance of MS:1000031 "model by vendor" > is a MS:1000493 "Finnigan MAT" > part of MS:1000483 "Thermo Fisher Scientific" > is a MS:1000031 "model by vendor" > part of MS:1000463 "instrument description" > part of MS:0000000 "MZ controlled vocabularies" > It would also be good to get rid of the MS:1000483->MS:1000031 > relationship at that point because "Thermo Fisher Scientific" is NOT an > instrument model. > > I have to disagree with your assertion that OBO does not allow a CV to > model datatypes and cardinality. I think the trailing modifiers (which > may have been added since you last looked at the OBO language spec) > would serve to model those properties quite nicely. > > -Matt > ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ Psidev-ms-dev mailing list Psi...@li... https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
From: Matthew C. <mat...@va...> - 2007-10-16 19:20:14
|
Brian Pratt wrote: > (First of all, thanks to Frank for shedding more light on the topic - heat, > we have already!) > > Heat and light are just different wavelengths on the same spectrum. ;) > Matt, > > You're right about OBO not limiting itself to is_a and part_of, but it > appears that PSI has explicitly chosen to do so. I doubt we have the > political heft to change that now, or that we should want to do so. Further > contortions to turn CV into something to rival the readily available power > of XSD are misguided, in my opinion. > If what you say is true, I at least want to see some rationale of why PSI would explicitly limit their CVs to 'is_a' and 'part_of' relationships. I agree that contorting a CV to make it work as an XSD is misguided, but it's already been done to a great extent and I just want to go that little bit further to finish it. I was suggesting that we should leverage the validation power of XSD by autogenerating an XSD from a properly done (contorted!) CV, where maintaining the CV is preferable to the XSD primarily because OBO CVs are ubiquitous in the life sciences while XSDs are not (AFAIK). Also, it means only having to maintain the CV instead of maintaining both the CV and the XSD (autogenerating the CV from the XSD is conceivable, but pointless because by then you are putting new accession number straight into the XSD along with all the baggage that needs to get passed to the CV but isn't really important to the XSD). > Frankly it seems to me that the CV doesn't really need to be all that > logically consistent: in its current bogus state it doesn't seem to have > bothered anyone, including the official validator. PSI clearly never meant > for CV to do things like datatyping and range limiting so we should stop > pushing on that rope and just allow CV to play its proper role in > disambiguating the terms we use in the XSD, by use of accession numbers in > the XSD. > I think you say this because, as things currently are, you don't plan to care much about the CV and frankly neither do I. And there is a legitimate reason to not care about a CV if it doesn't specify enough semantics of the format to truly and unambiguously define the the terms. The data type of a term is as much a part of its definition as the English description of it! Imagine different users of the CV trying to pass around instances of terms using different data types for the different instances! I don't think that constitutes an unambiguous controlled vocabulary. :) > The thing to do now is to transfer most of the intelligence in the > ms-mapping.xml schema file (for it is indeed a schema, albeit written in a > nonstandard format) to the XSD file then add the proper datatyping and range > checking. I was happy to see that this second schema contains the work I > thought we were going to have to generate from the CV itself, although I was > also somewhat surprised to learn of the existence of such a key artifact > this late in the discussion. Or maybe I just missed it somehow. > > As I've said before we should be braver than we have been so far. The > refusal to put useful content in the XSD file simply for fear of being wrong > about it is just deplorable and doesn't serve the purposes of the community. > And I'm appalled at the disingenuousness of claiming a "stable schema" when > many key parts of the spec are in fact expressed in a schema > (ms-mapping.xml) which is explicitly unstable. > I agree wholeheartedly. We only disagree about maintaining the fully specified XSD. 
I think it should be autogenerated from a fixed CV and a stable template schema, whereas you think it should be hand rolled. Let me get you to clear something up though: do you want there to be a single, ever-changing schema, or would you also accept a basic stable schema (without CV-related restrictions) which can be derived from in order to create the fully specified schema with the ever-changing restrictions? In the latter case, we can have a schema that is stable but doesn't serve for anything more than syntactical validation, and also a schema that can be used for full semantic validation, and which schema that a program uses is up to the program. > The charge has been leveled on this list that (paraphrasing here) some old > dogs are resisting learning new tricks when it comes to the use of CV. > That's always something to be mindful of, but after careful consideration I > really just don't see the advantage of a CV-centric approach, when all the > added complexity and reinvention still leaves us well short of where proper > use of XSD would get us. Fully realized XSD that references CV to define > its terms seems like the obvious choice for a system that wants to gain > widespread and rapid adoption. > Speaking of learning new tricks, when will the vendors' raw file reading libraries return CV accession numbers to describe terms instead of ambiguous strings? That would be nice. But if that never happens, each conversion program has to maintain its own vendor-to-CV mapping. And if a program wants to read both vendor-proprietary formats and the XML formats, your mapping problems have become nightmares. -Matt > - Brian > > -----Original Message----- > From: psi...@li... > [mailto:psi...@li...] On Behalf Of Matthew > Chambers > Sent: Tuesday, October 16, 2007 8:27 AM > To: Mass spectrometry standard development > Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments > > Hi Frank, I read the Guidelines you linked to and also the paper > describing the Relation Ontology (http://genomebiology.com/2005/6/5/R46) > which is referenced from the Guidelines. The Relation Ontology does not > in any way suggest that reliable OBO CVs should be limited to IS_A and > PART_OF relationships! Rather, it does a good job of defining when IS_A > and PART_OF should be used and what they really mean. I think if we > looked closely we could find quite a few cases in the CV where the use > of IS_A and PART_OF is bogus according to the Relation Ontology > definition, especially with regard to values being indistinct from > categories. > > Therefore, I take issue with the following text from the Guidelines > which has no corresponding rationale and which is currently biting us in > the arse: > > 11. Relations between RU's > As the PSI CV will be developed under the OBO umbrella [3], the > relations created between terms MUST ascribe to the definitions and > formal requirements provided in the OBO Relations Ontology (RO) paper > [7], as the relations 'is_a' and 'part_of'. > > It is not clear whether the Relation Ontology recommends or discourages > using OBO to typedef new relationship types into existence (my proposed > 'value of'), but that won't be necessary. I think we can accomplish the > same effect with the existing relationship, 'instance_of', which IS part > of the Relation Ontology. In fact, 'instance_of' is a primitive relation > in the Relation Ontology, whereas 'is_a' is not. 
Here is the Relation > Ontology definition for 'instance_of': > > p instance_of P - a primitive relation between a process instance and a > class which it instantiates holding independently of time > > That sounds like a pretty good way to distinguish between values > (instances) and categories (classes) to me! Further, the instance_of > relationship can be used in addition to the current part_of and is_a > relationships and it will serve to disambiguate a branch of the CV where > the actual category that a value belongs to is an ancestor instead of a > direct parent. For instance: > MS:1000173 "MAT900XP" > is a MS:1000493 "Finnigan MAT" > part of MS:1000483 "Thermo Fisher Scientific" > is a MS:1000031 "model by vendor" > part of MS:1000463 "instrument description" > part of MS:0000000 "MZ controlled vocabularies" > What category does the controlled value "MAT900XP" belong to, i.e. if we > used cvParam method B, would it look like: > <cvParam cvLabel="MS" categoryName="Finnigan MAT" > categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/> > Or would it look like: > <cvParam cvLabel="MS" categoryName="model by vendor" > categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/> > > Of course I think it should be the latter, but how would you derive that > from the CV? You can't, unless you add a new relationship or convention, > so I suggest: > MS:1000173 "MAT900XP" > instance of MS:1000031 "model by vendor" > is a MS:1000493 "Finnigan MAT" > part of MS:1000483 "Thermo Fisher Scientific" > is a MS:1000031 "model by vendor" > part of MS:1000463 "instrument description" > part of MS:0000000 "MZ controlled vocabularies" > It would also be good to get rid of the MS:1000483->MS:1000031 > relationship at that point because "Thermo Fisher Scientific" is NOT an > instrument model. > > I have to disagree with your assertion that OBO does not allow a CV to > model datatypes and cardinality. I think the trailing modifiers (which > may have been added since you last looked at the OBO language spec) > would serve to model those properties quite nicely. > > -Matt > |
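The 'instance_of' convention proposed in the message above can be sketched in a few lines of Python. This is illustrative only: the accessions, names, and the two cvParam "method B" renderings are taken from the message itself, while the in-memory dictionary (standing in for a parsed psi-ms.obo file) and the helper names are assumptions, not part of any published mzML 0.99.0 artifact.

# Hypothetical term table standing in for a parsed psi-ms.obo file.
CV = {
    "MS:1000173": {"name": "MAT900XP",
                   "is_a": ["MS:1000493"],
                   "instance_of": ["MS:1000031"]},      # the proposed new edge
    "MS:1000493": {"name": "Finnigan MAT", "part_of": ["MS:1000483"]},
    "MS:1000483": {"name": "Thermo Fisher Scientific", "part_of": ["MS:1000463"]},
    "MS:1000031": {"name": "model by vendor", "part_of": ["MS:1000463"]},
    "MS:1000463": {"name": "instrument description", "part_of": ["MS:0000000"]},
    "MS:0000000": {"name": "MZ controlled vocabularies"},
}

def category_of(accession):
    # The category is the class named by 'instance_of' if present; with only
    # is_a/part_of there is no reliable way to tell a value from a category.
    targets = CV[accession].get("instance_of", [])
    return targets[0] if targets else None

def cv_param_method_b(accession):
    # Render the cvParam the way "method B" in the message above would.
    cat = category_of(accession)
    return ('<cvParam cvLabel="MS" categoryName="%s" categoryAccession="%s" '
            'accession="%s" name="%s"/>' %
            (CV[cat]["name"], cat, accession, CV[accession]["name"]))

print(cv_param_method_b("MS:1000173"))
# -> categoryAccession="MS:1000031" ("model by vendor"), not "Finnigan MAT"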
From: Brian P. <bri...@in...> - 2007-10-16 18:27:44
|
(First of all, thanks to Frank for shedding more light on the topic - heat, we have already!) Matt, You're right about OBO not limiting itself to is_a and part_of, but it appears that PSI has explicitly chosen to do so. I doubt we have the political heft to change that now, or that we should want to do so. Further contortions to turn CV into something to rival the readily available power of XSD are misguided, in my opinion. Frankly it seems to me that the CV doesn't really need to be all that logically consistent: in its current bogus state it doesn't seem to have bothered anyone, including the official validator. PSI clearly never meant for CV to do things like datatyping and range limiting so we should stop pushing on that rope and just allow CV to play its proper role in disambiguating the terms we use in the XSD, by use of accession numbers in the XSD. The thing to do now is to transfer most of the intelligence in the ms-mapping.xml schema file (for it is indeed a schema, albeit written in a nonstandard format) to the XSD file then add the proper datatyping and range checking. I was happy to see that this second schema contains the work I thought we were going to have to generate from the CV itself, although I was also somewhat surprised to learn of the existence of such a key artifact this late in the discussion. Or maybe I just missed it somehow. As I've said before we should be braver than we have been so far. The refusal to put useful content in the XSD file simply for fear of being wrong about it is just deplorable and doesn't serve the purposes of the community. And I'm appalled at the disingenuousness of claiming a "stable schema" when many key parts of the spec are in fact expressed in a schema (ms-mapping.xml) which is explicitly unstable. The charge has been leveled on this list that (paraphrasing here) some old dogs are resisting learning new tricks when it comes to the use of CV. That's always something to be mindful of, but after careful consideration I really just don't see the advantage of a CV-centric approach, when all the added complexity and reinvention still leaves us well short of where proper use of XSD would get us. Fully realized XSD that references CV to define its terms seems like the obvious choice for a system that wants to gain widespread and rapid adoption. - Brian -----Original Message----- From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers Sent: Tuesday, October 16, 2007 8:27 AM To: Mass spectrometry standard development Subject: Re: [Psidev-ms-dev] mzML 0.99.0 comments Hi Frank, I read the Guidelines you linked to and also the paper describing the Relation Ontology (http://genomebiology.com/2005/6/5/R46) which is referenced from the Guidelines. The Relation Ontology does not in any way suggest that reliable OBO CVs should be limited to IS_A and PART_OF relationships! Rather, it does a good job of defining when IS_A and PART_OF should be used and what they really mean. I think if we looked closely we could find quite a few cases in the CV where the use of IS_A and PART_OF is bogus according to the Relation Ontology definition, especially with regard to values being indistinct from categories. Therefore, I take issue with the following text from the Guidelines which has no corresponding rationale and which is currently biting us in the arse: 11. 
Relations between RU's As the PSI CV will be developed under the OBO umbrella [3], the relations created between terms MUST ascribe to the definitions and formal requirements provided in the OBO Relations Ontology (RO) paper [7], as the relations 'is_a' and 'part_of'. It is not clear whether the Relation Ontology recommends or discourages using OBO to typedef new relationship types into existence (my proposed 'value of'), but that won't be necessary. I think we can accomplish the same effect with the existing relationship, 'instance_of', which IS part of the Relation Ontology. In fact, 'instance_of' is a primitive relation in the Relation Ontology, whereas 'is_a' is not. Here is the Relation Ontology definition for 'instance_of': p instance_of P - a primitive relation between a process instance and a class which it instantiates holding independently of time That sounds like a pretty good way to distinguish between values (instances) and categories (classes) to me! Further, the instance_of relationship can be used in addition to the current part_of and is_a relationships and it will serve to disambiguate a branch of the CV where the actual category that a value belongs to is an ancestor instead of a direct parent. For instance: MS:1000173 "MAT900XP" is a MS:1000493 "Finnigan MAT" part of MS:1000483 "Thermo Fisher Scientific" is a MS:1000031 "model by vendor" part of MS:1000463 "instrument description" part of MS:0000000 "MZ controlled vocabularies" What category does the controlled value "MAT900XP" belong to, i.e. if we used cvParam method B, would it look like: <cvParam cvLabel="MS" categoryName="Finnigan MAT" categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/> Or would it look like: <cvParam cvLabel="MS" categoryName="model by vendor" categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/> Of course I think it should be the latter, but how would you derive that from the CV? You can't, unless you add a new relationship or convention, so I suggest: MS:1000173 "MAT900XP" instance of MS:1000031 "model by vendor" is a MS:1000493 "Finnigan MAT" part of MS:1000483 "Thermo Fisher Scientific" is a MS:1000031 "model by vendor" part of MS:1000463 "instrument description" part of MS:0000000 "MZ controlled vocabularies" It would also be good to get rid of the MS:1000483->MS:1000031 relationship at that point because "Thermo Fisher Scientific" is NOT an instrument model. I have to disagree with your assertion that OBO does not allow a CV to model datatypes and cardinality. I think the trailing modifiers (which may have been added since you last looked at the OBO language spec) would serve to model those properties quite nicely. -Matt frank gibson wrote: > Hi > > I have been following this discussion and there seems to be some > confusion about the CV, its use, and development. Using the OBO > language this allows you to record "words" or strings. It does not > allow you to model what the words represent such as restrictions, > cardinality or datatypes for values (such as int, double and xml > datatypes). This is a limitation of the chosen language. > > The PSI have developed "Guidelines for the development of Controlled > Vocabularies" which is a final document and describes the > recommendation's and best practice in designing CVs for the PSI. It > includes and described several issues which have been raised on this > list such as what the relationships of is_a and part_of semanticaly > mean. 
In addition it includes how to normalise the natural language > definitions for each RA, the maintainance procedures, obselecsing tems > and the process for term addition. > > The Final document can be found at the following URL > http://psidev.info/index.php?q=node/258 > > > I hope these comments and the information contained within this > document is helpful in the development of the MS CV > > Cheers > > Frank ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ Psidev-ms-dev mailing list Psi...@li... https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev |
From: Matthew C. <mat...@va...> - 2007-10-16 15:35:30
|
Oops. It killed my spaces. Let me try again. That sounds like a pretty good way to distinguish between values (instances) and categories (classes) to me! Further, the instance_of relationship can be used in addition to the current part_of and is_a relationships and it will serve to disambiguate a branch of the CV where the actual category that a value belongs to is an ancestor instead of a direct parent. For instance: MS:1000173 "MAT900XP" --is a MS:1000493 "Finnigan MAT" ----part of MS:1000483 "Thermo Fisher Scientific" ------is a MS:1000031 "model by vendor" --------part of MS:1000463 "instrument description" ----------part of MS:0000000 "MZ controlled vocabularies" What category does the controlled value "MAT900XP" belong to, i.e. if we used cvParam method B, would it look like: <cvParam cvLabel="MS" categoryName="Finnigan MAT" categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/> Or would it look like: <cvParam cvLabel="MS" categoryName="model by vendor" categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/> Of course I think it should be the latter, but how would you derive that from the CV? You can't, unless you add a new relationship or convention, so I suggest: MS:1000173 "MAT900XP" --instance of MS:1000031 "model by vendor" --is a MS:1000493 "Finnigan MAT" ----part of MS:1000483 "Thermo Fisher Scientific" ------is a MS:1000031 "model by vendor" --------part of MS:1000463 "instrument description" ----------part of MS:0000000 "MZ controlled vocabularies" It would also be good to get rid of the MS:1000483->MS:1000031 relationship at that point because "Thermo Fisher Scientific" is NOT an instrument model. -Matt Matthew Chambers wrote: > That sounds like a pretty good way to distinguish between values > (instances) and categories (classes) to me! Further, the instance_of > relationship can be used in addition to the current part_of and is_a > relationships and it will serve to disambiguate a branch of the CV > where the actual category that a value belongs to is an ancestor > instead of a direct parent. For instance: > MS:1000173 "MAT900XP" > is a MS:1000493 "Finnigan MAT" > part of MS:1000483 "Thermo Fisher Scientific" > is a MS:1000031 "model by vendor" > part of MS:1000463 "instrument description" > part of MS:0000000 "MZ controlled vocabularies" > What category does the controlled value "MAT900XP" belong to, i.e. if we > used cvParam method B, would it look like: > <cvParam cvLabel="MS" categoryName="Finnigan MAT" > categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/> > Or would it look like: > <cvParam cvLabel="MS" categoryName="model by vendor" > categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/> |
From: Matthew C. <mat...@va...> - 2007-10-16 15:26:50
|
Hi Frank, I read the Guidelines you linked to and also the paper describing the Relation Ontology (http://genomebiology.com/2005/6/5/R46) which is referenced from the Guidelines. The Relation Ontology does not in any way suggest that reliable OBO CVs should be limited to IS_A and PART_OF relationships! Rather, it does a good job of defining when IS_A and PART_OF should be used and what they really mean. I think if we looked closely we could find quite a few cases in the CV where the use of IS_A and PART_OF is bogus according to the Relation Ontology definition, especially with regard to values being indistinct from categories. Therefore, I take issue with the following text from the Guidelines which has no corresponding rationale and which is currently biting us in the arse: 11. Relations between RU’s As the PSI CV will be developed under the OBO umbrella [3], the relations created between terms MUST ascribe to the definitions and formal requirements provided in the OBO Relations Ontology (RO) paper [7], as the relations ‘is_a’ and ‘part_of’. It is not clear whether the Relation Ontology recommends or discourages using OBO to typedef new relationship types into existence (my proposed 'value of'), but that won't be necessary. I think we can accomplish the same effect with the existing relationship, 'instance_of', which IS part of the Relation Ontology. In fact, 'instance_of' is a primitive relation in the Relation Ontology, whereas 'is_a' is not. Here is the Relation Ontology definition for 'instance_of': p instance_of P - a primitive relation between a process instance and a class which it instantiates holding independently of time That sounds like a pretty good way to distinguish between values (instances) and categories (classes) to me! Further, the instance_of relationship can be used in addition to the current part_of and is_a relationships and it will serve to disambiguate a branch of the CV where the actual category that a value belongs to is an ancestor instead of a direct parent. For instance: MS:1000173 "MAT900XP" is a MS:1000493 "Finnigan MAT" part of MS:1000483 "Thermo Fisher Scientific" is a MS:1000031 "model by vendor" part of MS:1000463 "instrument description" part of MS:0000000 "MZ controlled vocabularies" What category does the controlled value "MAT900XP" belong to, i.e. if we used cvParam method B, would it look like: <cvParam cvLabel="MS" categoryName="Finnigan MAT" categoryAccession="MS:1000493" accession="MS:1000173" name="MAT900XP"/> Or would it look like: <cvParam cvLabel="MS" categoryName="model by vendor" categoryAccession="MS:1000031" accession="MS:1000173" name="MAT900XP"/> Of course I think it should be the latter, but how would you derive that from the CV? You can't, unless you add a new relationship or convention, so I suggest: MS:1000173 "MAT900XP" instance of MS:1000031 "model by vendor" is a MS:1000493 "Finnigan MAT" part of MS:1000483 "Thermo Fisher Scientific" is a MS:1000031 "model by vendor" part of MS:1000463 "instrument description" part of MS:0000000 "MZ controlled vocabularies" It would also be good to get rid of the MS:1000483->MS:1000031 relationship at that point because "Thermo Fisher Scientific" is NOT an instrument model. I have to disagree with your assertion that OBO does not allow a CV to model datatypes and cardinality. I think the trailing modifiers (which may have been added since you last looked at the OBO language spec) would serve to model those properties quite nicely. 
-Matt frank gibson wrote: > Hi > > I have been following this discussion and there seems to be some > confusion about the CV, its use, and development. Using the OBO > language this allows you to record "words" or strings. It does not > allow you to model what the words represent such as restrictions, > cardinality or datatypes for values (such as int, double and xml > datatypes). This is a limitation of the chosen language. > > The PSI have developed "Guidelines for the development of Controlled > Vocabularies" which is a final document and describes the > recommendation's and best practice in designing CVs for the PSI. It > includes and described several issues which have been raised on this > list such as what the relationships of is_a and part_of semanticaly > mean. In addition it includes how to normalise the natural language > definitions for each RA, the maintainance procedures, obselecsing tems > and the process for term addition. > > The Final document can be found at the following URL > http://psidev.info/index.php?q=node/258 > > > I hope these comments and the information contained within this > document is helpful in the development of the MS CV > > Cheers > > Frank |
From: frank g. <Fra...@nc...> - 2007-10-16 09:05:36
|
Hi I have been following this discussion and there seems to be some confusion about the CV, its use, and development. Using the OBO language allows you to record "words" or strings. It does not allow you to model what the words represent such as restrictions, cardinality or datatypes for values (such as int, double and xml datatypes). This is a limitation of the chosen language. The PSI have developed "Guidelines for the development of Controlled Vocabularies" which is a final document and describes the recommendations and best practice in designing CVs for the PSI. It includes and describes several issues which have been raised on this list such as what the relationships of is_a and part_of semantically mean. In addition it includes how to normalise the natural language definitions for each RA, the maintenance procedures, obsolescing terms and the process for term addition. The Final document can be found at the following URL http://psidev.info/index.php?q=node/258 I hope these comments and the information contained within this document are helpful in the development of the MS CV Cheers Frank On 10/15/07, Fredrik Levander <Fre...@im...> wrote: > > Hi, > > My comments on mzML0.99.0 after reading (most of) the posts on the > mailing list and trying to convert a peak list into the format are as > follows: > > The standard is composed of a schema with little control and a lot of > cvParams that are controlled by a separate file. Updates to the CV do > not require schema updates, and the CV rules file should also be stable. > For the validation of files it would, as pointed out by several people, > be straightforward to automatically generate an XSD which reflects the current > CV. Otherwise the semantic Java validator also does the job (and also > has other benefits when it comes to large files). For us it doesn't > matter which method is used, but the real issue is how to handle > versions of the CV. As long as nothing is deleted from the CV everything > should be fine from an implementation point of view though. > > A major problem would be if something is added to the CV which breaks > current parsers. A new compression type could be added to the CV without > notice, and if someone is using that compression type they're producing > standard compliant files, but parsers that are supposed to be standard > compliant would not be able to parse the file correctly. So, there are a > few places where I think the allowed values should be set under enum > constraints in the main standard schema, so that a new schema version is > enforced if these fields are changed. I have the feeling that CV version > will not be as controlled as the schema version. Fields that I propose > should be enums are (this is maybe one step back again...): > > In binaryDataArray: > > compressionType (no compression/zlib compression) > valueType (32-bit float, 64-bit float, 16-bit integer, 32-bit integer or > 64-bit integer) > > In spectrum: > > spectrumType (centroid, profile). > > these parameters could be attributes or cvParams (but under schema > control) if CV accession numbers are important. > > > Other comments: > > There is also an acquisitionList spectrumType attribute which probably > could be removed since we have spectrumDescription - > spectrumRepresentation (spectrumType). Only use would be if the > acquisitions were in profile mode but the peak picking algorithm that > worked on the spectra turned them into a centroid peak list and one > would like to specify this (?). 
> > If the spectrum is a combination of multiple scans (as specified using > acquistionList) one would normally not use the 'scan' element. The > question is then how to give the retention time? We did not succeed in > doing this in a valid way, see > > http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/FF_070504_MSMS_5B.mzML > > for a simple (but invalid way of doing it). More correct would be to put > the cvParam under the acquisition with the retention time, but this is > not allowed either. > > Why not allow softwareParam to be userParam or cvParam or must all > software that work on mzML be in the CV? > > How about having precursor m/z, intensity and charge state as > non-required attributes to ionSelection? These fields are really used in > every file. > > Final comment is though that all these things are really minor, and that > getting the standard released is what matters! > > Regards > > Fredrik > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... > https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev > -- Frank Gibson Research Associate Room 2.19, Devonshire Building School of Computing Science, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU United Kingdom Telephone: +44-191-246-4933 Fax: +44-191-246-4905 |
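Fredrik's point about compressionType and valueType can be illustrated by what a writer has to do when filling a binaryDataArray. The following Python fragment is a sketch under assumptions: the value-type and compression labels mirror the enumerations he proposes, and little-endian IEEE byte order with base64 text encoding is assumed here rather than quoted from the 0.99.0 specification.

import base64, struct, zlib

def encode_binary_array(values, value_type="64-bit float", compression="no compression"):
    # Sketch of producing the text content of an mzML <binary> element for the
    # valueType/compressionType combinations listed above. Byte order and the
    # exact term strings are assumptions for illustration.
    fmt = {"32-bit float": "f", "64-bit float": "d",
           "16-bit integer": "h", "32-bit integer": "i",
           "64-bit integer": "q"}[value_type]
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    if compression == "zlib compression":
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")

# A reader that does not recognise a newly added compression term would fail in
# exactly the way described above, hence the argument for enum constraints in
# the schema rather than CV-only control of these fields.
print(encode_binary_array([400.0, 401.2, 402.5], "64-bit float", "no compression"))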
From: Matt C. <mat...@va...> - 2007-10-16 01:49:49
|
Eric Deutsch wrote: > 6) Regarding your Observation Two: It is true that the standard relies on the maintenance of three artifacts: xsd, cv, mapping-ms.xml (not the doc as you had inferred; the doc is essentially autogenerated from the former) (and behind the scenes, the example instance documents also need to be maintained). This translates to the desired-stable schema, the evolving controlled vocabulary, and the evolving ruleset on how you may use the CV within the xsd. This is where we are led by the requirement that the schema be stable with provisions for flexibility in annotating many kinds of mass spec data.

From my perspective, it should be possible to hand-maintain only the CV and a templated schema which gets fleshed out by an autogenerator when the CV changes. The mapping file seems like a hack to compensate for two missing features in the CV: 1) the ability to distinguish between values and categories, and 2) to specify the types and ranges of the uncontrolled values. I think it's less contrived to extend the capabilities of the CV format that we use so that it has those features, or alternately, set conventions in the CV (like how to interpret IS_A and PART_OF relationships) which provide the illusion of those features in a well-defined way.

I just looked up the OBO format in more detail and it seems to me that we can very legitimately use convention to solve our CV problems:

1) Distinguish between value/category terms: We can use [Typedef] stanzas to define a new relationship type, "value_of", like:

[Typedef]
id: value_of
name: value_of
range: OBO:TERM_OR_TYPE ! there should be some way to say "not a term which has a value_of relationship" but I don't know how and it's not really necessary
domain: OBO:TERM_OR_TYPE
def: Indicates that the subject term is a controlled value of the object term (which implies the object is a category)

2) Add types, min and max properties: There are several ways to specify data type and min and max ranges, and OBO is even aware of XSD types in some contexts. But because I'm not exactly sure what those contexts are, it would be just as easy to add type, min, and max properties to our category terms (which don't have any value_of relationships pointing to them) as "trailing modifiers", like:

[Term]
id: MS:1000016
name: scan time {type="decimal", min="0"} ! a missing min or max implies no limit (or the limit defined by the type); "xsd:" prefix is implied for the type
def: "The time taken for an acquisition by scanning analyzers." [PSI:MS]
is_a: MS:1000503 ! scan attribute

With such a CV, combined with a stable, templated schema, we can auto-generate a full-fledged semantic validating schema and avoid using any non-standard approaches.

-Matt |
From: Brian P. <bri...@in...> - 2007-10-16 00:54:59
|
Hi Eric, Sorry if I missed anything obvious on the open source nature of the code. Glad to hear it, obviously! It allows me to answer a lot of questions for myself. The existence of the mapping-ms.xml file was lost on me before now, sorry. I see where it gets us a good deal of the way to where pure xsd would, but not actually all the way. For example, the validator accepts the addition of a dwell time to a selectionWindow: <cvParam cvLabel="MS" accession="MS:1000502" name="dwell time" value="1800.000000"/> although I think it's probably nonsensical since it lacks units etc. The validator also happily accepts two copies of that line, in place of the 1000500 and 1000501 lines - all it cares about is seeing two cvParams of the proper inheritance type. The semantic constraints which can be expressed by the combination of the CV and mappings-ms.xml files with the custom java validation code are pretty crude compared to the capabilities of perfectly standard and language independent XSD. This all seems terribly convoluted, approximate, and error prone... such are the wages of reinventing the wheel. Brian _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Eric Deutsch Sent: Monday, October 15, 2007 4:37 PM To: Mass spectrometry standard development Subject: Re: [Psidev-ms-dev] mzML validator experiences Hi Brian, thank you for your continued input and effort. I'm sorry I've been slow to respond on many of your posts, I have a bunch of other pots boiling over here. However, I think I can answer your questions here and promote further testing. 1) Regarding 2min.mzML, we'll fix it, thanks. 2) Regarding how does the validator know that MS:1000528 is invalid, please download: http://tools.proteomecenter.org/software/mzMLKit/mzML_0.99.0_large.zip (this is hyperlinked from the main development page http://www.psidev.info/index.php?q=node/257) In it, you will find the semantic validator software. One of the files in the distro is ms-mapping.xml. It is this file that encodes these rules and is what is used by the semantic validator. This file should be more prominently posted and will be. 3) The semantic validator is FOSS, please see the PSI SVN repository and contribute! https://psidev.svn.sourceforge.net/svnroot/psidev/psi/mzml/ (this is hyperlinked from the main development page http://www.psidev.info/index.php?q=node/257) 4) So, it turns out that the semantic validator is using an XML file to enforce the semantic rules, it is NOT reading the doc. It should be noted that this software and the mapping mechanism was developed originally for the PSI molecular interactions schema. That format uses the same built-in flexibility with semantic validation. We are borrowing that mechanism and software for mzML. 5) Further, in the doc, the cvParams section for each element is meant to represent "Some examples of allowed cvParams (not necessarily complete)". I will clarify that in the doc. Further, one of the things I realized that we need to do, is include in the doc the rules set forth in the ms-mapping.xml file. These rules are NOT currently in the doc, but they should be and will be. The doc is actually autogenerated from the other files, so I just need to include some code that parses this ms-mapping file and includes that information in the doc. This will be done for 0.99.1. Thanks! 
6) Regarding your Observation Two: It is true that the standard relies on the maintenance of three artifacts: xsd, cv, mapping-ms.xml (not the doc as you had inferred; the doc is essentially autogenerated from the former) (and behind the scenes, the example instance documents also need to be maintained). This translates to the desired-stable schema, the evolving controlled vocabulary, and the evolving ruleset on how you may use the CV within the xsd. This is where we are led by the requirement that the schema be stable with provisions for flexibility in annotating many kinds of mass spec data. Thanks! Eric _____ From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt Sent: Monday, October 15, 2007 3:19 PM To: 'Mass spectrometry standard development' Subject: [Psidev-ms-dev] mzML validator experiences Hello All, I decided to fool around with the validator at http://eddie.thep.lu.se/prodac_validator/validator.pl to see how well that can be done in the presence of an inadequately specified file format. My plan was to take a valid file, mess with it, and see if the validator would notice. A little hiccup at first - I gave it the automatically generated file http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instance File/2min.mzML - it doesn't actually validate, claiming a missing index element. Somebody might want to check that out. Then I gave it the handrolled http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instance File/tiny4_LTQ-FT.mzML0.99.0.mzML - this validates fine. So, let the mayhem begin. I tried removing the selectionWindow element surrounding the cvParams declaring the upper and lower bounds of the selection window, but the validator is XSD aware so it caught that easily. Then I tried changing the accession numbers in the selection window for others that might be honestly conceptually mistaken by an incautious output module author: accession="MS:1000501" name="scan m/z lower limit" changed to accession="MS:1000528" name="lowest m/z value" the validator caught this as well, flagging the use of accession numbers that were incorrect for that context. But the knowledge behind this doesn't seem to come from the XSD or the CV file. So, how does the validator know? Observation one: the validator doesn't appear to be open source (or if it is, a prominent link to the source should be provided). The use of a closed source tool like this in a standards effort isn't a good idea, since it's hard to answer questions like the one above. Apparently the author of the validator made excellent use of the documentation at http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/document /mzML0.99.0_specificationDocument.doc which stipulates in English that the only valid cvParams in that context are: <cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/> <cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/> Ignore for the moment that this appears to be an example rather than a spec. Do note though that there's nothing to say that one of each has to be present. Of course a reasonable human would probably infer this, but words like "reasonable human" and "infer" are not really what you want to hear when discussing a machine readable data format standard. 
Observation two: I'm not at all keen on the idea of a data format that relies on the understanding and simultaneous maintenance of three different artifacts (xsd, cv, doc), one of which (.doc) is not really machine readable. I think (but I can't be 100% sure without seeing the code) that the author has done a very good job under the circumstances, but probably had a harder time then was necessary given the bizarre construction of the spec. He or she probably would have appreciated more xsd content to do the heavy lifting, and certainly had to make a few fairly safe guesses along the way like the "must have one of each of MS:1000501 and MS:1000500 " thing. - Brian |
From: Eric D. <ede...@sy...> - 2007-10-15 23:37:04
|
Hi Brian, thank you for your continued input and effort. I'm sorry I've been slow to respond to many of your posts; I have a bunch of other pots boiling over here. However, I think I can answer your questions here and promote further testing.

1) Regarding 2min.mzML, we'll fix it, thanks.

2) Regarding how the validator knows that MS:1000528 is invalid, please download: http://tools.proteomecenter.org/software/mzMLKit/mzML_0.99.0_large.zip (this is hyperlinked from the main development page http://www.psidev.info/index.php?q=node/257). In it, you will find the semantic validator software. One of the files in the distro is ms-mapping.xml. It is this file that encodes these rules and is what is used by the semantic validator. This file should be more prominently posted and will be.

3) The semantic validator is FOSS; please see the PSI SVN repository and contribute! https://psidev.svn.sourceforge.net/svnroot/psidev/psi/mzml/ (this is hyperlinked from the main development page http://www.psidev.info/index.php?q=node/257)

4) So, it turns out that the semantic validator is using an XML file to enforce the semantic rules; it is NOT reading the doc. It should be noted that this software and the mapping mechanism were developed originally for the PSI molecular interactions schema. That format uses the same built-in flexibility with semantic validation. We are borrowing that mechanism and software for mzML.

5) Further, in the doc, the cvParams section for each element is meant to represent "Some examples of allowed cvParams (not necessarily complete)". I will clarify that in the doc. Also, one of the things I realized we need to do is include in the doc the rules set forth in the ms-mapping.xml file. These rules are NOT currently in the doc, but they should be and will be. The doc is actually autogenerated from the other files, so I just need to include some code that parses this ms-mapping file and includes that information in the doc. This will be done for 0.99.1. Thanks!

6) Regarding your Observation Two: it is true that the standard relies on the maintenance of three artifacts: xsd, cv, and ms-mapping.xml (not the doc, as you had inferred; the doc is essentially autogenerated from the former), and behind the scenes the example instance documents also need to be maintained. This translates to the desired-stable schema, the evolving controlled vocabulary, and the evolving ruleset on how you may use the CV within the xsd. This is where we are led by the requirement that the schema be stable, with provisions for flexibility in annotating many kinds of mass spec data.

Thanks!
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt
Sent: Monday, October 15, 2007 3:19 PM
To: 'Mass spectrometry standard development'
Subject: [Psidev-ms-dev] mzML validator experiences

Hello All,

I decided to fool around with the validator at http://eddie.thep.lu.se/prodac_validator/validator.pl to see how well that can be done in the presence of an inadequately specified file format. My plan was to take a valid file, mess with it, and see if the validator would notice.

A little hiccup at first - I gave it the automatically generated file http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/2min.mzML - it doesn't actually validate, claiming a missing index element. Somebody might want to check that out.

Then I gave it the handrolled http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/tiny4_LTQ-FT.mzML0.99.0.mzML - this validates fine. So, let the mayhem begin.

I tried removing the selectionWindow element surrounding the cvParams declaring the upper and lower bounds of the selection window, but the validator is XSD-aware, so it caught that easily.

Then I tried changing the accession numbers in the selection window for others that might be honestly conceptually mistaken by an incautious output module author: accession="MS:1000501" name="scan m/z lower limit" changed to accession="MS:1000528" name="lowest m/z value". The validator caught this as well, flagging the use of accession numbers that were incorrect for that context. But the knowledge behind this doesn't seem to come from the XSD or the CV file. So, how does the validator know?

Observation one: the validator doesn't appear to be open source (or if it is, a prominent link to the source should be provided). The use of a closed-source tool like this in a standards effort isn't a good idea, since it's hard to answer questions like the one above.

Apparently the author of the validator made excellent use of the documentation at http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/document/mzML0.99.0_specificationDocument.doc which stipulates in English that the only valid cvParams in that context are:
<cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/>
<cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/>
Ignore for the moment that this appears to be an example rather than a spec. Do note, though, that there's nothing to say that one of each has to be present. Of course a reasonable human would probably infer this, but words like "reasonable human" and "infer" are not really what you want to hear when discussing a machine-readable data format standard.

Observation two: I'm not at all keen on the idea of a data format that relies on the understanding and simultaneous maintenance of three different artifacts (xsd, cv, doc), one of which (.doc) is not really machine readable. I think (but I can't be 100% sure without seeing the code) that the author has done a very good job under the circumstances, but probably had a harder time than was necessary given the bizarre construction of the spec. He or she probably would have appreciated more xsd content to do the heavy lifting, and certainly had to make a few fairly safe guesses along the way, like the "must have one of each of MS:1000501 and MS:1000500" thing.

- Brian |
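To make the mapping-file mechanism concrete, here is a minimal sketch, in Python, of the kind of context-specific rule a file like ms-mapping.xml encodes. It is not the actual validator code and does not use the real ms-mapping.xml syntax; the rule table, the function names, and the assumption that a selectionWindow requires exactly one MS:1000501 and one MS:1000500 cvParam are illustrative assumptions drawn from this thread.

    # Illustrative sketch only -- not the real validator and not the
    # ms-mapping.xml syntax; the rule table below is a made-up stand-in
    # for the kind of context-specific constraint that file encodes.
    import xml.etree.ElementTree as ET

    # Assumed rule (inferred from this thread): a selectionWindow must
    # contain exactly one of each of these cvParam accessions.
    SELECTION_WINDOW_RULE = {
        "MS:1000501": 1,  # scan m/z lower limit
        "MS:1000500": 1,  # scan m/z upper limit
    }

    def local_name(tag):
        # Drop any XML namespace, e.g. '{http://...}cvParam' -> 'cvParam'.
        return tag.rsplit("}", 1)[-1]

    def check_selection_window(window):
        """Return a list of rule violations for one selectionWindow element."""
        counts = {}
        for child in window:
            if local_name(child.tag) == "cvParam":
                acc = child.get("accession")
                counts[acc] = counts.get(acc, 0) + 1
        problems = []
        for acc, expected in SELECTION_WINDOW_RULE.items():
            found = counts.get(acc, 0)
            if found != expected:
                problems.append("expected %d cvParam(s) %s, found %d" % (expected, acc, found))
        for acc in counts:
            if acc not in SELECTION_WINDOW_RULE:
                problems.append("accession %s is not allowed inside selectionWindow" % acc)
        return problems

    # Example: flags the MS:1000528 substitution discussed in the thread.
    snippet = """<selectionWindow>
      <cvParam cvLabel="MS" accession="MS:1000528" name="lowest m/z value" value="400.0"/>
      <cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.0"/>
    </selectionWindow>"""
    print(check_selection_window(ET.fromstring(snippet)))

Keeping rules of this kind in a separate, machine-readable mapping file is what allows the XSD to stay stable while the controlled vocabulary and its usage rules continue to evolve.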
From: Brian P. <bri...@in...> - 2007-10-15 22:20:10
|
Hello All,

I decided to fool around with the validator at http://eddie.thep.lu.se/prodac_validator/validator.pl to see how well that can be done in the presence of an inadequately specified file format. My plan was to take a valid file, mess with it, and see if the validator would notice.

A little hiccup at first - I gave it the automatically generated file http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/2min.mzML - it doesn't actually validate, claiming a missing index element. Somebody might want to check that out.

Then I gave it the handrolled http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/instanceFile/tiny4_LTQ-FT.mzML0.99.0.mzML - this validates fine. So, let the mayhem begin.

I tried removing the selectionWindow element surrounding the cvParams declaring the upper and lower bounds of the selection window, but the validator is XSD-aware, so it caught that easily.

Then I tried changing the accession numbers in the selection window for others that might be honestly conceptually mistaken by an incautious output module author: accession="MS:1000501" name="scan m/z lower limit" changed to accession="MS:1000528" name="lowest m/z value". The validator caught this as well, flagging the use of accession numbers that were incorrect for that context. But the knowledge behind this doesn't seem to come from the XSD or the CV file. So, how does the validator know?

Observation one: the validator doesn't appear to be open source (or if it is, a prominent link to the source should be provided). The use of a closed-source tool like this in a standards effort isn't a good idea, since it's hard to answer questions like the one above.

Apparently the author of the validator made excellent use of the documentation at http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/document/mzML0.99.0_specificationDocument.doc which stipulates in English that the only valid cvParams in that context are:
<cvParam cvLabel="MS" accession="MS:1000501" name="scan m/z lower limit" value="400.000000"/>
<cvParam cvLabel="MS" accession="MS:1000500" name="scan m/z upper limit" value="1800.000000"/>
Ignore for the moment that this appears to be an example rather than a spec. Do note, though, that there's nothing to say that one of each has to be present. Of course a reasonable human would probably infer this, but words like "reasonable human" and "infer" are not really what you want to hear when discussing a machine-readable data format standard.

Observation two: I'm not at all keen on the idea of a data format that relies on the understanding and simultaneous maintenance of three different artifacts (xsd, cv, doc), one of which (.doc) is not really machine readable. I think (but I can't be 100% sure without seeing the code) that the author has done a very good job under the circumstances, but probably had a harder time than was necessary given the bizarre construction of the spec. He or she probably would have appreciated more xsd content to do the heavy lifting, and certainly had to make a few fairly safe guesses along the way, like the "must have one of each of MS:1000501 and MS:1000500" thing.

- Brian |
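For anyone who wants to repeat the mutation test described above, the following is a minimal sketch that performs the same accession swap on a local copy of the example file. The file names and the plain-text substitution approach are assumptions; the original edit was presumably done by hand in a text editor.

    # A minimal sketch of the accession swap described above, done as a
    # plain-text substitution on a local copy of the example file.
    # The file names are assumptions; adjust the paths as needed.
    SRC = "tiny4_LTQ-FT.mzML0.99.0.mzML"
    DST = "tiny4_mutated.mzML"

    with open(SRC, encoding="utf-8") as handle:
        text = handle.read()

    # Swap 'scan m/z lower limit' (MS:1000501) for 'lowest m/z value'
    # (MS:1000528), the conceptual slip described in the message.
    mutated = text.replace(
        'accession="MS:1000501" name="scan m/z lower limit"',
        'accession="MS:1000528" name="lowest m/z value"',
    )

    with open(DST, "w", encoding="utf-8") as handle:
        handle.write(mutated)

    print("wrote", DST, "- submit it to the validator to see whether the semantic check fires")

If the validator behaves as described in the thread, the mutated file will still pass the XSD check but should be flagged by the semantic (CV mapping) check for using an accession that is out of context.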
From: Matthew C. <mat...@va...> - 2007-10-15 19:21:14
|
With the exception of the CV label (PSI in mzData vs. MS in mzML), will the CV accession IDs be the same between the two formats? -Matt |