From: Fredrik L. <Fre...@im...> - 2007-10-15 12:29:19
Hi,

My comments on mzML0.99.0 after reading (most of) the posts on the mailing list and trying to convert a peak list into the format are as follows:

The standard is composed of a schema with little control and a lot of cvParams that are controlled by a separate file. Updates to the CV do not require schema updates, and the CV rules file should also be stable. For the validation of files it would, as pointed out by several people, be straightforward to automatically generate an XSD which reflects the current CV. Otherwise the semantic Java validator also does the job (and also has other benefits when it comes to large files). For us it doesn't matter which method is used, but the real issue is how to handle versions of the CV. As long as nothing is deleted from the CV everything should be fine from an implementation point of view though.

A major problem would be if something is added to the CV which breaks current parsers. A new compression type could be added to the CV without notice, and if someone is using that compression type they're producing standard compliant files, but parsers that are supposed to be standard compliant would not be able to parse the file correctly.

So, there are a few places where I think the allowed values should be set under enum constraints in the main standard schema, so that a new schema version is enforced if these fields are changed. I have the feeling that the CV version will not be as controlled as the schema version. Fields that I propose should be enums are (this is maybe one step back again...):

In binaryDataArray:
compressionType (no compression/zlib compression)
valueType (32-bit float, 64-bit float, 16-bit integer, 32-bit integer or 64-bit integer)

In spectrum:
spectrumType (centroid, profile).

These parameters could be attributes or cvParams (but under schema control) if CV accession numbers are important.
Other comments:

There is also an acquisitionList spectrumType attribute which probably could be removed since we have spectrumDescription - spectrumRepresentation (spectrumType). The only use would be if the acquisitions were in profile mode but the peak picking algorithm that worked on the spectra turned them into a centroid peak list and one would like to specify this (?).

If the spectrum is a combination of multiple scans (as specified using acquisitionList) one would normally not use the 'scan' element. The question is then how to give the retention time? We did not succeed in doing this in a valid way, see http://trac.thep.lu.se/trac/fp6-prodac/browser/trunk/mzML/FF_070504_MSMS_5B.mzML for a simple (but invalid) way of doing it. More correct would be to put the cvParam under the acquisition with the retention time, but this is not allowed either.

Why not allow softwareParam to be userParam or cvParam, or must all software that works on mzML be in the CV?

How about having precursor m/z, intensity and charge state as non-required attributes to ionSelection? These fields are really used in every file.

Final comment is though that all these things are really minor, and that getting the standard released is what matters!

Regards
Fredrik
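The compression-type concern above can be illustrated from the parser's side. This is only a minimal sketch, assuming C++17: the function name `parseCompressionType` and the enum `CompressionType` are hypothetical, and the two accepted strings are simply the values listed in the proposal, not identifiers taken from the mzML schema or the CV. The point is that a closed set lets a reader reject a value it does not know instead of misreading the binary data.

```cpp
#include <optional>
#include <string>

// Closed set of compression types, mirroring the enum constraint proposed
// above (hypothetical names, for illustration only).
enum class CompressionType { None, Zlib };

// Map a compressionType value to the closed enum. An empty optional means
// "unknown to this parser version", so the caller can refuse the file
// rather than decode the binary array incorrectly.
std::optional<CompressionType> parseCompressionType(const std::string& value) {
    if (value == "no compression") {
        return CompressionType::None;
    }
    if (value == "zlib compression") {
        return CompressionType::Zlib;
    }
    return std::nullopt;
}
```

With the values under schema control, a new compression type would force a new schema version, and the `std::nullopt` branch would only be reached for files written against a schema newer than the parser.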
From: Brian P. <bri...@in...> - 2007-10-12 01:02:54
// mzML_CV_to_W3C_Schema.cpp : Defines the entry point for the console application.
//

#include <iostream>
#include <string>
#include <cstring>   // strncmp
#include <fstream>
#include <algorithm>
#include <vector>
#include <map>
#include <stdexcept>

enum eReadState {INIT,TERM};
enum eAttachResult {NO,NEWLY_ATTACHED,ALREADY_ATTACHED};

inline void tabs(int n) {
    while (n--) {
        std::cout << " ";
    }
}

class Term {
public:
    Term() {
        m_bObsolete=false;
        m_bRoot=false;
        m_bIsAttr=false;
    }
    // stuff of interest in each TERM entry
    std::string m_id;
    std::string m_name;
    std::vector<std::string> m_partOf;
    std::string m_isA;
    bool m_bObsolete;
    bool m_bRoot; // is this the root term?
    bool m_bIsAttr; // special case for this declared as object_attribute
    void print() const {
        std::cout << m_id << " \"" << m_name << "\"" << std::endl;
    }
};

typedef std::map<std::string,Term *> TermMap_t;

class Node {
public:
    Node() {
        m_term = NULL;
    }
    Node(Term *term) {
        m_term = term;
    }
    Term *m_term; // term that created this node
    std::vector<Node *> m_members; // things claiming to be part_of this
    std::vector<Node *> m_subtypes; // things claiming to be is_a this

    // dump the tree to stdout
    void print(int &tabdepth,const char *msg=NULL) const {
        tabs(tabdepth);
        if (msg) {
            std::cout << msg;
        }
        m_term->print();
        tabdepth++;
        for (int n=(int)m_members.size();n--;) {
            m_members[n]->print(tabdepth,"has ");
        }
        if (m_subtypes.size()) {
            for (int n=(int)m_subtypes.size();n--;) {
                tabdepth++;
                m_subtypes[n]->print(tabdepth,"parent of ");
                tabdepth--;
            }
        }
        tabdepth--;
    }

    // try to add the term to the tree based on part_of claim
    eAttachResult attachPart(Term *term,const std::string &partof) { // try to find term's place
        eAttachResult result = NO;
        if (partof == m_term->m_id) { // this term is the one claimed as the parent
            int i;
            for (i=(int)m_members.size();i--;) { // did we get this already?
                if (term==m_members[i]->m_term) {
                    break; // already got this one
                }
            }
            if (i<0) { // new here
                m_members.push_back(new Node(term));
                result = NEWLY_ATTACHED; // could exit here, but I'm paranoid
            } else {
                result = ALREADY_ATTACHED; // could exit here, but I'm paranoid
            }
        }
        // recurse down the tree
        for (int n=(int)m_members.size();n--;) {
            eAttachResult previous_result = result;
            result = m_members[n]->attachPart(term,partof);
            if (result != previous_result) {
                if ((NEWLY_ATTACHED == result) && (ALREADY_ATTACHED == previous_result)) {
                    std::cout << "warning - " << partof << " joined multiple nodes" << std::endl;
                }
                if (previous_result && !result) {
                    result = previous_result;
                } // could exit here, but I'm paranoid
            }
        }
        return result;
    }

    // try to add the term to the tree based on is_a claim
    eAttachResult attachSubtype(Term *term,const std::string &isA) { // try to find term's place
        eAttachResult result = NO;
        if (isA == m_term->m_id) { // this term is the one claimed as the base type
            int i;
            for (i=(int)m_subtypes.size();i--;) { // did we get this already?
                if (term==m_subtypes[i]->m_term) {
                    break; // already got this one
                }
            }
            if (i<0) { // new here
                m_subtypes.push_back(new Node(term));
                result = NEWLY_ATTACHED; // could exit here, but I'm paranoid
            } else {
                result = ALREADY_ATTACHED; // could exit here, but I'm paranoid
            }
        }
        // recurse down the subtypes tree
        for (int n=(int)m_subtypes.size();n--;) {
            eAttachResult previous_result = result;
            result = m_subtypes[n]->attachSubtype(term,isA);
            if (result != previous_result) {
                if ((NEWLY_ATTACHED == result) && (ALREADY_ATTACHED == previous_result)) {
                    std::cout << "warning - " << isA << " joined multiple nodes" << std::endl;
                }
                if (previous_result && !result) {
                    result = previous_result;
                } // could exit here, but I'm paranoid
            }
        }
        // recurse down the members tree
        for (int n=(int)m_members.size();n--;) {
            eAttachResult previous_result = result;
            result = m_members[n]->attachSubtype(term,isA);
            if (result != previous_result) {
                if ((NEWLY_ATTACHED == result) && (ALREADY_ATTACHED == previous_result)) {
                    std::cout << "warning - " << isA << " joined multiple nodes" << std::endl;
                }
                if (previous_result && !result) {
                    result = previous_result;
                } // could exit here, but I'm paranoid
            }
        }
        return result;
    }
};

static int ctdepth=0;
static void climbTree(const std::vector<Term *> &terms,const Term *t) {
    if (ctdepth || !t->m_bRoot) {
        t->print();
    }
    ctdepth++;
    for (int i=(int)terms.size();i--;) {
        Term *ti=terms[i];
        if (ti->m_id == t->m_isA) {
            tabs(ctdepth);
            std::cout << "is a ";
            climbTree(terms,ti);
        }
        for (int p=(int)t->m_partOf.size();p--;) {
            if (ti->m_id == t->m_partOf[p]) {
                tabs(ctdepth);
                std::cout << "part of ";
                climbTree(terms,ti);
            }
        }
    }
    ctdepth--;
}

int main(int argc, char* argv[])
{
    std::ifstream cvFile;
    std::ofstream w3cFile; // not used yet
    std::string buffer;
    eReadState state=INIT;
    std::vector<Term *> terms;
    Term *term = NULL;
    Node head;

    // need the CV file name on the command line
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <CV file>" << std::endl;
        return 1;
    }

    // here we go
    cvFile.open(argv[1], std::ios::in);
    if (!cvFile) {
        // std::exception has no portable string constructor, use runtime_error
        throw std::runtime_error("error opening file for read().\n");
    }
    while (std::getline(cvFile, buffer)) {
        if (buffer=="[Term]") {
            state = TERM;
            term = new Term;
        } else if (TERM==state) {
            if (buffer=="") { // end of item, process now
                if (term->m_partOf.size()&&term->m_isA.length()) {
                    // WTF? is_a and part_of should be mutually exclusive
                    std::cout << "is_a and part_of relationship for " << term->m_id << " "
                        << term->m_name << " using is_a, ignoring part_of" << std::endl;
                    term->m_partOf.clear(); // drop the part_of relationships
                }
                if (term->m_partOf.size()||term->m_isA.length()) {
                    // only interested in is_a and term->m_partOf stuff
                    if (term->m_bObsolete) {
                        std::cout << "warning - obsolete item " << term->m_id << " in relationship" << std::endl;
                    }
                    terms.push_back(term);
                } else if (!term->m_bObsolete) {
                    if (term->m_id == "MS:0000000") { // root term?
                        term->m_bRoot = true; // yes, this is root
                        terms.push_back(term);
                        head.m_term = term; // set the head node
                    } else {
                        std::cout << "no relationship for " << term->m_id << " " << term->m_name << std::endl;
                        delete term;
                    }
                }
                // done now, reset
                state = INIT;
            } else if (!strncmp(buffer.c_str(),"id: ",4)) {
                term->m_id = buffer.substr(4);
            } else if (!strncmp(buffer.c_str(),"name: ",6)) {
                term->m_name = buffer.substr(6);
            } else if (!strncmp(buffer.c_str(),"is_a: ",6)) {
                term->m_isA = buffer.substr(6,buffer.find_first_of(' ',6)-6);
            } else if (!strncmp(buffer.c_str(),"relationship: part_of ",22)) {
                term->m_partOf.push_back(buffer.substr(22,buffer.find_first_of(' ',22)-22));
            } else if (buffer == "is_obsolete: true") {
                term->m_bObsolete = true;
            }
        } // end if in [TERM]
    }

    if (!cvFile.eof()) // if reason of termination != eof
    {
        throw std::runtime_error("error while parsing file.\n");
    }
    if (!head.m_term) // the tree walk below dereferences the root term
    {
        throw std::runtime_error("no root term (MS:0000000) found.\n");
    }

    // inelegant, brute force tree build
    int last_n_placed = 0;
    while (1) { // place part_of relationships first
        int n_placed = 0;
        int n_to_place = 0;
        for (int i = (int)terms.size();i--;) {
            Term* term = terms[i];
            for (int p=(int)term->m_partOf.size();p--;) {
                // traverse the tree to see if we can join
                n_to_place++;
                if (head.attachPart(term,term->m_partOf[p])) {
                    n_placed++;
                }
            }
        }
        if (n_placed==n_to_place) {
            break; // done
        }
        if (n_placed == last_n_placed) {
            std::cout << "inconsistent tree, unplaced part_of nodes" << std::endl;
            for (int i = (int)terms.size();i--;) {
                Term* term = terms[i];
                for (int p=(int)term->m_partOf.size();p--;) {
                    // traverse the tree to see if we can join
                    if (!head.attachPart(term,term->m_partOf[p])) {
                        climbTree(terms,term);
                    }
                }
            }
            break;
        }
        last_n_placed = n_placed; // watch for stallout
    }

    last_n_placed = 0;
    while (1) { // place isA relationships now
        int n_placed = 0;
        int n_to_place = 0;
        for (int i = (int)terms.size();i--;) {
            Term* term = terms[i];
            if (!term->m_partOf.size()) {
                // traverse the tree to see if we can join
                n_to_place++;
                if (head.attachSubtype(term,term->m_isA)) {
                    n_placed++;
                }
            }
        }
        if (n_placed==n_to_place) {
            break; // done
        }
        if (n_placed == last_n_placed) {
            std::cout << "inconsistent tree, unplaced is_a nodes" << std::endl;
            for (int i = (int)terms.size();i--;) {
                Term* term = terms[i];
                if (!term->m_partOf.size()) {
                    // traverse the tree to see if we can join
                    if (!head.attachSubtype(term,term->m_isA)) {
                        climbTree(terms,term);
                    }
                }
            }
            break;
        }
        last_n_placed = n_placed; // watch for stallout
    }

    int tabdepth = 0;
    head.print(tabdepth);
    return 0;
}
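For readers unfamiliar with the input, the parser above keys on OBO stanza lines such as `is_a: MS:1000021 ! reflectron state`, keeping only the accession and dropping the trailing `! name` comment. The helper below is a minimal sketch of that one step in isolation; the name `isATarget` is hypothetical and is not part of the program above.

```cpp
#include <string>

// Extract the target accession from an OBO "is_a: " line, dropping any
// trailing "! comment". Returns an empty string for any other line.
// (Hypothetical helper, shown only to illustrate the parsing step.)
std::string isATarget(const std::string& line) {
    const std::string prefix = "is_a: ";
    if (line.compare(0, prefix.size(), prefix) != 0) {
        return "";
    }
    // The accession runs from the end of the prefix to the next space,
    // or to the end of the line if there is no comment.
    const std::string::size_type end = line.find(' ', prefix.size());
    if (end == std::string::npos) {
        return line.substr(prefix.size());
    }
    return line.substr(prefix.size(), end - prefix.size());
}
```

The `relationship: part_of ` lines are handled the same way in the program, only with a longer prefix.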
From: Matthew C. <mat...@va...> - 2007-10-11 15:29:47
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type"> <title></title> </head> <body bgcolor="#ffffff" text="#000000"> Moving on to an appropriate subject so we can reform the Church of Controlled Vocabulary... :)<br> <br> <br> Eric Deutsch wrote: <blockquote cite="mid:5BE...@he..." type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">Hi everyone, I’ve taken some time to think carefully about what Brian says and here is my attempt at focusing the discussion:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- First: yes, there are several problems in the CV is_a and part_of. We agreed at the CV meeting that we will tackle this to try to make it uniform.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- Here are two rules within the CV that may hold true and should be documented:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> - if a term’s direct parent is a “xxxx attribute”, then it must furnish a value within the cvParam element, else it cannot<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> - if a term has children, then it cannot be specified as a cvParam (except as a category/parent in option C)<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> Is this correct? 
Counter examples?<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- Regarding the reflectron example, I think the CV should look like this, even though it does not quite now:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> - “reflectron on” is_a “reflectron state” is_a “analyzer attribute”<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> - “reflectron off” is_a “reflectron state” is_a “analyzer attribute”<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> </div> </blockquote> These points do not address the more significant issue that the CV is apparently incapable of defining types for categories with uncontrolled values and there is no automatic way to distinguish between a category and a controlled value (i.e. an accession number that represents a category vs. an accession number that represents a value). I suggest the convention (like Angel mentions in his reply to this post) where categories have a pure PART_OF relationship and controlled values have an IS_A relationship to their parent category. I still don't know how to encapsulate the type information for uncontrolled values in the CV though. Perhaps each type (real, integer, string, etc.) could be given a special accession number which indicates the type and also indicates to the validator/parser that the value should be taken from the name/text attribute instead of the accession attribute? 
But then I'm not sure how to assign that accession number to the uncontrolled classes, because each type would have an IS_A relationship to multiple categories.<br> <br> <br> <blockquote cite="mid:5BE...@he..." type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- Thus cvParams would be used like this:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> Option A: <cvParam cvLabel="MS" accession="MS:1000105" name="reflectron off" value="" /><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> Option C+: <cvParam name="reflectron off" cvLabel="MS" accession="MS:1000105" parentAccession=” MS:1000021”/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> </div> </blockquote> I will regurgitate my preferred version of Option C:<br> Option E: <cvParam name="reflectron state" valueName="off" accession="MS:1000021" valueAccession="MS:1000105"/><br> Same information, but IMO more intuitive, human readable, and it avoids the potentially nasty pitfall of defining what a "parent" is (i.e. is it one level up the CV branch, all the way up, part of the way up?).<br> <br> <br> <blockquote cite="mid:5BE...@he..." 
type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- Brian proposed:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> <reflectronState accession=”MS:1000021” off/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> This does not seem like well formed XML to me. Or is it?? I assume he meant this:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> <reflectronState accession=”MS:1000105” name=“reflectron off”/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- If so, the real dilemma is between:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> 1) <cvParam name="reflectron off" cvLabel="MS" accession="MS:1000105" parentAccession=” MS:1000021”/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> 2) <reflectronState accession=”MS:1000105” name=“reflectron off”/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> Brian, would you agree that these are the two sides? They both seem fully complete to me. 
If I’ve got it wrong, then the rest would seem premature, but I’ll press on believing I’ve got it right. Because by creating an element in the schema <reflectronState>, this automatically takes the place of { cvLabel="MS" parentAccession=” MS:1000021” }</span></font></p> </div> </blockquote> Yes, that is the real dilemma. I cast my vote in for going either ALL CV or ALL schema. I don't like the idea of mixing the two. I am a bit confused though and Brian will need to clarify: he previously suggested that the entire schema would be hand-rolled and the CV would be generated FROM the schema. Would that mean that accession numbers would be assigned in the schema and propagated into the CV? I don't recall Brian proposing the <reflectronState ...> method while still filling in the schema from a separately maintained CV - that would be too much hassle.<br> <br> No matter which route we take though, we should have a fully descriptive XML schema in order to allow standard XML tools to do the semantic validation. In the case of the CV, that schema will be auto-generated every time the CV changes. In the case of the hand-rolled schema, it'll be completely self-contained.<br> <br> <br> <blockquote cite="mid:5BE...@he..." 
type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- So for option 1, we’re essentially at that right now (we would need to adjust option A to option 1, but it’s close)<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- For option 2, we would need to find all the CV terms that we think deserve to be promoted to element status and add them to schema. I don’t know how many there are, but there would be lots. The schema would increase in size many fold.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- A further complication is where does this element go? Does it go in the instrument description section? Or could the reflectron be turned on and off for different spectra and thus go in the scan element? I have no idea. If we put it in the schema, we’ve got to get it right now. 
If we don’t, then the schema will have to be updated to fix it.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- The current state is a flexible (some might say lazy or dangerous) way. We acknowledge that we don’t have all the CV terms and we’re not exactly sure where some will be used, so we leave it open. No example instance document yet has reflectron state information in it. I’d be delighted if someone could provide one.</span></font></p> </div> </blockquote> No matter which way we go, CV w/ autogenerated schema or hand-rolled schema, or cvParams or explicit elements, changing an element's valid location from one part of the document to another will break backward compatibility with the semantic validation, as well as breaking all but the smartest parsers. We should definitely try to avoid moving terms around once we've released the spec!<br> <br> <br> <blockquote cite="mid:5BE...@he..." type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- So what we can do today is provide a term “reflectron off” that almost no one really cares much about and let someone out there who does care write some mzML with this annotation in it. When this document is checked against the semantic validator, the validator will complain that you’ve used a child term of “reflectron state” in a place where it’s not allowed. But the writer insists that it should be allowed there. The PSI-MS WG is persuaded it should be. 
So we update the semantic validator and the CV perhaps and these new documents are written out with reflectron state information and validate. Most software doesn’t care a hoot about the reflectron state and that cvParam can be safely ignored or dumbly displayed to the user in case the user cares. All the above can happen without a rev of the schema.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- But that’s the same thing as updating the schema except in name, you say. Perhaps.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> </div> </blockquote> I also say it's the same as updating the schema, because the schema DOES have to be updated when the CV is updated in order to reflect the new changes. Right now we have a pretty useless schema because it is inadequate to do semantic validation or write a parser.<br> <br> <br> <blockquote cite="mid:5BE...@he..." type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- So, I hope I have helped this discussion rather than confused it. Clearly the current schema has a big element of flexibility/power/danger in it. Some would believe that this will allow us to improve the format in minor ways without schema revision and provide a way for producers to express their data with annotations that make sense to them. The only thing standing between flexibility and utter mayhem is the semantic validator. Perhaps in some sense, this is half XML schema and half pseudo RDF. 
Can we pull it off or are we lunatics for trying it?<o:p></o:p></span></font></p> </div> </blockquote> We need to re-evaluate the idea that the schema should be perpetually unchanging. To me, that is an illogical and contradictory requirement when we also have the requirement to do semantic validation with an ever-changing CV. Why should we be afraid of schema revisions? We should, more specifically, be afraid of removing existing terms, shifting them from one part of the spec to another, and adding new features (like new compression types for the peak lists, new precision types, etc.). And I hope everyone can see that these fears should exist for both a CV-based schema and a hand-rolled schema.<br> <br> <br> <blockquote cite="mid:5BE...@he..." type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- I am clearly biased here, but I try to keep an open mind.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- To my mind, the most important unconsidered problem that Brian brings up is the data type problem. 
Consider the example:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> <cvParam cvLabel="MS" accession="MS:1000285" name="total ion current" value="1.66755e+007" parentAccession=”MS:1000499”/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> Brian’s proposed alternative is (I hope I’m right):<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> <spectrumAttribute accession="MS:1000285" name="total ion current" value="1.66755e+007"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> In principle, this second way would allow me to specify a data type and let XML validators enforce it. However, this may not quite work either, because what if I want:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> <spectrumAttribute accession="MS:1009999" name="spectrum subjective quality" value="10"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">To be allowed? All spectrumAttributes would have to have the same data type for that to work. The example is pretty contrived. 
Unless every single attribute got its own element like:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> <totalIonCurrent value="1.66755e+007"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- The latter here is fully specified and concrete. But if we get anything wrong or want to add anything, then we have to release a new version of the schema. One possible option is to full specify in schema everything we can think of now, and then for new or later things use cvParam. If we do that, then we’re still needing to apply sematic validation so we’ve only half-solved the problem. Finally, a dangerous door may be opening. If we want to expand this duality, we have a possible “more than one way to do it” problem. Some might choose to use the cvParam, and some the schema element. The only thing that could prevent that is the semantic validator again.</span></font></p> </div> </blockquote> No duality should be possible. A category should either be done with an element or with a cvParam, and I prefer that all categories should be done with one or the other instead of a mix of the two. But certainly no single category should have both an element and a cvParam method for specifying its value.<br> <br> <br> <blockquote cite="mid:5BE...@he..." 
type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- I wonder whether we can add a nice method of datatype validation to option 1 above? Any ideas?<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> </div> </blockquote> First we have to get data type specification into the CV that is complete and comprehensible to machines (so we can auto-generate a schema from the CV). Let's figure that out first. :) And if we CAN'T do that, we are pretty much forced to go with a hand-rolled schema because at that point I see very little reason to use the OBO CV at all.<br> <br> <br> <blockquote cite="mid:5BE...@he..." 
type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">I had hoped to focus the discussion, but rereading it, all I did was shake the already-opened can of worms.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">Let the commentary ensue.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">Regards,<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">Eric<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <div style="border-style: none none none solid; border-color: -moz-use-text-color 
-moz-use-text-color -moz-use-text-color blue; border-width: medium medium medium 1.5pt; padding: 0in 0in 0in 4pt;"> <div> <div class="MsoNormal" style="text-align: center;" align="center"><font face="Times New Roman" size="3"><span style="font-size: 12pt;"> <hr tabindex="-1" align="center" size="2" width="100%"></span></font></div> <p class="MsoNormal"><b><font face="Tahoma" size="2"><span style="font-size: 10pt; font-family: Tahoma; font-weight: bold;">From:</span></font></b><font face="Tahoma" size="2"><span style="font-size: 10pt; font-family: Tahoma;"> <a class="moz-txt-link-abbreviated" href="mailto:psi...@li...">psi...@li...</a> [<a class="moz-txt-link-freetext" href="mailto:psi...@li...">mailto:psi...@li...</a>] <b><span style="font-weight: bold;">On Behalf Of </span></b>Brian Pratt<br> <b><span style="font-weight: bold;">Sent:</span></b> Monday, October 08, 2007 11:38 AM<br> <b><span style="font-weight: bold;">To:</span></b> 'Mass spectrometry standard development'<br> <b><span style="font-weight: bold;">Subject:</span></b> [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_oferrors?)</span></font><o:p></o:p></p> </div> <p class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size: 12pt;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">Eh, it’s even more broken than I thought. I’ve amended my amendments inline below, new changes in double parenthesis. 
<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">After a day so of messing with this, it is now:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">MANIFESTO TIME!<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">RESOLVED:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">The mzML specification process should be schema-centric, and the CV should be generated from the schema (should be a fairly simple matter of XSLT, since XSD is itself XML). 
<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">REASON 1: THE CV-CENTRIC APPROACH IS ERROR PRONE.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">The kinds of inheritance errors shown below are, if not actually impossible, much harder to make in the context of a W3C schema when using readily available software tools to create and maintain the schema.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">REASON 2: OBO/CV IS AN INSUFFICIENT TOOL FOR THE JOB OF PRODUCING A READILY AND THOROUGHLY VALIDATABLE DATA FORMAT.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">CV apparently provides no means for specifying range or formatting of instance values. An “isolation width” (</span></font><font face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New";">MS:1000023) </span></font><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">could happily have a value of “-2”, “2”, “two”, or “extra sprinkles, please”. You could (and should) certainly put some text in the description along the lines of “this is a non-negative floating point value” but that’s no help to a validating parser. 
XSD on the other hand has standardized syntax for enforcing precisely these kinds of restrictions, meaning that validating parsers and code generators (for both read and write) don’t need any special-purpose logic added. <o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">There are a handful of places where value range restrictions have been attempted in the MS CV, but these are awkward because of the tools. The reflectron_state, for example, has two children “on” and “off”, but this only confuses things, since these are not *<b><span style="font-weight: bold;">values</span></b>* of reflectron state but rather *<b><span style="font-weight: bold;">are</span></b>* reflectron states, a distinction which may be meaningless in English but significant when attempting to create a data structure. 
Picture how this looks in an instance doc:<o:p></o:p></span></font></p> <p class="MsoNormal" style="text-indent: 0.5in;"><span class="m1"><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;"><</span></font></span><span class="t1"><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;">cvParam</span></font></span><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;"> <span class="t1"><font color="black"><span style="color: black;">cvLabel</span></font></span><span class="m1"><font color="black"><span style="color: black;">="</span></font></span><b><span style="font-weight: bold;">MS</span></b><span class="m1"><font color="black"><span style="color: black;">"</span></font></span><span class="t1"><font color="black"><span style="color: black;"> accession</span></font></span><span class="m1"><font color="black"><span style="color: black;">="</span></font></span><b><span style="font-weight: bold;">MS:1000105</span></b><span class="m1"><font color="black"><span style="color: black;">"</span></font></span><span class="t1"><font color="black"><span style="color: black;"> name</span></font></span><span class="m1"><font color="black"><span style="color: black;">="</span></font></span><b><span style="font-weight: bold;">off</span></b><span class="m1"><font color="black"><span style="color: black;">"</span></font></span><span class="t1"><font color="black"><span style="color: black;"> value</span></font></span><span class="m1"><font color="black"><span style="color: black;">="" /></span></font></span></span></font><span class="m1"><font color="black" face="Courier New"><span style="font-family: "Courier New"; color: black;"><o:p></o:p></span></font></span></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: 
navy;">I can’t think of anything nice to say about that. Better it should read:</span></font><font color="navy" face="Arial"><span style="font-family: Arial; color: navy;"><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> </span></font><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;"><reflectronState accession=”MS:1000021” off/><o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">CONCLUSION: THE CV WORK TO DATE IS IMPORTANT AND USEFUL, BUT SHOULD BE RECAST AS SCHEMA WORK<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">The CV should not attempt to be a replacement for the schema - it just hasn’t got the requisite mechanisms to do the job. The information CV can convey is only a subset of the information that is needed to fully specify a data format. The information in the CV as it stands should be folded into the mzML schema, and maintained therein moving forward. An actual OBO/CV file can be generated as needed. 
<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">- Brian<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> </div> </div> </blockquote> <br> </body> </html> |
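The reply above asks for data type information in the CV that is "comprehensible to machines (so we can auto-generate a schema from the CV)". A minimal sketch of what that generation step might look like, assuming each CV term could carry a machine-readable datatype. The `element` and `xsd_type` fields and the `build_schema` helper are hypothetical and are not part of the OBO file or of mzML; the element name `totalIonCurrent` is taken from the example in the message.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical typed CV terms. The "element" and "xsd_type" fields are
# assumptions for illustration; the current OBO file has no such fields.
TYPED_TERMS = [
    {"accession": "MS:1000285", "name": "total ion current",
     "element": "totalIonCurrent", "xsd_type": "xs:double"},
    {"accession": "MS:1000023", "name": "isolation width",
     "element": "isolationWidth", "xsd_type": "xs:double"},
]


def build_schema(terms):
    """Emit one xs:element per typed term, with a typed 'value' attribute."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    for term in terms:
        element = ET.SubElement(schema, f"{{{XS}}}element", name=term["element"])
        complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
        # The value attribute carries the datatype an XML validator can enforce.
        ET.SubElement(complex_type, f"{{{XS}}}attribute",
                      name="value", type=term["xsd_type"], use="required")
        # The accession is pinned so the element stays traceable to the CV term.
        ET.SubElement(complex_type, f"{{{XS}}}attribute",
                      name="accession", type="xs:string", fixed=term["accession"])
    return schema


schema = build_schema(TYPED_TERMS)
print(ET.tostring(schema, encoding="unicode"))
```

Every term added this way still changes the generated schema, so the sketch does not remove the schema-revision concern; it only moves the source of truth into the CV.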
From: Angel P. <an...@ma...> - 2007-10-11 14:41:53
|
On 10/11/07, Eric Deutsch <ede...@sy...> wrote:
>
> - Here are two rules within the CV worth that may hold true and should be
>   documented:
>
>   - if a term's direct parent is a "xxxx attribute", then it must furnish
>     a value within the cvParam element, else it cannot
>
>   - if a term has children, then it cannot be specified as a cvParam
>     (except as a category/parent in option C)
>
> Is this correct? Counter examples?

I don't think this is correct. You cannot put the second limitation on cvParam; there are just way too many cases where this rule will break down. At least this has been my experience working with other data formats that use CVs. Maybe this is not the case for the mzML schema and CV as they currently stand, but I doubt it.

Also, encoding and usage rules that are specified outside of the actual CV and/or spec are not a good idea. This is also experience garnered from other standards efforts (specifically the MGED ontology usage with MAGE). You'll have to trust me on this, b/c as written these rules seem simple enough, but when you get right down to using them with the schema and an OBO CV, you are going to find a lot of implementation problems.

The trouble with OBO is that there are no built-in mechanisms for distinguishing terms that are classes from terms that are enumerated values, or for specifying which terms can only have enumerated values, etc. You can only do this via conventions that must be strictly followed both when encoding and when using the CV. An example of a convention would be creating a term "EnumerationTerm" and making all leaf terms IS_A this.

-angel |
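To make the two quoted rules concrete, here is a small sketch that applies them to a toy slice of the CV. It assumes the is_a layout Eric proposes for the reflectron terms; the `IS_A` table and the `check_cv_param` helper are hypothetical and stand in for whatever the semantic validator actually does.

```python
# Toy slice of the CV: term name -> direct is_a parent.
# This layout is the one proposed in the thread, not the CV as released.
IS_A = {
    "reflectron on": "reflectron state",
    "reflectron off": "reflectron state",
    "reflectron state": "analyzer attribute",
    "magnetic field strength": "analyzer attribute",
}


def has_children(term):
    """True if any term in the toy CV names this term as its direct parent."""
    return any(parent == term for parent in IS_A.values())


def check_cv_param(name, value):
    """Return a list of rule violations for a cvParam with this name and value."""
    if name not in IS_A:
        return [f"{name!r}: unknown term"]
    problems = []
    # Rule 2: a term with children cannot be used as a cvParam.
    if has_children(name):
        problems.append(f"{name!r}: has children, so it cannot be a cvParam")
    # Rule 1: a direct child of an "xxxx attribute" term must carry a value;
    # any other term must not.
    parent_is_attribute = IS_A[name].endswith(" attribute")
    if parent_is_attribute and value == "":
        problems.append(f"{name!r}: must furnish a value")
    if not parent_is_attribute and value != "":
        problems.append(f"{name!r}: must not furnish a value")
    return problems


for name, value in [("reflectron off", ""),
                    ("magnetic field strength", ""),
                    ("reflectron state", "")]:
    print(name, "->", check_cv_param(name, value))
```

In this toy data `reflectron state` trips both rules at once: it sits directly under an attribute term, so rule 1 wants a value, while rule 2 forbids using it at all. That is only one constructed case, but it is the kind of collision the rules would need to be checked against.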
From: <Jam...@wa...> - 2007-10-11 08:39:39
|
I will be out of the office starting 06/10/2007 and will not return until 01/11/2007. I will be checking my e-mail infrequently whilst I am away. |
From: Eric D. <ede...@sy...> - 2007-10-11 08:36:29
|
Hi everyone,

I've taken some time to think carefully about what Brian says and here is my attempt at focusing the discussion:

- First: yes, there are several problems in the CV is_a and part_of. We agreed at the CV meeting that we will tackle this to try to make it uniform.

- Here are two rules within the CV worth that may hold true and should be documented:
  - if a term's direct parent is a "xxxx attribute", then it must furnish a value within the cvParam element, else it cannot
  - if a term has children, then it cannot be specified as a cvParam (except as a category/parent in option C)
  Is this correct? Counter examples?

- Regarding the reflectron example, I think the CV should look like this, even though it does not quite now:
  - "reflectron on" is_a "reflectron state" is_a "analyzer attribute"
  - "reflectron off" is_a "reflectron state" is_a "analyzer attribute"

- Thus cvParams would be used like this:
  Option A:
    <cvParam cvLabel="MS" accession="MS:1000105" name="reflectron off" value="" />
  Option C+:
    <cvParam name="reflectron off" cvLabel="MS" accession="MS:1000105" parentAccession="MS:1000021"/>

- Brian proposed:
    <reflectronState accession="MS:1000021" off/>
  This does not seem like well formed XML to me. Or is it?? I assume he meant this:
    <reflectronState accession="MS:1000105" name="reflectron off"/>

- If so, the real dilemma is between:
  1) <cvParam name="reflectron off" cvLabel="MS" accession="MS:1000105" parentAccession="MS:1000021"/>
  2) <reflectronState accession="MS:1000105" name="reflectron off"/>
  Brian, would you agree that these are the two sides? They both seem fully complete to me, because creating an element <reflectronState> in the schema automatically takes the place of { cvLabel="MS" parentAccession="MS:1000021" }. If I've got it wrong, then the rest would seem premature, but I'll press on believing I've got it right.

- So for option 1, we're essentially at that right now (we would need to adjust option A to option 1, but it's close).

- For option 2, we would need to find all the CV terms that we think deserve to be promoted to element status and add them to the schema. I don't know how many there are, but there would be lots. The schema would increase in size many fold.

- A further complication is where does this element go? Does it go in the instrument description section? Or could the reflectron be turned on and off for different spectra and thus go in the scan element? I have no idea. If we put it in the schema, we've got to get it right now. If we don't, then the schema will have to be updated to fix it.

- The current state is a flexible (some might say lazy or dangerous) way. We acknowledge that we don't have all the CV terms and we're not exactly sure where some will be used, so we leave it open. No example instance document yet has reflectron state information in it. I'd be delighted if someone could provide one.

- So what we can do today is provide a term "reflectron off" that almost no one really cares much about and let someone out there who does care write some mzML with this annotation in it. When this document is checked against the semantic validator, the validator will complain that you've used a child term of "reflectron state" in a place where it's not allowed. But the writer insists that it should be allowed there. The PSI-MS WG is persuaded it should be. So we update the semantic validator and perhaps the CV, and these new documents are written out with reflectron state information and validate. Most software doesn't care a hoot about the reflectron state and that cvParam can be safely ignored or dumbly displayed to the user in case the user cares. All the above can happen without a rev of the schema.

- But that's the same thing as updating the schema except in name, you say. Perhaps.

- So, I hope I have helped this discussion rather than confused it. Clearly the current schema has a big element of flexibility/power/danger in it. Some would believe that this will allow us to improve the format in minor ways without schema revision and provide a way for producers to express their data with annotations that make sense to them. The only thing standing between flexibility and utter mayhem is the semantic validator. Perhaps in some sense, this is half XML schema and half pseudo RDF. Can we pull it off or are we lunatics for trying it?

- I am clearly biased here, but I try to keep an open mind.

- To my mind, the most important unconsidered problem that Brian brings up is the data type problem. Consider the example:
    <cvParam cvLabel="MS" accession="MS:1000285" name="total ion current" value="1.66755e+007" parentAccession="MS:1000499"/>
  Brian's proposed alternative is (I hope I'm right):
    <spectrumAttribute accession="MS:1000285" name="total ion current" value="1.66755e+007">
  In principle, this second way would allow me to specify a data type and let XML validators enforce it. However, this may not quite work either, because what if I also want the following to be allowed:
    <spectrumAttribute accession="MS:1009999" name="spectrum subjective quality" value="10">
  All spectrumAttributes would have to have the same data type for that to work. The example is pretty contrived. Unless every single attribute got its own element like:
    <totalIonCurrent value="1.66755e+007">

- The latter here is fully specified and concrete. But if we get anything wrong or want to add anything, then we have to release a new version of the schema. One possible option is to fully specify in the schema everything we can think of now, and then for new or later things use cvParam. If we do that, then we still need to apply semantic validation, so we've only half-solved the problem. Finally, a dangerous door may be opening. If we want to expand this duality, we have a possible "more than one way to do it" problem. Some might choose to use the cvParam, and some the schema element. The only thing that could prevent that is the semantic validator again.

- I wonder whether we can add a nice method of datatype validation to option 1 above? Any ideas?

I had hoped to focus the discussion, but rereading it, all I did was shake the already-opened can of worms.

Let the commentary ensue.

Regards,
Eric

________________________________
From: psi...@li... [mailto:psi...@li...] On Behalf Of Brian Pratt
Sent: Monday, October 08, 2007 11:38 AM
To: 'Mass spectrometry standard development'
Subject: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of errors?)

Eh, it's even more broken than I thought. I've amended my amendments inline below, new changes in double parenthesis.

After a day or so of messing with this, it is now:

MANIFESTO TIME!

RESOLVED:
The mzML specification process should be schema-centric, and the CV should be generated from the schema (should be a fairly simple matter of XSLT, since XSD is itself XML).

REASON 1: THE CV-CENTRIC APPROACH IS ERROR PRONE.
The kinds of inheritance errors shown below are, if not actually impossible, much harder to make in the context of a W3C schema when using readily available software tools to create and maintain the schema.

REASON 2: OBO/CV IS AN INSUFFICIENT TOOL FOR THE JOB OF PRODUCING A READILY AND THOROUGHLY VALIDATABLE DATA FORMAT.
CV apparently provides no means for specifying range or formatting of instance values. An "isolation width" (MS:1000023) could happily have a value of "-2", "2", "two", or "extra sprinkles, please". You could (and should) certainly put some text in the description along the lines of "this is a non-negative floating point value" but that's no help to a validating parser. XSD on the other hand has standardized syntax for enforcing precisely these kinds of restrictions, meaning that validating parsers and code generators (for both read and write) don't need any special-purpose logic added.

There are a handful of places where value range restrictions have been attempted in the MS CV, but these are awkward because of the tools. The reflectron_state, for example, has two children "on" and "off", but this only confuses things, since these are not *values* of reflectron state but rather *are* reflectron states, a distinction which may be meaningless in English but significant when attempting to create a data structure. Picture how this looks in an instance doc:
    <cvParam cvLabel="MS" accession="MS:1000105" name="off" value="" />
I can't think of anything nice to say about that. Better it should read:
    <reflectronState accession="MS:1000021" off/>

CONCLUSION: THE CV WORK TO DATE IS IMPORTANT AND USEFUL, BUT SHOULD BE RECAST AS SCHEMA WORK
The CV should not attempt to be a replacement for the schema - it just hasn't got the requisite mechanisms to do the job. The information CV can convey is only a subset of the information that is needed to fully specify a data format. The information in the CV as it stands should be folded into the mzML schema, and maintained therein moving forward. An actual OBO/CV file can be generated as needed.

- Brian

________________________________
From: Brian Pratt [mailto:bri...@in...]
Sent: Friday, October 05, 2007 11:52 PM
To: 'Mass spectrometry standard development'
Subject: more is_a vs. part_of errors?

There are a handful of other cases where it appears that the authors have gotten "is a" and "part_of" confused. My proposed corrections (IN CAPS) inline:

MS:1000025 "magnetic field strength"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000024 "final MS exponent"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000022 "TOF Total Path Length"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000014 "accuracy"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

((note, these next two are just ugly, see notes at top of message))

MS:1000106 "on"
  is a MS:1000021 "reflectron state"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000105 "off"
  is a MS:1000021 "reflectron state"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

The following changes would make the Thermo and ABI stuff look like all the other vendors:

MS:1000495 "Applied Biosystems"
  part of (IS_A) MS:1000121 "ABI / SCIEX"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000176 "MAT95XP Trap"
  is a (IS_A) MS:1000493 "Finnigan MAT"
  part of MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000175 "MAT95XP"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000174 "MAT900XP Trap"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000173 "MAT900XP"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000172 "MAT253"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

I still think there's a schema in there, albeit jammed in slightly sideways at the moment. (( I don't think that anymore. I think there's a subset of a schema in there. ))

- Brian |
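On the question of adding datatype validation to option 1: one possibility is for the semantic validator to keep a per-accession table of value rules and check each cvParam's value attribute against it. A minimal sketch follows. The `VALUE_RULES` table and the `validate_cv_param` helper are hypothetical, and the non-negative bounds are assumptions (the isolation width bound follows the wording in Brian's example; the bound on total ion current is a guess made for illustration).

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical value rules keyed by accession: (term name, minimum allowed value).
# Neither the table nor the bounds come from the CV itself.
VALUE_RULES = {
    "MS:1000285": ("total ion current", 0.0),
    "MS:1000023": ("isolation width", 0.0),
}


def validate_cv_param(xml_text):
    """Return a list of datatype problems for one serialized cvParam element."""
    elem = ET.fromstring(xml_text)
    accession = elem.get("accession")
    rule = VALUE_RULES.get(accession)
    if rule is None:
        return []  # no rule recorded for this term, so nothing to check
    name, minimum = rule
    raw = elem.get("value", "")
    try:
        value = float(raw)
    except ValueError:
        return [f"{accession} ({name}): {raw!r} is not a floating point value"]
    if math.isnan(value) or value < minimum:
        return [f"{accession} ({name}): {raw!r} is below the minimum {minimum}"]
    return []


template = '<cvParam cvLabel="MS" accession="MS:1000023" name="isolation width" value="{}"/>'
for candidate in ["2", "-2", "two", "extra sprinkles, please"]:
    print(candidate, "->", validate_cv_param(template.format(candidate)))
```

This keeps the schema untouched, but the rules live outside both the schema and the CV, which is exactly the arrangement Angel cautions against, so it is a stopgap at best unless the rules can be stored in the CV.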
From: Eric D. <ede...@sy...> - 2007-10-09 20:12:06
|
Hi everyone,

In other news regarding mzML, Pete, Luisa and I met this morning to discuss the CV. The issues brought up here thus far as well as other important issues were discussed and another rev of the CV will be produced and posted soon for everyone's continued inspection. Thank you for the feedback.

I have not yet digested the manifesto, but this will be considered carefully and we'll have to have a continued discussion of it. I have hopes of digesting this in the next few days and focusing the discussion on it.

Thanks for your continued feedback,
Eric

----------------------------------
Eric Deutsch, Ph.D.
Institute for Systems Biology
1441 North 34th Street
Seattle WA 98103
Tel: 206-732-1397
Fax: 206-732-1260
Email: ede...@sy...
WWW: http://www.systemsbiology.org/Senior_Research_Scientists/Eric_Deutsch |
From: Eric D. <ede...@sy...> - 2007-10-09 20:01:56
|
Splendid, we appear to be reaching a conclusion. I tally:

- Brian votes to keep
- Angel votes to keep
- Marc votes to keep
- David votes to keep
- Eric votes to keep
- Matt is neutral
- ChrisA is neutral
- Mike does not want them
- everyone else abstains

The ayes have it. The schema stays as is wrt count attributes.

Thank you!
Eric |
From: Brian P. <bri...@in...> - 2007-10-09 19:35:12
|
As to performance implications of heap fragmentation, have a look at http://www.microquill.com/ - they sell a nice heap replacement library that can have an impressive impact on program performance without any code changes, just by managing the heap more intelligently (I've used it, it's for real). But if you can't have a clever heap manager then you have to be clever in how you manage the heap.

>> I would do roughly what C++ std::vector's (or Python lists, etc.) do

I expect you are referring to the way std::vector initially allocates room for, say, up to 10 items, then when that turns out to be not enough reallocates for 20, then 40, 80, 160, ..., 655360, 1310720,... - but consider also std::vector's reserve() method, which is a great illustration of the usefulness of the count. It allows you to declare the *expected* final size of the collection without demanding it be the *actual* final size. It preallocates enough memory to accommodate the addition of up to n elements to the vector before any reallocation takes place, and heap fragmentation is thus avoided along with a great many copy constructor executions (which engender even more heapfrag, probably). If an n+1'th element is added, reallocation takes place and performance isn't what it could be, but the program still runs without error. So it's a risk-free and very simple way to use the count info.

If your collection class of choice doesn't have some means of exploiting a hint about the expected size of the collection, well, no harm done. Anyone who is not using robust collection classes and is thus susceptible to running off the end of an array allocated based on the declared count is working harder than they need to.

But Angel is right, it's fun to trade tips and tricks but we should just vote... I vote keep 'em.

- Brian |
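The reserve() argument above can be tried out with a small sketch. This is only an illustration under stated assumptions: count_reallocations and declared_count are hypothetical names, declared_count stands in for an mzML count attribute, and a change in capacity() is used as a proxy for a reallocation. It is not code from any mzML reader.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends `actual` elements to a vector and returns how many times the
// vector's capacity changed, which is used here as a proxy for reallocation.
// `declared_count` (hypothetical name) plays the role of a count attribute:
// it is only passed to reserve() as a hint, so a wrong value costs speed,
// not correctness.
std::size_t count_reallocations(std::size_t declared_count, std::size_t actual) {
    std::vector<double> peaks;
    peaks.reserve(declared_count);  // hint only; 0 means "no hint"

    std::size_t reallocations = 0;
    std::size_t last_capacity = peaks.capacity();
    for (std::size_t i = 0; i < actual; ++i) {
        peaks.push_back(static_cast<double>(i));
        if (peaks.capacity() != last_capacity) {  // storage was regrown
            ++reallocations;
            last_capacity = peaks.capacity();
        }
    }
    // Every element is stored whether or not the hint was right.
    assert(peaks.size() == actual);
    return reallocations;
}
```

With a correct hint, e.g. count_reallocations(1000, 1000), no regrowth happens after the reserve() call. With an undersized hint, e.g. count_reallocations(1000, 5000), the vector regrows as usual and still holds all 5000 elements. How many regrowths occur without a hint depends on the standard library's growth factor.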
From: Mike C. <tu...@gm...> - 2007-10-09 19:06:30
|
I knew I was going to regret that (over-)simplification. Okay, so in reality I would never actually read the file twice--that's just easier to describe than something more realistic. Just off the top of my head, I would do roughly what C++ std::vector's (or Python lists, etc.) do in terms of memory allocation. This lets you read in a single pass, and uses memory in proportion to what is actually needed. (There are ways to deal with fragmentation as well, but that's *way* outside the bounds of what the mzML spec should care about.)

Also worth noting, in my not-so-humble opinion: (a) for general computation, 32-bit hardware is dead, and (b) if you don't have enough RAM to comfortably hold single mzML files, you probably should just buy more.

Mike |
From: Angel P. <an...@ma...> - 2007-10-09 19:00:58
|
Hi all,

I was never arguing against counts for the spectra, only *maybe* against annotations, and it seems that more people than not want them in, so I say keep 'em. In the interest of not diverting effort from more important issues, can we just take a vote and leave it at that?

my vote: keep counts

-angel

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004 |
From: Mike C. <tu...@gm...> - 2007-10-09 18:53:23
|
On 10/9/07, Brian Pratt <bri...@in...> wrote:
> Heap fragmentation has a performance cost that persists past the initial
> allocation(s), since it affects further allocations as well. If it can be
> avoided with a relatively simple mechanism like this, that's a good thing.
>
> I started coding in 1977, FWIW. Long enough to learn to prefer the simple
> solution over the one that requires a gestalt...

I agree that there would be an effect, but my guess is that it would be minimal in real-world situations.

I also appreciate the value of simplicity. The question here, in my mind, is figuring out what kinds of simplicity are best and figuring out how to trade them off. If you accept my premise that the 'count' value in the input cannot be fully trusted, then working out the cases and producing the value in the output seems more complex than just counting them as they come in. (This is a pretty minor consideration in the greater scheme of things, though.)

> To be fair, having done this stuff for a long time isn't really a predictor
> of me being any good at it, but I get by OK.

If you had asked me at any point in my career when I had achieved basic competence as a programmer, I would have replied "about four or five years ago". So, in retrospect, my total years of "less-than-competence" are increasing as time goes by... :-)

Mike |
From: Matthew C. <mat...@va...> - 2007-10-09 18:45:40
|
We are a bit off topic but this is interesting. :) To really assess the performance issues here you have to dig deeper than just heap fragmentation though. Assuming a list to store the SpectrumHeaders and vectors to store m/z values and intensities, and without preallocation based on counts, because of the tree-like nature of mzML, you'd end up with a memory footprint like:

Spectrum1Header Spectrum1Mz1...P Spectrum1Inten1...P Spectrum2Header Spectrum2Mz1...P Spectrum2Inten1...P ... SpectrumNHeader SpectrumNMz1...P SpectrumNInten1...P

If you preallocated the SpectrumHeaders in the list based on the count attribute, you'd instead get a footprint like:

Spectrum1Header Spectrum2Header ... SpectrumNHeader Spectrum1Mz1...P Spectrum1Inten1...P ... SpectrumNMz1...P SpectrumNInten1...P

So you're going to have a tradeoff of fragmentation either way. The fragmentation in the first case would be worse for quick sequential access to each SpectrumHeader, but better for accessing the peaks of a particular spectrum. The fragmentation in the second case would be better for quick sequential access to each SpectrumHeader, but worse for accessing the peaks of a particular spectrum. Access to the peaks could be further improved by storing the Mz and Inten values together (i.e. in a struct { float mz, inten; } ).

This is all incredibly superfluous though and I doubt this fragmentation has an appreciable performance impact on data with any kind of density to it. So if you needed extremely responsive performance on very sparse spectra, you might think about this stuff, but most of us are far more limited by the sheer number of peaks. And if extreme responsiveness is your goal, no conceivable XML format is going to help you!

-Matt |
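The struct { float mz, inten; } idea mentioned above can be sketched as follows. This is a minimal illustration, not anyone's actual reader: the type and function names (Peak, SpectrumParallel, SpectrumInterleaved, interleave, intensity_in_window) are hypothetical, and no performance claim is made beyond the access pattern being shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One peak with m/z and intensity adjacent in memory.
struct Peak {
    float mz;
    float inten;
};

// Parallel layout: m/z values and intensities live in separate buffers.
struct SpectrumParallel {
    std::vector<float> mz;
    std::vector<float> inten;
};

// Interleaved layout: one buffer, each peak's two values side by side.
struct SpectrumInterleaved {
    std::vector<Peak> peaks;
};

// Converts the parallel layout into the interleaved one.
SpectrumInterleaved interleave(const SpectrumParallel& s) {
    assert(s.mz.size() == s.inten.size());
    SpectrumInterleaved out;
    out.peaks.reserve(s.mz.size());  // final size is known, so reserve once
    for (std::size_t i = 0; i < s.mz.size(); ++i) {
        out.peaks.push_back(Peak{s.mz[i], s.inten[i]});
    }
    return out;
}

// Sums intensity over peaks whose m/z lies in [lo, hi]. Both fields of every
// peak are read, which is the access pattern that interleaving is meant for.
float intensity_in_window(const SpectrumInterleaved& s, float lo, float hi) {
    float total = 0.0f;
    for (const Peak& p : s.peaks) {
        if (p.mz >= lo && p.mz <= hi) {
            total += p.inten;
        }
    }
    return total;
}
```

A caller would fill a SpectrumParallel from the decoded arrays, call interleave() once, and then run per-peak queries such as intensity_in_window() on the result. Whether this beats the parallel layout in practice depends on the workload; code that only ever scans intensities, for example, may prefer the separate buffers.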
From: Chris A. <ch...@ma...> - 2007-10-09 17:25:36
|
Mike Coleman wrote:
> I can see why having a 'count' might make it easier for novice
> programmers to *write* a processing program, but I cannot see why
> having a 'count' would make more than a negligible difference in
> performance, if even that. As a worst case, one could read the mzML
> file into memory, scan it once to calculate the count, and then
> proceed as before. The additional time required to do a sweep through
> RAM would be trivial.

Isn't one of the features of mzML to store raw scan data? If so I imagine it wouldn't be long before users were generating multi-GB files (even possibly with just peak lists) that:

(i) Won't map into the 32bit address space limits of the OS;

(ii) Or if you're either using 64bit or else mapping chunks, you'll hit i/o and paging issues as the file will have to be read twice (once for the scan and again for the parser) unless you have a huge amount of RAM of course.

Not to mention that the source of the data might not support stream positioning anyway (e.g. a compressed stream), or might simply have been passed to your program/library as an open stream handle that you can't reopen, so you only have one shot.

Regards,
Chris |
From: Brian P. <bri...@in...> - 2007-10-09 16:41:39
|
Heap fragmentation has a performance cost that persists past the initial allocation(s), since it affects further allocations as well. If it can be avoided with a relatively simple mechanism like this, that's a good thing.

I started coding in 1977, FWIW. Long enough to learn to prefer the simple solution over the one that requires a gestalt...

To be fair, having done this stuff for a long time isn't really a predictor of me being any good at it, but I get by OK.

- Brian |
From: Mike C. <tu...@gm...> - 2007-10-09 16:21:07
|
I can see why having a 'count' might make it easier for novice programmers to *write* a processing program, but I cannot see why having a 'count' would make more than a negligible difference in performance, if even that. As a worst case, one could read the mzML file into memory, scan it once to calculate the count, and then proceed as before. The additional time required to do a sweep through RAM would be trivial.

Mike

On 10/9/07, Marc Sturm <st...@in...> wrote:
> I would like the count attributes to stay, at least for the spectrum
> list and peak list.
> Knowing the number of elements can make a huge performance difference in
> some languages e.g. C++. |
From: David C. <dc...@ma...> - 2007-10-09 13:05:38
|
I'd like them to stay for the same reason. If the count is correct, then performance is improved slightly and there will be less memory fragmentation. If the count isn't correct, we won't report an error and it will just be less efficient.

David

Marc Sturm wrote:
> I would like the count attributes to stay, at least for the spectrum
> list and peak list.
> Knowing the number of elements can make a huge performance difference in
> some languages e.g. C++.
>
> - Marc
>
> Angel Pizarro wrote:
>> Regarding this count attribute issue, I tally:
>>
>> - Angel discourages them
>> ...
>> - silence from everyone else |
From: Marc S. <st...@in...> - 2007-10-09 08:54:31
|
I would like the count attributes to stay, at least for the spectrum list and peak list.
Knowing the number of elements can make a huge performance difference in some languages e.g. C++.

- Marc

Angel Pizarro wrote:
>
> Regarding this count attribute issue, I tally:
>
> - Angel discourages them
> ...
> - silence from everyone else
> |
From: Angel P. <an...@ma...> - 2007-10-09 01:12:01
|
All good points by Eric. RDF can indeed be simplified to just the triplet tuple he mentions. The part he glosses over though is RDF schema, and this is where the main structure of an ontology is defined and constraints are defined on what (and maybe when / where, but don't quote me) you can put terms and values in instance RDF documents. Notice I used "ontology", not standard or data format. While we can bastardize an mzML written as RDFS and name a version as a standard, this is not really done in the RDF world. RDFS are released on fairly short timelines as new terms or changes are approved by a committee, so RDFS/RDF is more likely to be of use to the CV development team, certainly not the schema development efforts.

An overriding concern of mine, though, is time. The mzML schema specification should in no way be held up by any issues dealing with RDFS and RDF, even the CV. Everyone is too invested in getting the product out the door. Standards don't need to be perfect, they just have to work (credit to Norman).

So I am sorry to have brought it up, but hopefully Eric has shut the door on that option for mzML v1.

-angel

On 10/8/07, Eric Deutsch <ede...@sy...> wrote:
>
> Regarding RDF, I would like to suggest that this is not an option at this
> time. RDF was in fact suggested at the DC meeting, and it was concluded that
> it is such a departure from current formats, that we cannot support it at
> this time. We do not have the resources to pull it off.
>
> Having said that, I would summarize RDF as the antithesis of everything
> you want out of mzML. RDF can be oversimplified (by me) as essentially a
> listing of facts of:
>
> Subject verb predicate
>
> wherein each noun and verb is carefully defined in an ontology (not just a
> controlled vocabulary) such that true meaning can be inferred from
> unstructured data. So, in pseudoRDF, our documents would go like this:
>
> Eric has_produced this_mzML_document
> Eric is_a contact
> Eric has_full_name Eric Deutsch
> Eric has_email_address ede...@fu...
> This_mzML_document is_a mzML_document
> This_mzML_document contains_a_run run1
> Spectrum1 was_generated_in_run run1
> Spectrum1 has_type precursor_ion_scan
>
> The structure is that there is no structure. You are free to list every
> fact that is relevant in any order. However, each noun and verb must be
> defined in the context of an ontology (or probably multiple ontologies).
>
> The beauty is that no one ever needs to argue about xsd schemas or two
> different formats for the same thing any more. Wheee!
>
> The um, downside, is that your software to deal (effectively) with it
> needs to be 10x more brilliant than the best piece of code you've written so
> far.
>
> Cheers,
> Eric
>
> ------------------------------
>
> *From:* psi...@li... [mailto:psi...@li...] *On Behalf Of *Brian Pratt
> *Sent:* Monday, October 08, 2007 4:14 PM
> *To:* psi...@li...
> *Subject:* Re: [Psidev-ms-dev] more is_a vs. part_of errors?
>
> Hi Angel,
>
> This may be a bit esoteric, but I wanted to ask what advantage RDF might
> have over the older W3C XML schema (.xsd). I'm unfamiliar with RDF, and
> from my 20 minutes of googling it appears rather more complex than .xsd –
> certainly more complex than it would need to be to handle the kinds of
> things mzData and mzXML do today, but I'm sure I'm flaunting my ignorance.
>
> I see that there are (but don't completely understand the nature of)
> relationships between RDF, OWL, OBO, and CV. Presumably you see some means
> of exploiting these relationships? I have a lot to learn if we go this
> route, but it sounds interesting. At least we'd get to say "semantic web" a
> lot, which sounds cool.
>
> >> I believe that there is an OBO to RDF perl tools someplace.
>
> Maybe this (java, I think):
>
> http://www.cs.utexas.edu/~hamid/research/obo2owl.cgi
>
> Thanks,
>
> Brian
>
> ------------------------------
>
> *From:* psi...@li... [mailto:psi...@li...] *On Behalf Of *Angel Pizarro
> *Sent:* Saturday, October 06, 2007 5:17 PM
> *To:* Mass spectrometry standard development
> *Subject:* Re: [Psidev-ms-dev] more is_a vs. part_of errors?
>
> I wouldn't spend too much time trying to parse OBO files into XML schema.
> The format grew out of a need for quick and dirty CV with some ontology
> structure editing and there is really only one library editor that works
> with it, namely the author's tools of the OBO format itself.
>
> As a side note, and completely my own opinion, but if mzML were to use RDF
> schema for the schema and RDF for the CV, validation and everything else
> would fall into place. I believe that there is an OBO to RDF perl tools
> someplace.
>
> - angel
>
> On 10/6/07, *Matt Chambers* <mat...@va...> wrote:
>
> Good catches in the CV. Who is in charge of maintaining it and are they
> reading this list? :) I agree with auto-generating a XML schema with
> full semantic relationships encoded in it, direct from the CV, but you
> haven't addressed the issue I mentioned earlier. To do the
> auto-generation into CV params (if we choose method A) will be very ugly
> but it will allow for synonyms on the category names and value names. To
> implement the cvParam categories as XML elements though, you lose the
> ability to have synonyms for category names (unless you use the
> accession number of the category as the element name, which makes me
> shudder), but the final schema would look a lot nicer.
>
> -Matt
>
> Brian Pratt wrote:
> >
> > There are a handful of other cases where it appears that the authors
> > have gotten "is a" and "part_of" confused.
My proposed corrections (IN > > CAPS) inline: > > > > MS:1000025 "magnetic field strength" > > > > part of MS:1000480 "analyzer attribute" > > > > is a (PART_OF) MS:1000451 "analyzer description" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000024 "final MS exponent" > > > > part of MS:1000480 "analyzer attribute" > > > > is a (PART_OF) MS:1000451 "analyzer description" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000022 "TOF Total Path Length" > > > > part of MS:1000480 "analyzer attribute" > > > > is a (PART_OF) MS:1000451 "analyzer description" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000014 "accuracy" > > > > part of MS:1000480 "analyzer attribute" > > > > is a (PART_OF) MS:1000451 "analyzer description" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000106 "on" > > > > is a MS:1000021 "reflectron state" > > > > part of MS:1000480 "analyzer attribute" > > > > is a (PART_OF) MS:1000451 "analyzer description" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000105 "off" > > > > is a MS:1000021 "reflectron state" > > > > part of MS:1000480 "analyzer attribute" > > > > is a (PART_OF) MS:1000451 "analyzer description" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > The following changes would make the Thermo and ABI stuff look like > > all the other vendors: > > > > MS:1000495 "Applied Biosystems" > > > > part of (IS_A) MS:1000121 "ABI / SCIEX" > > > > is a MS:1000031 "model by vendor" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000176 "MAT95XP Trap" > > > > 
is a (IS_A) MS:1000493 "Finnigan MAT" > > > > part of MS:1000483 "Thermo Fisher Scientific" > > > > is a MS:1000031 "model by vendor" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000175 "MAT95XP" > > > > is a MS:1000493 "Finnigan MAT" > > > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > > > is a MS:1000031 "model by vendor" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000174 "MAT900XP Trap" > > > > is a MS:1000493 "Finnigan MAT" > > > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > > > is a MS:1000031 "model by vendor" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000173 "MAT900XP" > > > > is a MS:1000493 "Finnigan MAT" > > > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > > > is a MS:1000031 "model by vendor" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > MS:1000172 "MAT253" > > > > is a MS:1000493 "Finnigan MAT" > > > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > > > is a MS:1000031 "model by vendor" > > > > part of MS:1000463 "instrument description" > > > > part of MS:0000000 "MZ controlled vocabularies" > > > > I still think there's a schema in there, albeit jammed in slightly > > sideways at the moment. > > > > - Brian > > > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > _______________________________________________ > Psidev-ms-dev mailing list > Psi...@li... 
> https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004 |
From: Eric D. <ede...@sy...> - 2007-10-08 23:37:06
|
Regarding RDF, I would like to suggest that this is not an option at this time. RDF was in fact suggested at the DC meeting, and it was concluded that it is such a departure from current formats that we cannot support it at this time. We do not have the resources to pull it off.

Having said that, I would summarize RDF as the antithesis of everything you want out of mzML. RDF can be oversimplified (by me) as essentially a listing of facts of the form:

subject predicate object

wherein each noun (subject or object) and verb (predicate) is carefully defined in an ontology (not just a controlled vocabulary) such that true meaning can be inferred from unstructured data. So, in pseudo-RDF, our documents would go like this:

Eric has_produced this_mzML_document
Eric is_a contact
Eric has_full_name Eric Deutsch
Eric has_email_address ede...@fu...
this_mzML_document is_a mzML_document
this_mzML_document contains_a_run run1
Spectrum1 was_generated_in_run run1
Spectrum1 has_type precursor_ion_scan

The structure is that there is no structure. You are free to list every fact that is relevant in any order. However, each noun and verb must be defined in the context of an ontology (or probably multiple ontologies).

The beauty is that no one ever needs to argue about xsd schemas or two different formats for the same thing any more. Wheee!

The, um, downside is that your software to deal (effectively) with it needs to be 10x more brilliant than the best piece of code you've written so far.

Cheers,
Eric |
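Eric's description of RDF as an unordered list of subject, predicate, object facts can be made concrete with a small sketch. This is not real RDF (no URIs, no ontology, no RDF library); it only stores some of the pseudo-RDF statements from the message as Python tuples. The helper names `objects_of` and `subjects_with` are invented for illustration.

```python
# Pseudo-RDF facts from the message, as (subject, predicate, object) tuples.
# Illustrative only: real RDF identifies terms by URI and defines them in an ontology.
TRIPLES = [
    ("Eric", "has_produced", "this_mzML_document"),
    ("Eric", "is_a", "contact"),
    ("Eric", "has_full_name", "Eric Deutsch"),
    ("this_mzML_document", "is_a", "mzML_document"),
    ("this_mzML_document", "contains_a_run", "run1"),
    ("Spectrum1", "was_generated_in_run", "run1"),
    ("Spectrum1", "has_type", "precursor_ion_scan"),
]


def objects_of(triples, subject, predicate):
    """Return every object asserted for (subject, predicate)."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def subjects_with(triples, predicate, obj):
    """Return every subject for which (predicate, obj) is asserted."""
    return [s for (s, p, o) in triples if p == predicate and o == obj]


if __name__ == "__main__":
    print(objects_of(TRIPLES, "Spectrum1", "has_type"))
    print(subjects_with(TRIPLES, "was_generated_in_run", "run1"))
```

Because each fact stands alone, the lookups give the same answer whatever order the facts are listed in, which is the "no structure" property the message describes. What this sketch leaves out is the hard part Eric points to: software that infers meaning from the ontology behind each term.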
From: Brian P. <bri...@in...> - 2007-10-08 23:15:47
|
Hi Angel,

This may be a bit esoteric, but I wanted to ask what advantage RDF might have over the older W3C XML schema (.xsd). I'm unfamiliar with RDF, and from my 20 minutes of googling it appears rather more complex than .xsd - certainly more complex than it would need to be to handle the kinds of things mzData and mzXML do today, but I'm sure I'm flaunting my ignorance.

I see that there are (but don't completely understand the nature of) relationships between RDF, OWL, OBO, and CV. Presumably you see some means of exploiting these relationships? I have a lot to learn if we go this route, but it sounds interesting. At least we'd get to say "semantic web" a lot, which sounds cool.

>> I believe that there is an OBO-to-RDF Perl tool someplace.

Maybe this (java, I think): http://www.cs.utexas.edu/~hamid/research/obo2owl.cgi

Thanks,

Brian

_____
From: psi...@li... [mailto:psi...@li...] On Behalf Of Angel Pizarro
Sent: Saturday, October 06, 2007 5:17 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] more is_a vs. part_of errors?

I wouldn't spend too much time trying to parse OBO files into XML schema. The format grew out of a need for a quick and dirty CV with some ontology structure editing, and there is really only one library/editor that works with it, namely the tools written by the authors of the OBO format itself.

As a side note, and completely my own opinion: if mzML were to use RDF schema for the schema and RDF for the CV, validation and everything else would fall into place. I believe that there is an OBO-to-RDF Perl tool someplace.

- angel

On 10/6/07, Matt Chambers <mat...@va...> wrote:
Good catches in the CV. Who is in charge of maintaining it and are they reading this list? :) I agree with auto-generating an XML schema with full semantic relationships encoded in it, direct from the CV, but you haven't addressed the issue I mentioned earlier.
To do the auto-generation into CV params (if we choose method A) will be very ugly but it will allow for synonyms on the category names and value names. To implement the cvParam categories as XML elements though, you lose the ability to have synonyms for category names (unless you use the accession number of the category as the element name, which makes me shudder), but the final schema would look a lot nicer. -Matt Brian Pratt wrote: > > There are a handful of other cases where it appears that the authors > have gotten "is a" and "part_of" confused. My proposed corrections (IN > CAPS) inline: > > MS:1000025 "magnetic field strength" > > part of MS:1000480 "analyzer attribute" > > is a (PART_OF) MS:1000451 "analyzer description" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000024 "final MS exponent" > > part of MS:1000480 "analyzer attribute" > > is a (PART_OF) MS:1000451 "analyzer description" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000022 "TOF Total Path Length" > > part of MS:1000480 "analyzer attribute" > > is a (PART_OF) MS:1000451 "analyzer description" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000014 "accuracy" > > part of MS:1000480 "analyzer attribute" > > is a (PART_OF) MS:1000451 "analyzer description" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000106 "on" > > is a MS:1000021 "reflectron state" > > part of MS:1000480 "analyzer attribute" > > is a (PART_OF) MS:1000451 "analyzer description" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000105 "off" > > is a MS:1000021 "reflectron state" > > part of MS:1000480 "analyzer attribute" > > is a (PART_OF) MS:1000451 "analyzer description" > > part of MS:1000463 "instrument description" > > part 
of MS:0000000 "MZ controlled vocabularies" > > The following changes would make the Thermo and ABI stuff look like > all the other vendors: > > MS:1000495 "Applied Biosystems" > > part of (IS_A) MS:1000121 "ABI / SCIEX" > > is a MS:1000031 "model by vendor" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000176 "MAT95XP Trap" > > is a (IS_A) MS:1000493 "Finnigan MAT" > > part of MS:1000483 "Thermo Fisher Scientific" > > is a MS:1000031 "model by vendor" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000175 "MAT95XP" > > is a MS:1000493 "Finnigan MAT" > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > is a MS:1000031 "model by vendor" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000174 "MAT900XP Trap" > > is a MS:1000493 "Finnigan MAT" > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > is a MS:1000031 "model by vendor" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000173 "MAT900XP" > > is a MS:1000493 "Finnigan MAT" > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > is a MS:1000031 "model by vendor" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > MS:1000172 "MAT253" > > is a MS:1000493 "Finnigan MAT" > > part of (IS_A) MS:1000483 "Thermo Fisher Scientific" > > is a MS:1000031 "model by vendor" > > part of MS:1000463 "instrument description" > > part of MS:0000000 "MZ controlled vocabularies" > > I still think there's a schema in there, albeit jammed in slightly > sideways at the moment. > > - Brian > ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. 
Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ Psidev-ms-dev mailing list Psi...@li... https://lists.sourceforge.net/lists/listinfo/psidev-ms-dev -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 |
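The is_a versus part_of corrections quoted in this thread are easier to follow next to the OBO flat-file syntax, where the two relationships are written differently (`is_a:` versus `relationship: part_of`). The sketch below parses two hand-written stanzas whose relationships follow the listing quoted above for MS:1000106 and MS:1000021. The stanza text is illustrative and not copied from the actual CV file, and `parse_terms` is a hypothetical helper, not part of any OBO library.

```python
# Hand-written OBO-style stanzas; illustrative, not taken from the real CV file.
OBO_TEXT = """\
[Term]
id: MS:1000106
name: on
is_a: MS:1000021 ! reflectron state

[Term]
id: MS:1000021
name: reflectron state
relationship: part_of MS:1000480 ! analyzer attribute
"""


def parse_terms(text):
    """Map each term id to its name, is_a parents and part_of parents."""
    terms = {}
    current = None
    for raw in text.splitlines():
        # Drop the trailing "! comment" (a simplification: assumes no "!" in names).
        line = raw.split("!", 1)[0].strip()
        if line == "[Term]":
            current = {"name": None, "is_a": [], "part_of": []}
            continue
        if not line or current is None:
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "relationship":
            rel, _, target = value.partition(" ")
            if rel == "part_of":
                current["part_of"].append(target.strip())
    return terms


if __name__ == "__main__":
    for term_id, term in parse_terms(OBO_TEXT).items():
        print(term_id, term)
```

Keeping the two relationship kinds in separate lists makes a mistaken is_a (where part_of was meant) visible as an entry in the wrong list, which is the kind of error the proposed corrections address.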
From: Brian P. <bri...@in...> - 2007-10-08 21:43:11
|
Hi Matt,

>> CV is organized by accession numbers, which are unique, whereas the schema is organized by element names, which are usually unique but not always.

Right, but I propose to use XSD value restriction syntax to associate each element with a unique accession number. The schema can declare things like "a 'foo' element will always have an attribute named 'accession' which has exactly one legal value 'MS:12345', and a 'bar' element will always have an attribute named 'accession' which has exactly one legal value 'MS:54321'". Throw in the inheritance mechanisms of W3C schema and even if you wound up with two elements with the same name (in different branches of the inheritance tree, of course) they'd still be instantly uniquely identifiable by the value given in the accession attribute, and a validating parser could automagically intercept bogus accession numbers.

Let's imagine an element "foo" with a subelement "crunchyCoating", and another element "bar" that also, as it happens, has a subelement named "crunchyCoating". Because we have assigned each element a unique accession number and used XSD restriction to enforce it, we can take an element instance completely out of context:

<crunchyCoating accession="MS:1000321" value="12.5"/>

and still understand it, and tell it apart from this other one taken out of context:

<crunchyCoating accession="MS:1000777" value="My cat's breath smells like cat food."/>

Moving over to our CV (or schema) we can learn that MS:1000321 is defined as "Snell hardness of candy bar outer layer" and MS:1000777 is defined as "Ralph Wiggum quote".

In practice, I'd probably want to declare the accession attribute optional since for most applications it's just a waste of bytes (can be derived from context+schema), but for the deeply paranoid it can be there explicitly.

>> I'd rather extend the OBO format with the features we need unless such an extension would be prohibitively difficult to implement.
I'm sorry, that just seems insane when the whole W3C ecosystem already exists to deal with these sorts of mundane data typing and validation issues. There must be better uses of our time.

>> As for changes in the parser code, are you referring to the semantic validator or to an applied user of the format?

I was thinking of the applied use of the format, but it's true in either case. From what I can tell, most anticipated changes to the ontology are really just additions to attribute value restriction lists (again with the adding a new mass spec model example), which really ought not to force changes to reader code for the most part. Returning to our favored example, it's just a different string to put in the "mass spec type" record in your database or what have you (OK, maybe the new model represents a whole new technology and your app actually needs a big rewrite, but you take my meaning, I hope).

- Brian

_____
From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Monday, October 08, 2007 2:00 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of errors?)

Brian Pratt wrote:
Hi Matt, Sorry, I had meant to explicitly point out how the XSD orientation addresses your synonym concerns, although in the end I think I misunderstood them. Each element has associated with it precisely one correct accession attribute value, and you can use that to determine whether or not the element is actually the thing you suspect it is since all true synonyms point back to the same accession number. The element names remain stable, as one would hope. I don't understand, though, why you're interested in making it easy to *introduce* synonyms - I was assuming that the purpose of this standardization effort was to *do away* with synonyms so as to reduce ambiguity.
I agree that doing away with synonyms is pretty much the whole purpose of a controlled vocabulary, both for values and for categories, but others have apparently supported it (I think Lennart was the last one to remind me of supporting synonyms which was the reason to give controlled values accession numbers). If we had a CV format capable of representing the controlled values, then that would be a simpler way to maintain the synonyms than with the schema. This is because the CV is organized by accession numbers, which are unique, whereas the schema is organized by element names, which are usually unique but not always. I assume semantic validation is a goal, or we wouldn't have the business with reflectron state going on. In any case, a spec that doesn't lead to semantic validation is a poor sort of spec. Agreed. IS_A and PART_OF already have well defined meanings (see http://obofoundry.org/ro/), so we really can't redefine them for our own purposes. The mechanism for enumerating a value range just isn't there, so the authors have tried to hack it with the inheritance techniques available, which leads to all the gyrations over how to add a new instrument type. This is just a sign that we're trying to drive a screw with a hammer, or whatever metaphor you prefer for a "not even wrong" scenario. It seems that OBO is currently incapable of providing a way for us to control the values for our categories and that it's not really intended for that. So we either must extend it with support for that relationship (as well as specifying types and ranges for uncontrolled value categories), or forget it entirely and stick with the schema. Personally I prefer the accession numbers for the categories and for the values, so I'd rather extend the OBO format with the features we need unless such an extension would be prohibitively difficult to implement. 
I don't understand the assertion that pushing the maintenance load into the CV brings greater flexibility (nor the use of the term "flat" in describing the CV, which is just an obfuscated inheritance tree). Maintaining the CV directly has now been demonstrated as providing plenty of flexibility to screw up the inheritance hierarchy of the terms, but that's not a good thing, and doesn't seem inherently more flexible than doing the maintenance in an XSD. In either case, the vast majority of changes one is likely to make are along the lines of adding a new instrument type, which would not engender a change to the parser code. No, wait, it WOULD engender a change to the parser code in a CV-centric world, because the only way to express restriction lists is through inheritance instead of simple value restrictions. So, it's actually less flexible to maintain the CV. I was only referring to the support for synonyms. If synonyms are rejected, then as far as I can tell, maintaining the CV with an auto-generated schema would be worse than maintaining a hand-rolled schema by itself. As for changes in the parser code, are you referring to the semantic validator or to an applied user of the format? In the former case, the schema would either be edited by hand or be auto-generated (with updated schema restrictions) after a CV update, neither of which would require an update to the validator. In the latter case, we come back to the A, B, and C options for cvParams (not to mention D, E, and F ;) ). If the cvParams stop at the category level, then parsers needn't be updated to understand new values. If cvParams can refer to a value by itself, then the parser is a pain in the ass to write and it would need to be updated whenever the CV/schema was updated. -Matt |
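Brian's proposal is that each element has exactly one legal accession value, so an element can be identified even when two elements share a name. In W3C XML Schema an attribute declaration can carry a `fixed` value, which is one way such a restriction might be written. The sketch below does not use a real XSD validator (the Python standard library does not include one); it emulates the rule with a lookup table. The element paths, the `value` attribute, and the `check` function are assumptions made for illustration, and the accession numbers are the made-up ones from the message.

```python
import xml.etree.ElementTree as ET

# (parent element, child element) -> the single legal accession for that child.
# Stands in for a fixed-value restriction in a schema; accessions are made up.
LEGAL_ACCESSION = {
    ("foo", "crunchyCoating"): "MS:1000321",
    ("bar", "crunchyCoating"): "MS:1000777",
}

DOC = """\
<root>
  <foo><crunchyCoating accession="MS:1000321" value="12.5"/></foo>
  <bar><crunchyCoating accession="MS:1000777" value="a quote"/></bar>
</root>
"""


def check(xml_text):
    """Return (path, found, expected) for every element with a bogus accession."""
    errors = []
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        for child in parent:
            expected = LEGAL_ACCESSION.get((parent.tag, child.tag))
            # The attribute is treated as optional; when present it must match.
            found = child.get("accession", expected)
            if expected is not None and found != expected:
                errors.append((f"{parent.tag}/{child.tag}", found, expected))
    return errors


if __name__ == "__main__":
    print(check(DOC))
```

Treating a missing attribute as valid mirrors the suggestion in the message that the accession could be declared optional, since it can be derived from context plus schema.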
From: Matthew C. <mat...@va...> - 2007-10-08 21:00:25
|
Brian Pratt wrote:
> Hi Matt,
>
> Sorry, I had meant to explicitly point out how the XSD orientation addresses your synonym concerns, although in the end I think I misunderstood them. Each element has associated with it precisely one correct accession attribute value, and you can use that to determine whether or not the element is actually the thing you suspect it is since all true synonyms point back to the same accession number. The element names remain stable, as one would hope. I don't understand, though, why you're interested in making it easy to *introduce* synonyms - I was assuming that the purpose of this standardization effort was to *do away* with synonyms so as to reduce ambiguity.

I agree that doing away with synonyms is pretty much the whole purpose of a controlled vocabulary, both for values and for categories, but others have apparently supported it (I think Lennart was the last one to remind me of supporting synonyms which was the reason to give controlled values accession numbers). If we had a CV format capable of representing the controlled values, then that would be a simpler way to maintain the synonyms than with the schema. This is because the CV is organized by accession numbers, which are unique, whereas the schema is organized by element names, which are usually unique but not always.

> I assume semantic validation is a goal, or we wouldn't have the business with reflectron state going on. In any case, a spec that doesn't lead to semantic validation is a poor sort of spec.

Agreed.

> IS_A and PART_OF already have well defined meanings (see http://obofoundry.org/ro/), so we really can't redefine them for our own purposes. The mechanism for enumerating a value range just isn't there, so the authors have tried to hack it with the inheritance techniques available, which leads to all the gyrations over how to add a new instrument type. This is just a sign that we're trying to drive a screw with a hammer, or whatever metaphor you prefer for a "not even wrong" scenario.

It seems that OBO is currently incapable of providing a way for us to control the values for our categories and that it's not really intended for that. So we either must extend it with support for that relationship (as well as specifying types and ranges for uncontrolled value categories), or forget it entirely and stick with the schema. Personally I prefer the accession numbers for the categories and for the values, so I'd rather extend the OBO format with the features we need unless such an extension would be prohibitively difficult to implement.

> I don't understand the assertion that pushing the maintenance load into the CV brings greater flexibility (nor the use of the term "flat" in describing the CV, which is just an obfuscated inheritance tree). Maintaining the CV directly has now been demonstrated as providing plenty of flexibility to screw up the inheritance hierarchy of the terms, but that's not a good thing, and doesn't seem inherently more flexible than doing the maintenance in an XSD. In either case, the vast majority of changes one is likely to make are along the lines of adding a new instrument type, which would not engender a change to the parser code. No, wait, it WOULD engender a change to the parser code in a CV-centric world, because the only way to express restriction lists is through inheritance instead of simple value restrictions. So, it's actually less flexible to maintain the CV.

I was only referring to the support for synonyms. If synonyms are rejected, then as far as I can tell, maintaining the CV with an auto-generated schema would be worse than maintaining a hand-rolled schema by itself. As for changes in the parser code, are you referring to the semantic validator or to an applied user of the format? In the former case, the schema would either be edited by hand or be auto-generated (with updated schema restrictions) after a CV update, neither of which would require an update to the validator. In the latter case, we come back to the A, B, and C options for cvParams (not to mention D, E, and F ;) ). If the cvParams stop at the category level, then parsers needn't be updated to understand new values. If cvParams can refer to a value by itself, then the parser is a pain in the ass to write and it would need to be updated whenever the CV/schema was updated.

-Matt |
From: Brian P. <bri...@in...> - 2007-10-08 20:25:20
|
Hi Matt,

Sorry, I had meant to explicitly point out how the XSD orientation addresses your synonym concerns, although in the end I think I misunderstood them. Each element has associated with it precisely one correct accession attribute value, and you can use that to determine whether or not the element is actually the thing you suspect it is, since all true synonyms point back to the same accession number. The element names remain stable, as one would hope. I don't understand, though, why you're interested in making it easy to *introduce* synonyms - I was assuming that the purpose of this standardization effort was to *do away* with synonyms so as to reduce ambiguity.

I assume semantic validation is a goal, or we wouldn't have the business with reflectron state going on. In any case, a spec that doesn't lead to semantic validation is a poor sort of spec.

IS_A and PART_OF already have well defined meanings (see http://obofoundry.org/ro/), so we really can't redefine them for our own purposes. The mechanism for enumerating a value range just isn't there, so the authors have tried to hack it with the inheritance techniques available, which leads to all the gyrations over how to add a new instrument type. This is just a sign that we're trying to drive a screw with a hammer, or whatever metaphor you prefer for a "not even wrong" scenario.

I don't understand the assertion that pushing the maintenance load into the CV brings greater flexibility (nor the use of the term "flat" in describing the CV, which is just an obfuscated inheritance tree). Maintaining the CV directly has now been demonstrated as providing plenty of flexibility to screw up the inheritance hierarchy of the terms, but that's not a good thing, and doesn't seem inherently more flexible than doing the maintenance in an XSD. In either case, the vast majority of changes one is likely to make are along the lines of adding a new instrument type, which would not engender a change to the parser code.
No, wait, it WOULD engender a change to the parser code in a CV-centric world, because the only way to express restriction lists is through inheritance instead of simple value restrictions. So, it's actually less flexible to maintain the CV.

- Brian

_____
From: psi...@li... [mailto:psi...@li...] On Behalf Of Matthew Chambers
Sent: Monday, October 08, 2007 12:19 PM
To: Mass spectrometry standard development
Subject: Re: [Psidev-ms-dev] MANIFESTO TIME! (was RE: more is_a vs. part_of errors?)

I'll respond inline to both posts, I didn't really think straight about the first post when I read it over the weekend.

Brian Pratt wrote:

Eh, it's even more broken than I thought. I've amended my amendments inline below, new changes in double parenthesis.

After a day or so of messing with this, it is now:

MANIFESTO TIME!

RESOLVED:
The mzML specification process should be schema-centric, and the CV should be generated from the schema (should be a fairly simple matter of XSLT, since XSD is itself XML).

I agree that given some arguments made on this list recently that have not been properly addressed (on the list, anyway), the ever-changing CV with a meaningless (i.e. without semantic relationships) but stable XML schema is not practically different from an ever-changing XML schema. If semantic validation is not a requirement, the CV route starts to look better, but I think we all agree that semantic validation should be possible.

REASON 1: THE CV-CENTRIC APPROACH IS ERROR PRONE.
The kinds of inheritance errors shown below are, if not actually impossible, much harder to make in the context of a W3C schema when using readily available software tools to create and maintain the schema.

REASON 2: OBO/CV IS AN INSUFFICIENT TOOL FOR THE JOB OF PRODUCING A READILY AND THOROUGHLY VALIDATABLE DATA FORMAT.
CV apparently provides no means for specifying range or formatting of instance values.
An "isolation width" (MS:1000023) could happily have a value of "-2", "2", "two", or "extra sprinkles, please". You could (and should) certainly put some text in the description along the lines of "this is a non-negative floating point value" but that's no help to a validating parser. XSD on the other hand has standardized syntax for enforcing precisely these kinds of restrictions, meaning that validating parsers and code generators (for both read and write) don't need any special-purpose logic added.

There are a handful of places where value range restrictions have been attempted in the MS CV, but these are awkward because of the tools. The reflectron_state, for example, has two children "on" and "off", but this only confuses things, since these are not *values* of reflectron state but rather *are* reflectron states, a distinction which may be meaningless in English but significant when attempting to create a data structure. Picture how this looks in an instance doc:

    <cvParam cvLabel="MS" accession="MS:1000105" name="off" value="" />

I can't think of anything nice to say about that. Better it should read:

    <reflectronState accession="MS:1000021" off/>

I disagree somewhat with these reasons, because you have not addressed the issue of synonyms in the category names from the CV. With your "Better it should read" version of "reflectron state", it would be very impractical to introduce a synonymous category called "reflectron status" or synonymous on/off values "enabled/disabled". With the cvParam version, introducing those synonyms would be simple (in the CV, not simple in the schema which would get auto-generated from the updated CV).

In order to auto-generate an XML schema from the CV, you are definitely right about the CV needing type restrictions which can map directly to XML schema types.
Some CV categories have controlled values with their own accession numbers (like on and off for reflectron state), while other CV categories have uncontrolled values without accession numbers (like precursor charge, m/z, CID energy, etc.). Those relationships and restrictions must be propagated correctly, and I think that's where the distinction between IS_A and PART_OF comes in. A "<subject> IS_A <object>" relationship implies that the subject is a controlled value of the object, which should be a category. I do not think there should ever be a "<subject> IS_A <object1> IS_A <object2>" relationship. When a CV branch ends with an IS_A relationship, the auto-generator can infer that the subject is one of the possible values for the object. Categories can be within more comprehensive categories though, and as I understand it, that is the purpose of the PART_OF relationship. The other case the auto-generator needs to handle is categories which have uncontrolled values (e.g. precursor m/z) which will not end with an IS_A relationship, and that in itself should indicate to the auto-generator that the category's values are not controlled.

If I am confused about the meaning of IS_A and PART_OF in the CV, let's please have a discussion about that because there really must be some way to infer whether a CV term is a value or a category.

CONCLUSION: THE CV WORK TO DATE IS IMPORTANT AND USEFUL, BUT SHOULD BE RECAST AS SCHEMA WORK
The CV should not attempt to be a replacement for the schema - it just hasn't got the requisite mechanisms to do the job. The information CV can convey is only a subset of the information that is needed to fully specify a data format. The information in the CV as it stands should be folded into the mzML schema, and maintained therein moving forward. An actual OBO/CV file can be generated as needed.
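The inference rule Matt describes above (a term that IS_A something is a controlled value of that category; a leaf category with no IS_A children takes uncontrolled values) can be sketched roughly as follows. This is only an illustration of the rule as stated in this thread, not an existing tool: the function name `classify` and the edge-list layout are assumptions, and the edges are the few relationships quoted in this thread rather than the full CV.

```python
from collections import defaultdict

# (subject accession, relation, object accession), as quoted in this thread.
EDGES = [
    ("MS:1000106", "is_a", "MS:1000021"),     # "on" is_a "reflectron state"
    ("MS:1000105", "is_a", "MS:1000021"),     # "off" is_a "reflectron state"
    ("MS:1000021", "part_of", "MS:1000480"),  # "reflectron state" part_of "analyzer attribute"
    ("MS:1000025", "part_of", "MS:1000480"),  # "magnetic field strength" part_of "analyzer attribute"
]


def classify(edges):
    """Hypothetical helper applying the rule described in the message.

    Returns (controlled, uncontrolled):
      controlled   maps a category accession to its sorted controlled values
      uncontrolled lists leaf categories whose values are free-form
    """
    values_of = defaultdict(list)
    subjects, has_children = set(), set()
    for subj, rel, obj in edges:
        subjects.add(subj)
        has_children.add(obj)
        if rel == "is_a":
            values_of[obj].append(subj)
    # Every term that is the subject of an IS_A edge counts as a value.
    value_terms = {v for vals in values_of.values() for v in vals}
    # A leaf that is not itself a value is a category with uncontrolled values.
    uncontrolled = sorted(
        t for t in subjects if t not in has_children and t not in value_terms
    )
    controlled = {cat: sorted(vals) for cat, vals in values_of.items()}
    return controlled, uncontrolled


controlled, uncontrolled = classify(EDGES)
print(controlled)
print(uncontrolled)
```

With these four edges, "reflectron state" comes out as a category with two controlled values and "magnetic field strength" as a category with uncontrolled values. A chain such as "<subject> IS_A <object1> IS_A <object2>" would make `object1` both a value and a category, which is exactly the ambiguity the message argues should never occur.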
I conclude that it's better to stick with the flat CV for the increased flexibility it provides, and for the purposes of validation, write some kind of script to take a templated version of the current schema and fill it in with all the CV relationships (which define the XSD restrictions) to produce a fully semantic schema.

_____
From: Brian Pratt [mailto:bri...@in...]
Sent: Friday, October 05, 2007 11:52 PM
To: 'Mass spectrometry standard development'
Subject: more is_a vs. part_of errors?

There are a handful of other cases where it appears that the authors have gotten "is a" and "part_of" confused. My proposed corrections (IN CAPS) inline:

MS:1000025 "magnetic field strength"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000024 "final MS exponent"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000022 "TOF Total Path Length"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000014 "accuracy"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

((note, these next two are just ugly, see notes at top of message))

MS:1000106 "on"
  is a MS:1000021 "reflectron state"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000105 "off"
  is a MS:1000021 "reflectron state"
  part of ((IS_A)) MS:1000480 "analyzer attribute"
  is a (PART_OF) MS:1000451 "analyzer description"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

I disagree with all of these, because AFAIK these are subcategories which have uncontrolled values, so they shouldn't end in an IS_A relationship.

The following changes would make the Thermo and ABI stuff look like all the other vendors:

MS:1000495 "Applied Biosystems"
  part of (IS_A) MS:1000121 "ABI / SCIEX"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000176 "MAT95XP Trap"
  is a (IS_A) MS:1000493 "Finnigan MAT"
  part of MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000175 "MAT95XP"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000174 "MAT900XP Trap"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000173 "MAT900XP"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

MS:1000172 "MAT253"
  is a MS:1000493 "Finnigan MAT"
  part of (IS_A) MS:1000483 "Thermo Fisher Scientific"
  is a MS:1000031 "model by vendor"
  part of MS:1000463 "instrument description"
  part of MS:0000000 "MZ controlled vocabularies"

I still think there's a schema in there, albeit jammed in slightly sideways at the moment. (( I don't think that anymore. I think there's a subset of a schema in there. ))

I think these are screwed up. It doesn't make sense to me to have an IS_A relationship mixed in-between PART_OF relationships.

-Matt |
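Both sides of the exchange above rely on the CV's relationships and value constraints being expressible as XSD restrictions, whichever artifact is treated as the master copy. As a rough sketch of what a generator script might emit, the following builds two XSD simple types with Python's standard library: an enumeration of the controlled values for "reflectron state", and a non-negative floating point type of the kind Brian describes for "isolation width". The type names `reflectronStateValue` and `isolationWidthValue` and both helper functions are hypothetical, not part of mzML or any existing tool.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize with the conventional xs: prefix


def enumeration_type(type_name, accessions):
    """Build an xs:simpleType that restricts xs:string to the given accessions."""
    simple = ET.Element(f"{{{XS}}}simpleType", {"name": type_name})
    restriction = ET.SubElement(simple, f"{{{XS}}}restriction", {"base": "xs:string"})
    for accession in accessions:
        ET.SubElement(restriction, f"{{{XS}}}enumeration", {"value": accession})
    return simple


def non_negative_double_type(type_name):
    """Build an xs:simpleType that restricts xs:double to values >= 0."""
    simple = ET.Element(f"{{{XS}}}simpleType", {"name": type_name})
    restriction = ET.SubElement(simple, f"{{{XS}}}restriction", {"base": "xs:double"})
    ET.SubElement(restriction, f"{{{XS}}}minInclusive", {"value": "0"})
    return simple


# Hypothetical type names; the accessions are "off" and "on" from the thread.
reflectron = enumeration_type("reflectronStateValue", ["MS:1000105", "MS:1000106"])
isolation = non_negative_double_type("isolationWidthValue")

reflectron_xsd = ET.tostring(reflectron, encoding="unicode")
isolation_xsd = ET.tostring(isolation, encoding="unicode")
print(reflectron_xsd)
print(isolation_xsd)
```

Adding a new controlled value then means adding one accession to the input list and regenerating, which matches the "auto-generated schema after a CV update" workflow discussed in the thread. A type like `isolationWidthValue` would reject "-2", "two", and "extra sprinkles, please" at schema-validation time, provided the CV carries enough type information to generate it.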
From: Matthew C. <mat...@va...> - 2007-10-08 19:19:21
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type"> </head> <body bgcolor="#ffffff" text="#000000"> I'll respond inline to both posts, I didn't really think straight about the first post when I read it over the weekend.<br> <br> Brian Pratt wrote: <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <meta http-equiv="Content-Type" content="text/html; "> <meta name="Generator" content="Microsoft Word 11 (filtered medium)"> <!--[if !mso]> <style> v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} </style> <![endif]--> <style> <!-- /* Font Definitions */ @font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman";} a:link, span.MsoHyperlink {color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal; font-family:Arial; color:windowtext;} span.EmailStyle18 {mso-style-type:personal-reply; font-family:Arial; color:navy;} span.m1 {color:blue;} span.t1 {color:#990000;} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in;} div.Section1 {page:Section1;} --> </style> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">Eh, it’s even more broken than I thought. I’ve amended my amendments inline below, new changes in double parenthesis. 
<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">After a day so of messing with this, it is now:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">MANIFESTO TIME!<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">RESOLVED:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">The mzML specification process should be schema-centric, and the CV should be generated from the schema (should be a fairly simple matter of XSLT, since XSD is itself XML). <o:p></o:p></span></font></p> </div> </blockquote> I agree that given some arguments made on this list recently that have not been properly addressed (on the list, anyway), the ever-changing CV with a meaningless (i.e. without semantic relationships) but stable XML schema is not practically different from an ever-changing XML schema. 
If semantic validation is not a requirement, the CV route starts to look better, but I think we all agree that semantic validation should be possible.<br> <br> <br> <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">REASON 1: THE CV-CENTRIC APPROACH IS ERROR PRONE.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">The kinds of inheritance errors shown below are, if not actually impossible, much harder to make in the context of a W3C schema when using readily available software tools to create and maintain the schema.<o:p></o:p></span></font></p> </div> </blockquote> <font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p></o:p></span></font> <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">REASON 2: OBO/CV IS AN INSUFFICIENT TOOL FOR THE JOB OF PRODUCING A READILY AND THOROUGHLY VALIDATABLE DATA FORMAT.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">CV apparently provides no means for specifying range or formatting of instance values. 
An “isolation width” (</span></font><font face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New";">MS:1000023) </span></font><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">could happily have a value of “-2”, “2”, “two”, or “extra sprinkles, please”. You could (and should) certainly put some text in the description along the lines of “this is a non-negative floating point value” but that’s no help to a validating parser. XSD on the other hand has standardized syntax for enforcing precisely these kinds of restrictions, meaning that validating parsers and code generators (for both read and write) don’t need any special-purpose logic added. <o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">There are a handful of places where value range restrictions have been attempted in the MS CV, but these are awkward because of the tools. The reflectron_state, for example, has two children “on” and “off”, but this only confuses things, since these are not *<b><span style="font-weight: bold;">values</span></b>* of reflectron state but rather *<b><span style="font-weight: bold;">are</span></b>* reflectron states, a distinction which may be meaningless in English but significant when attempting to create a data structure. 
Picture how this looks in an instance doc:<o:p></o:p></span></font></p> <p class="MsoNormal" style="text-indent: 0.5in;"><span class="m1"><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;"><</span></font></span><span class="t1"><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;">cvParam</span></font></span><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;"> <span class="t1"><font color="black"><span style="color: black;">cvLabel</span></font></span><span class="m1"><font color="black"><span style="color: black;">="</span></font></span><b><span style="font-weight: bold;">MS</span></b><span class="m1"><font color="black"><span style="color: black;">"</span></font></span><span class="t1"><font color="black"><span style="color: black;"> accession</span></font></span><span class="m1"><font color="black"><span style="color: black;">="</span></font></span><b><span style="font-weight: bold;">MS:1000105</span></b><span class="m1"><font color="black"><span style="color: black;">"</span></font></span><span class="t1"><font color="black"><span style="color: black;"> name</span></font></span><span class="m1"><font color="black"><span style="color: black;">="</span></font></span><b><span style="font-weight: bold;">off</span></b><span class="m1"><font color="black"><span style="color: black;">"</span></font></span><span class="t1"><font color="black"><span style="color: black;"> value</span></font></span><span class="m1"><font color="black"><span style="color: black;">="" /><o:p></o:p></span></font></span></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">I can’t think of anything nice to say about that. 
Better it should read:<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"> </span></font><font color="black" face="Courier New" size="2"><span style="font-size: 10pt; font-family: "Courier New"; color: black;"><reflectronState accession=”MS:1000021” off/><o:p></o:p></span></font></p> </div> </blockquote> I disagree somewhat with these reasons, because you have not addressed the issue of synonyms in the category names from the CV. With your "Better it should read" version of "reflectron state", it would be very impractical to introduce a synonymous category called "reflectron status" or synonymous on/off values "enabled/disabled". With the cvParam version, introducing those synonyms would be simple (in the CV, not simple in the schema which would get auto-generated from the updated CV).<br> <br> In order to auto-generate an XML schema from the CV, you are definitely right about the CV needing type restrictions which can map directly to XML schema types. Some CV categories have controlled values with their own accession numbers (like on and off for reflectron state), while other CV categories have uncontrolled values without accession numbers (like precursor charge, m/z, CID energy, etc.). Those relationships and restrictions must be propagated correctly, and I think that's where the distinction between IS_A and PART_OF comes in. A "<subject> IS_A <object>" relationship implies that the subject is a controlled value of the object, which should be a category. I do not think there should ever be a "<subject> IS_A <object1> IS_A <object2>" relationship. When a CV branch ends with an IS_A relationship, the auto-generator can infer that the subject is one of the possible values for the object. Categories can be within more comprehensive categories though, and as I understand it, that is the purpose of the PART_OF relationship. 
The other case the auto-generator needs to handle is categories which have uncontrolled values (e.g. precursor m/z) which will not end with an IS_A relationship, and that in itself should indicate to the auto-generator that the category's values are not controlled.<br> <br> If I am confused about the meaning of IS_A and PART_OF in the CV, let's please have a discussion about that because there really must be some way to infer whether a CV term is a value or a category.<br> <br> <br> <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">CONCLUSION: THE CV WORK TO DATE IS IMPORTANT AND USEFUL, BUT SHOULD BE RECAST AS SCHEMA WORK<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">The CV should not attempt to be a replacement for the schema - it just hasn’t got the requisite mechanisms to do the job. The information CV can convey is only a subset of the information that is needed to fully specify a data format. The information in the CV as it stands should be folded into the mzML schema, and maintained therein moving forward. 
An actual OBO/CV file can be generated as needed.</span></font></p> </div> </blockquote> I conclude that it's better to stick with the flat CV for the increased flexibility it provides, and for the purposes of validation, write some kind of script to take a templated version of the current schema and fill it in with all the CV relationships (which define the XSD restrictions) to produce a fully semantic schema.<font color="navy"><font size="2"><font face="Arial"><br> <br> <br> </font></font></font> <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <div> <div class="MsoNormal" style="text-align: center;" align="center"><font face="Times New Roman" size="3"><span style="font-size: 12pt;"> <hr tabindex="-1" align="center" size="2" width="100%"></span></font></div> <p class="MsoNormal"><b><font face="Tahoma" size="2"><span style="font-size: 10pt; font-family: Tahoma; font-weight: bold;">From:</span></font></b><font face="Tahoma" size="2"><span style="font-size: 10pt; font-family: Tahoma;"> Brian Pratt [<a class="moz-txt-link-freetext" href="mailto:bri...@in...">mailto:bri...@in...</a>] <br> <b><span style="font-weight: bold;">Sent:</span></b> Friday, October 05, 2007 11:52 PM<br> <b><span style="font-weight: bold;">To:</span></b> 'Mass spectrometry standard development'<br> <b><span style="font-weight: bold;">Subject:</span></b> more is_a vs. part_of errors?</span></font><o:p></o:p></p> </div> <p class="MsoNormal"><font face="Times New Roman" size="3"><span style="font-size: 12pt;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">There are a handful of other cases where it appears that the authors have gotten “is a” and “part_of” confused. 
My proposed corrections (IN CAPS) inline:<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000025 "magnetic field strength"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of <font color="navy"><span style="color: navy;">((IS_A)) </span></font>MS:1000480 "analyzer attribute"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (PART_OF) MS:1000451 "analyzer description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000024 "final MS exponent"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of <font color="navy"><span style="color: navy;">((IS_A)) </span></font>MS:1000480 "analyzer attribute"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (PART_OF) MS:1000451 "analyzer description" <o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument 
description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000022 "TOF Total Path Length"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of <font color="navy"><span style="color: navy;">((IS_A)) </span></font>MS:1000480 "analyzer attribute"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (PART_OF) MS:1000451 "analyzer description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000014 "accuracy"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of <font color="navy"><span style="color: navy;">((IS_A)) </span></font>MS:1000480 "analyzer attribute"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (PART_OF) MS:1000451 "analyzer description"<o:p></o:p></span></font></p> <p 
class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"</span></font></p> </div> </blockquote> <font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p></o:p></span></font> <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <div class="Section1"> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p><br> </o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">((note, these next two are just ugly, see notes at top of message))<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000106 "on"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000021 "reflectron state"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of <font color="navy"><span style="color: navy;">((IS_A)) </span></font>MS:1000480 "analyzer attribute"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (PART_OF) MS:1000451 "analyzer description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument 
description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000105 "off"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000021 "reflectron state"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of <font color="navy"><span style="color: navy;">((IS_A)) </span></font>MS:1000480 "analyzer attribute"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (PART_OF) MS:1000451 "analyzer description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"</span></font></p> </div> </blockquote> I disagree with all of these, because AFAIK these are subcategories which have uncontrolled values, so they shouldn't end in an IS_A relationship.<br> <br> <br> <blockquote cite="mid:01b401c809da$4b604520$0e03000a@BRIANTECRA" type="cite"> <div class="Section1"> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font 
face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">The following changes would make the Thermo and ABI stuff look like all the other vendors:<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000495 "Applied Biosystems"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of (IS_A) MS:1000121 "ABI / SCIEX"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000031 "model by vendor"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000176 "MAT95XP Trap"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a (IS_A) MS:1000493 "Finnigan MAT"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of 
MS:1000483 "Thermo Fisher Scientific"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000031 "model by vendor"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000175 "MAT95XP"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000493 "Finnigan MAT"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of (IS_A) MS:1000483 "Thermo Fisher Scientific"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000031 "model by vendor"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000174 "MAT900XP 
Trap"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000493 "Finnigan MAT"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of (IS_A) MS:1000483 "Thermo Fisher Scientific"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000031 "model by vendor"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000173 "MAT900XP"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000493 "Finnigan MAT"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of (IS_A) MS:1000483 "Thermo Fisher Scientific"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000031 "model by vendor"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled 
vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">MS:1000172 "MAT253"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000493 "Finnigan MAT"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of (IS_A) MS:1000483 "Thermo Fisher Scientific"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> is a MS:1000031 "model by vendor"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:1000463 "instrument description"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"> part of MS:0000000 "MZ controlled vocabularies"<o:p></o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;"><o:p> </o:p></span></font></p> <p class="MsoNormal"><font face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial;">I still think there’s a schema in there, albeit jammed in slightly sideways at the moment.<o:p></o:p></span></font></p> <p class="MsoNormal"><font color="navy" face="Arial" size="2"><span style="font-size: 10pt; font-family: Arial; color: navy;">(( I don’t think that anymore. I think there’s a subset of a schema in there. ))</span></font></p> </div> </blockquote> I think these are screwed up. 
It doesn't make sense to me to have an IS_A relationship mixed in between PART_OF relationships.<br> <br> -Matt<br> </body> </html> |