From: Steve F. <sfi...@pc...> - 2005-02-02 22:14:17
folks-

in gus we have a Dots.SequenceType table.

here are the columns:
  nucleotide_type
  sub_type
  strand
  hierarchy [should be hierarchy_depth]
  parent_sequence_type_id
  name
  description

First question: does anybody know of an "emerging standard" for this?

If there is one, then we should include it in the Controlled Vocabs that we package with GUS.

Otherwise, we have, I think, two candidate SequenceTypeCVs:
  - the one provided by Sanger on the wiki:
    http://www.gusdb.org/wiki/index.php/Bootstrap%20data#ExternalDatabase
  - the one currently housed in CBIL's GUS instance

As part of the GUS 3.5 install, we are getting serious about making the loading of CVs much easier. A central part of that is making the CVs available from CBIL's download site (eg, the CBIL anatomy CV).

So, i am thinking that CBIL should choose one (or more) sequence type CVs to provide as downloads. They could be offered in GUS XML format.

Then, the automated GUS CV installer would find them from CBIL just like it will find GO from the GO Consortium.

Any plugin that uses SequenceTypes should *not* hard code the transform, but, instead, take a SequenceTypeMapping file. The file specifies the mapping from input sequence type to that stored in gus (by name). The plugin should pre-scan the input file to detect if there are any illegal sequence types, and warn the user before loading any data.

If users find sequence types that the CBIL CV is missing, they can propose them via the mailing list.

The objective is to:
  1. work with the fact that different input files for a plugin may use different sequence types
  2. get out of the business of ad hoc changes to the sequence types stored in the db

comments?

steve

as a candidate CV the Sequence the SequenceTypesCV as developed by

If not, then, how about this. Plugins that depend on sequence type use a standard config file for sequence type. (this might apply to other loose CVs). The config file specifies the

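The mapping-file idea in the message above could be sketched roughly as follows. This is only an illustration in Python, not GUS code: the tab-delimited file layout, the function names `load_mapping` and `find_illegal_types`, and the `KNOWN_GUS_TYPES` set are all assumptions made for the example.

```python
import io

# Hypothetical stand-in for the names stored in Dots.SequenceType.
KNOWN_GUS_TYPES = {"mRNA", "EST", "tRNA", "rRNA", "oligonucleotide"}


def load_mapping(handle):
    """Read 'input_type<TAB>gus_name' lines into a dict (assumed file format)."""
    mapping = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        input_type, gus_name = line.split("\t")
        mapping[input_type] = gus_name
    return mapping


def find_illegal_types(input_types, mapping, known=KNOWN_GUS_TYPES):
    """Pre-scan step: return input types that are unmapped or map to an unknown name."""
    illegal = set()
    for input_type in input_types:
        if mapping.get(input_type) not in known:
            illegal.add(input_type)
    return sorted(illegal)


# Example mapping file and the sequence types seen in a (made-up) input file.
mapping = load_mapping(io.StringIO("# input\tgus\nmrna\tmRNA\nest\tEST\n"))
bad = find_illegal_types(["mrna", "est", "snoRNA"], mapping)
if bad:
    print("illegal sequence types, nothing loaded:", ", ".join(bad))
```

Running the pre-scan before any insert means a bad input file is rejected as a whole, which matches the goal of warning the user before loading any data.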
From: Steve F. <sfi...@pc...> - 2005-02-02 22:34:18
mike raises the point that the CV we publish should not be in GUS XML format, but in a system-neutral format. then we would need a dedicated plugin like LoadCBILSequenceTypes that would read that format.

steve

_______________________________________________
Gusdev-gusdev mailing list
Gus...@li...
https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev

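One way to picture a system-neutral format is a plain tab-delimited file in which each term names its parent, leaving ids and depths to be computed at load time. The sketch below is an assumption throughout: the three-column layout, the sample terms, and the helpers `read_sequence_types` and `hierarchy_depth` are invented for illustration and do not describe the actual LoadCBILSequenceTypes plugin or the real meaning of the hierarchy_depth column.

```python
import csv
import io

# Assumed neutral layout: one term per line, parent referenced by name.
NEUTRAL_CV = """name\tparent\tdescription
RNA\t\tribonucleic acid
mRNA\tRNA\tmessenger RNA
tRNA\tRNA\ttransfer RNA
"""


def read_sequence_types(handle):
    """Parse the assumed tab-delimited CV and check that every parent is defined."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    names = {row["name"] for row in rows}
    for row in rows:
        parent = row["parent"]
        if parent and parent not in names:
            raise ValueError(f"{row['name']}: unknown parent {parent!r}")
    return rows


def hierarchy_depth(row, by_name):
    """Depth of a term: 0 for roots, parent depth + 1 otherwise (assumes no cycles)."""
    depth = 0
    while row["parent"]:
        row = by_name[row["parent"]]
        depth += 1
    return depth


rows = read_sequence_types(io.StringIO(NEUTRAL_CV))
by_name = {row["name"]: row for row in rows}
depths = {name: hierarchy_depth(row, by_name) for name, row in by_name.items()}
print(depths)
```

Referring to parents by name rather than by database id is what keeps such a file independent of any one GUS instance.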
From: Chris S. <sto...@pc...> - 2005-02-02 22:44:14
Steve,

There are two complementary standards for sequence type. One comes from the MGED Ontology; see
http://mged.sourceforge.net/ontologies/MGEDontology.php#BioSequenceType
The other is SO: http://song.sourceforge.net/

Chris

From: Steve F. <sfi...@pc...> - 2005-02-03 00:03:21
folks-

Having looked at SO and MGED, I am not sure they are capturing what I have in mind, or what we have captured in our SequenceType table.

Here is the way I am thinking about breaking down "sequence type." (If somebody can show me how these map into either of the ontologies Chris has mentioned that would be great.)

For NA sequences:

Polymer Type
  - DNA
  - RNA
Molecule
  - chromosome
  - mRNA
  - tRNA
  - rRNA
  - oligo
Strandedness
  - single
  - double
Sequencing process
  - Genomic
  - EST
  - predicted
  - transcribed
  - what else?
Source
  - nucleus
  - mitochondria
  - plastid
  - plasmid
  - episome

Steve

From: Steve F. <sfi...@pc...> - 2005-02-03 12:53:07
Aaron encouraged me to take a second look at SO. (my first look came up dry, and i surmised that it was more "feature" oriented than "sequence" oriented.) the results are below.

But first, here are the terms in CBIL's SequenceType table:
  DNA
  RNA
  ds-DNA
  ss-DNA
  ss-RNA
  ds-RNA
  mRNA
  EST
  tRNA
  rRNA
  unknown
  predicted_mRNA
  virtual
  GSS
  oligonucleotide

To me, this is conflating multiple concepts: polymer type, strandedness, molecule.

But, I am now thinking that if we replaced that list with the following attributes and values, we would probably be just fine. SequenceType here *is* conflating multiple concepts, but in a way that i think will satisfy intuition and reasonable querying needs:

Singlestranded
  true
  false
SequenceType:
  chromosomal
  mRNA
  rRNA
  tRNA
  EST
  oligo
HasPieces (is virtual)
  true
  false

Now for the SO survey:

Polymer Type - no
  - DNA - no
  - RNA - no
Molecule - no
  - chromosome - SO:0000340
  - mRNA - SO:0000234
  - tRNA - SO:0000253
  - rRNA - SO:0000252
  - oligo - SO:0000696
Strandedness - no
  - single - no
  - double - no
Sequencing process - derived_from
  - Genomic - no
  - EST - SO:0000345
  - predicted - no
  - transcribed - no
  - what else?
Source - no
  - nucleus - no
  - mitochondria - no
  - plastid - no
  - plasmid - no
  - episome - no

Guess what, all the sequence types in my proposed list above are found in the SO:
  - chromosome - SO:0000340
  - mRNA - SO:0000234
  - tRNA - SO:0000253
  - rRNA - SO:0000252
  - oligo - SO:0000696
  - EST - SO:0000345

But, does that mean we should abolish the SequenceType table? If we do, then a sequence would point to the SO for its type. The advantage is that we will be out of the business of inventing yet another CV. The disadvantage is that now users have to wade through 400+ terms to find the 6 that we think are relevant.

????

steve

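The trade-off raised at the end of the message above (pointing at SO without making users wade through every term) could be handled with a small curated view over SO. The sketch below is a hypothetical illustration in Python: the SO accessions are the ones listed in the survey, while `RELEVANT_SO_TERMS`, `SequenceDescriptor` and `describe` are invented names, and the two boolean fields simply mirror the Singlestranded and HasPieces attributes proposed in the message.

```python
from dataclasses import dataclass

# SO accessions exactly as listed in the survey; the dict itself is an assumption.
RELEVANT_SO_TERMS = {
    "chromosome": "SO:0000340",
    "mRNA": "SO:0000234",
    "tRNA": "SO:0000253",
    "rRNA": "SO:0000252",
    "oligo": "SO:0000696",
    "EST": "SO:0000345",
}


@dataclass(frozen=True)
class SequenceDescriptor:
    """Hypothetical record: an SO term plus the two proposed boolean attributes."""
    so_id: str
    single_stranded: bool
    has_pieces: bool


def describe(type_name, single_stranded, has_pieces=False):
    """Build a descriptor, accepting only the curated subset of SO terms."""
    try:
        so_id = RELEVANT_SO_TERMS[type_name]
    except KeyError:
        allowed = ", ".join(sorted(RELEVANT_SO_TERMS))
        raise ValueError(f"{type_name!r} is not in the curated list: {allowed}")
    return SequenceDescriptor(so_id, single_stranded, has_pieces)


print(describe("mRNA", single_stranded=True))
```

With a view like this the stored type is still an SO accession, so nothing new is invented, but the choice presented to a user is limited to the six terms.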
From: Aaron J. M. <am...@pc...> - 2005-02-03 13:41:08
First, I would encourage you to look at SOFA, the subset of SO useful for sequence annotation (which is presumably what you're doing, right?)

I would argue that these extra "attributes" you don't find explicitly listed in SO are actually redundant to specific datatypes found in SO, i.e. these are encapsulated in the definition of each term.

On Feb 3, 2005, at 7:54 AM, Steve Fischer wrote:

> Polymer Type - no
>   - DNA - no
>   - RNA - no

an mRNA is RNA, not DNA; a chromosome is DNA, not RNA (unless it's a viral genome, etc).

> Strandedness - no
>   - single - no
>   - double - no

ditto; strandedness is inherent to the definition of a type

> Sequencing process - derived_from
>   - Genomic - no
>   - EST - SO:0000345
>   - predicted - no
>   - transcribed - no
>   - what else?

all of these are there, you just have to look for them in more biologically meaningful terms than what you have here. and "derived_from" is not a SO term, it's a relationship type.

> Source - no
>   - nucleus - no
>   - mitochondria - no
>   - plastid - no
>   - plasmid - no
>   - episome - no

ditto.

SO is/was designed to recapitulate biology (as best as possible), not the awkward attribute simplifications you seem to want to use (for instance, it seems in your scheme that I could have a sequence type that was DNA, mRNA, double stranded, predicted and episomal all at once). With SO, you find the specific name for the thing you have ...

To put it in a more generic context: with SO you have "integer", "unsigned integer", "long integer", "unsigned long integer", "signed integer", etc., related in a hierarchy of isa/derived_from/part_of relationships; you don't have "signed" and "unsigned", "long" and "short", etc. as singular terms. Now if you wanted to overlay a second ontology of term relationships (e.g. the "signedness" ontology), you could relate terms by these "attributes", and have the best of both worlds.

-Aaron

--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  am...@pc...
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA 19104-6017      fax:    215-746-6697

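The integer analogy in the message above can be made concrete with a toy example. This is a sketch under assumptions, written in Python: the `IS_A` and `SIGNEDNESS` tables and the helpers `ancestors` and `terms_with` are invented for illustration and are not taken from SO or from GUS. The first table holds specific terms related by is_a links; the second is the separate "signedness" overlay that relates the same terms by an attribute.

```python
# Toy is_a hierarchy: each term lists its direct parents (names are illustrative).
IS_A = {
    "integer": [],
    "signed integer": ["integer"],
    "unsigned integer": ["integer"],
    "long integer": ["integer"],
    "unsigned long integer": ["long integer", "unsigned integer"],
}

# Overlay: a second, independent mapping that tags terms with an attribute value.
SIGNEDNESS = {
    "signed integer": "signed",
    "unsigned integer": "unsigned",
    "unsigned long integer": "unsigned",
}


def ancestors(term, is_a=IS_A):
    """All terms reachable from `term` by following is_a links upward."""
    seen = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a[parent])
    return seen


def terms_with(value, overlay=SIGNEDNESS):
    """Query the overlay: every term tagged with the given attribute value."""
    return sorted(term for term, tag in overlay.items() if tag == value)


print(ancestors("unsigned long integer"))
print(terms_with("unsigned"))
```

Each item is typed with exactly one specific term, so contradictory combinations cannot be assembled from loose attributes, while attribute-style queries remain possible through the overlay.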
From: Angel P. <an...@ma...> - 2005-02-03 17:19:03
Aaron,

I can't agree more with your points. Please keep up the good suggestions.

I thought a while back, GUS developers had agreed to replace CBIL's ad-hoc sequence ontology terms and gene models (not ad-hoc, but derived from EpoDB?) with SO as the major ontology for defining them. Makes sense to just extend the use of SO within GUS for all sequence categorizations.

The drawback I can see is the added layer of logic that developers must now know. For instance the WDK queries *may* become more complex, and sequence annotation tools must now type sequences on defined terms, possibly a hierarchy of them. But since we are ditching the annot tool project that is not a concern. WDK query writers would need to know the semantics of the current attribute model anyway, so it is not a stretch for them to learn the semantics of SO and apply them. Also that new knowledge will be applicable to more than just a WDK query, so that is a win.

Two other advantages:
  1) since it is widely accepted, most annotation tools should eventually provide native support for SO, and
  2) we would more easily be able to share our gene models across GUS sites and other non-GUS sites.

Angel

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160
P: 215-573-3736
F: 215-573-9004