Re: [Gusdev-gusdev] Sequence Type controlled vocab


ed-

i'm with ya on this!  i've been trying to sort that out over the past 
couple of days.

In general, plugins must ask their users for mappings when there is an 
issue like this. 

We are moving towards SO and away from the SequenceTypes table.

The most general solution is for plugins to take a CV mapping file.
In this case, the plugin should have an argument --seqTypeMapFile.  The
file is a simple tab-delimited format:

Genomic DNA    SO:0000340
DNA            SO:0000340
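
A minimal sketch of reading such a file, in Python for illustration only (the plugin itself need not be written this way). The function name load_seq_type_map is hypothetical, and the sketch assumes exactly two tab-separated columns per line, input type first and SO id second:

```python
import csv
import io


def load_seq_type_map(handle):
    """Read 'input type <TAB> SO id' lines into a dict keyed by input type."""
    mapping = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip blank lines so a trailing newline does not cause an error.
        if not row or not any(cell.strip() for cell in row):
            continue
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated columns, got {len(row)}"
            )
        input_type, so_id = (cell.strip() for cell in row)
        mapping[input_type] = so_id
    return mapping


# The two example lines from the message, with a real tab between the columns.
example = "Genomic DNA\tSO:0000340\nDNA\tSO:0000340\n"
seq_type_map = load_seq_type_map(io.StringIO(example))
print(seq_type_map["Genomic DNA"])
```

Several input names can point at the same SO id, which is what keeps near-duplicate entries out of the database.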

If we really need to handle strandedness, then we need a third column in
the middle.  We are assuming (Aaron) that any combination of sequence
type (and strandedness) will resolve into one SO term.

If the plugin is failure tolerant (i.e., tracks the work it has done) then
we simply reject records with inputTypes not mentioned in the file.  The 
user augments the mapping file accordingly and re-runs.  Otherwise, the 
plugin should scan the input and validate against the mapping file 
before it begins to run for real.
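
The pre-scan for the non-tolerant case could look roughly like this. Again a Python sketch under assumptions: find_unmapped_types is a hypothetical name, and the list of declared types stands in for whatever the plugin extracts from its input file:

```python
def find_unmapped_types(input_types, seq_type_map):
    """Return the sorted, de-duplicated input types missing from the mapping."""
    return sorted({t for t in input_types if t not in seq_type_map})


# Mapping as it would be loaded from the --seqTypeMapFile file.
seq_type_map = {"Genomic DNA": "SO:0000340", "DNA": "SO:0000340"}

# Hypothetical sequence types declared by the records in an input file.
declared = ["DNA", "genomic dna", "mRNA", "DNA"]

missing = find_unmapped_types(declared, seq_type_map)
if missing:
    # Report everything at once so the user can fix the file in one pass.
    print("add these to the mapping file and re-run:", ", ".join(missing))
```

The lookup here is exact, so "genomic dna" does not match "Genomic DNA". Whether matching should ignore case is a separate decision for the plugin.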

Also, since we are moving to the SO for sequence type, we should 
probably introduce values into the SequenceType table (for our project) 
that are SO ids, like SO:0000340 (for chromosomal) rather than "genomic" 
or whatever.   Then, when we lose the SequenceType table, we'll know how 
to migrate.

steve

Ed Robinson wrote:

>On a more prosaic level, matching the sequence type is a big issue for the LoadAnnotatedSeqs plugin we are writing.  Basically, the plugin has to do a check on the molecule type declared by the input file and match it to an entry in the dots.SequenceType table.  The problem is that nobody is consistent with how they name their sequence types in TIGR, EMBL or Genbank.  I've handled this by allowing the plugin to throw an error if it can't find the sequence type in GUS.  Then, I enter a new entry as a subtype of a more basic sequence (Genomic DNA, = DNA, unknown strandedness).  Of course, over time, this means you end up with a ton of different entries describing the same thing.
>
>So, not only do we need to settle what our VCs are, but how we programmatically handle what is not in them.  This is even worse for review status.  A lot of older plugins make assumptions about what is in the database and how it relates to their data source.  LoadGeneFeaturesFromXML is a good example.  It assumes values of 0 and 1 for specific annotation values used in the TigrXML.
>
>Any suggestions on standard business rules for how plugins match external data to internal VCs?
>
>-ed
>
>  
>
>>From: "Aaron J. Mackey" <am...@pc...>
>>Date: 2005/02/03 Thu AM 08:40:59 EST
>>To: Steve Fischer <sfi...@pc...>
>>CC: gusdev-gusdev <gus...@li...>, 
>>	Chris Stoeckert <sto...@pc...>
>>Subject: Re: [Gusdev-gusdev] Sequence Type controlled vocab
>>
>>First, I would encourage you to look at SOFA, the subset of SO useful 
>>for sequence annotation (which is presumably what you're doing, right?)
>>
>>I would argue that these extra "attributes" you don't find explicitly 
>>listed in SO are actually redundant to specific datatypes found in SO, 
>>i.e. these are encapsulated in the definition of each term.
>>
>>On Feb 3, 2005, at 7:54 AM, Steve Fischer wrote:
>>
>>    
>>
>>>Polymer Type   - no
>>> - DNA  - no
>>> - RNA - no
>>>      
>>>
>>an mRNA is RNA, not DNA; a chromosome is DNA, not RNA (unless it's a 
>>viral genome, etc).
>>
>>    
>>
>>>Strandedness - no
>>>- single  - no
>>>- double  - no
>>>      
>>>
>>ditto; strandedness is inherent to the definition of a type
>>
>>    
>>
>>>Sequencing process   - derived_from
>>>- Genomic - no
>>>- EST  -  SO:0000345
>>>- predicted - no
>>>- transcribed - no
>>>- what else?
>>>      
>>>
>>all of these are there, you just have to look for them in more 
>>biologically meaningful terms than what you have here.  and 
>>"derived_from" is not a SO term, it's a relationship type.
>>
>>    
>>
>>>Source - no
>>>- nucleus  - no
>>>- mitochondria - no
>>>- plastid  - no
>>>- plasmid  - no
>>>- episome  - no
>>>      
>>>
>>ditto.
>>
>>SO is/was designed to recapitulate biology (as best as possible), not 
>>the awkward attribute simplifications you seem to want to use (for 
>>instance, it seems in your scheme that I could have a sequence type 
>>that was DNA, mRNA, double stranded, predicted and episomal all at 
>>once).  With SO, you find the specific name for the thing you have ...
>>
>>To put it in a more generic context: with SO you have "integer", 
>>"unsigned integer", "long integer", "unsigned long integer", "signed 
>>integer", etc., related in a hierarchy of isa/derived_from/part_of 
>>relationships; you don't have "signed" and "unsigned", "long" and 
>>"short", etc. as singular terms.  Now if you wanted to overlay a second 
>>ontology of term relationships (e.g. the "signedness" ontology), you 
>>could relate terms by these "attributes", and have the best of both 
>>worlds.
>>
>>-Aaron
>>
>>--
>>Aaron J. Mackey, Ph.D.
>>Dept. of Biology, Goddard 212
>>University of Pennsylvania       email:  am...@pc...
>>415 S. University Avenue         office: 215-898-1205
>>Philadelphia, PA  19104-6017     fax:    215-746-6697
>>
>>
>>
>>
>>    
>>
>
>Ed Robinson
>255 Deerfield Rd
>Bogart, GA 30622
>(706)425-9181
>
>
>  
>