[Gusdev-gusdev] GUS 3.0 schema changes: ASCII art redux

Arnaud-

> A quick question regarding evidences, you're mentioning that the 
> Evidence table will connect Features and Experimental evidences. Where 
> will the latter be stored ?

Hopefully others will chime in if I get this wrong...  I believe that the
relevant tables are DoTS.Comments (for free text notes/comments entered by
an annotator) and SRes.BibliographicReference (for published experiments.)
However, I don't think that we have a generic table to represent unpublished
laboratory experiments in a structured way.  Perhaps we need some use cases
here?  We do have your new table for representing RNAi constructs, but I
don't think that we have a corresponding table to represent the actual
RNAi experiment.  Do we need/want such a table (either for RNAi experiments
or in general) and, if so, how detailed does it need to be?

> Here two examples of transposable elements annotations, one is from 
> Tbrucei, the other one is a common one in procaryote genomes.
> 
> The first one in the inclusion of a INGI transposon  within an ORF, the 
> RHS gene. The transposon includes two RIME flanking repeats and another ORF.
> So in GUS, the INGI transposon could be stored as a transposable element 
> feature, attached to a RHS gene feature. The transposable element 
> feature will have three sub features, a gene feature, tagged as a 
> pseudo-gene and two repeat features, which repeat_type is RIME and with 
> a given location.

So in the "current" schema (meaning that I'm assuming we have only a single
repeat-related view, called RepeatRegionNAFeature, which is the NA equivalent
of RepeatRegionAAFeature), the picture would look like this:

                           <DoTS::GenomicSequence>
                           ^      ^      ^       ^
                           |      |      |       |
     <DoTS::GeneFeature (RHS)>    |      |       |
               ^                  |      |       |
               |                  |      |       |
    <DoTS::TransposableElement (INGI)>   |       |
    ^                ^                   |       |
    |                |                   |       |
    |  2 x <DoTS::RepeatRegionNAFeature (RIME)>  |
    |                                            |
    ------------------------<DoTS::GeneFeature (pseudo)>

-For each feature the leftmost arrow shows the parent_id, and the rightmost
  arrow shows the na_sequence_id.
-All of the features will have a location specified in terms of the
  genomic sequence (because that's what their na_sequence_id references.)
-I have to create 2 RepeatRegionNAFeatures under my definition, because
  the RIME repeats are not adjacent to one another.
-Presumably the transposable element is contained in the coding region
  of a single exon, so the parent feature could be an ExonFeature instead
  of a GeneFeature.
-Note that parent_id is typically used to indicate a part-whole
  relationship, in the sense that the part *must* have a corresponding
  whole (e.g. Exon to Gene).  In the above picture and our discussions
  on this topic we've generalized its usage to also encompass the
  concept that one feature "happens to be" part of another, i.e.,
  that its NALocation is strictly within the bounds of its parent's
  NALocation, but that this need not be the case by definition.
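
To make the two references concrete, here is a minimal Python sketch of the
picture above. It is only an illustration: the Feature class, the feature ids
and all of the coordinates are made up, and the start/end fields stand in for
whatever the NALocation rows would actually hold. The check simply tests
whether a child's location falls inside its parent's location on the same
sequence.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Feature:
    """Hypothetical stand-in for a feature row plus its NALocation."""
    feature_id: int
    name: str
    na_sequence_id: int              # rightmost arrow: the genomic sequence
    start: int                       # illustrative coordinates, not real data
    end: int
    parent_id: Optional[int] = None  # leftmost arrow: the parent feature


def is_within_parent(feature: Feature, features: Dict[int, Feature]) -> bool:
    """True if the feature's location lies inside its parent's location.

    Shared endpoints are accepted here; use < instead of <= if "strictly
    within" is meant literally. Features with no parent are accepted.
    """
    if feature.parent_id is None:
        return True
    parent = features[feature.parent_id]
    return (
        feature.na_sequence_id == parent.na_sequence_id
        and parent.start <= feature.start
        and feature.end <= parent.end
    )


GENOMIC_SEQ = 1  # made-up na_sequence_id for the DoTS::GenomicSequence
rows = [
    Feature(10, "GeneFeature (RHS)", GENOMIC_SEQ, 1000, 9000),
    Feature(11, "TransposableElement (INGI)", GENOMIC_SEQ, 3000, 8000, 10),
    Feature(12, "RepeatRegionNAFeature (RIME)", GENOMIC_SEQ, 3000, 3250, 11),
    Feature(13, "RepeatRegionNAFeature (RIME)", GENOMIC_SEQ, 7750, 8000, 11),
    Feature(14, "GeneFeature (pseudo)", GENOMIC_SEQ, 3251, 7749, 11),
]
features = {f.feature_id: f for f in rows}

for f in rows:
    print(f.name, "within parent:", is_within_parent(f, features))
```

Because the containment is not guaranteed by definition in the generalized
usage, a check like this would have to be run explicitly rather than assumed.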

And I believe your proposal is for something that looks more like this:

                           <DoTS::GenomicSequence>
                           ^      ^      ^    ^  ^
                           |      |      |    |  |
     <DoTS::GeneFeature (RHS)>    |      |    |  |
               ^                  |      |    |  |
               |                  |      |    |  |
    <DoTS::TransposableElement (INGI)>   |    |  |
    ^                ^                   |    |  |
    |                |                   |    |  |
    |           <DoTS::RepeatRegionNAFeature> |  |
    |                   ^                     |  |
    |                   |                     |  |
    |          2 x <DoTS::RepeatFeature (RIME)>  |
    |                                            |
    |                                            |
    ------------------------<DoTS::GeneFeature (pseudo)>

In other words, the RepeatRegionNAFeature serves only to group the two RIME
repeats (which aren't even immediately adjacent to one another.)  Is this
what you had in mind?  Or did you mean to make the RepeatRegionNAFeature a
child of the GeneFeature and then make the TransposableElement a child of
the RepeatRegionNAFeature?  I'm just not clear on your definition of "repeat
region".  Specifically, can a repeat region contain things that are not
repeats, and can it contain more than one type of repeat?  And, if so, how
does one assign bounds to the region in a non-arbitrary way?
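
To illustrate the bounds question, here is a small Python sketch under one
possible (and assumed) rule: the grouping region spans from the start of its
first child repeat to the end of its last. The coordinates are made up and the
function names are hypothetical. With two non-adjacent RIME repeats, most of
the resulting region is not repeat sequence at all.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive, on the genomic sequence


def enclosing_bounds(children: List[Interval]) -> Interval:
    """Smallest interval that covers every child interval."""
    return min(s for s, _ in children), max(e for _, e in children)


def covered_length(children: List[Interval]) -> int:
    """Number of positions covered by the children (assumed non-overlapping)."""
    return sum(e - s + 1 for s, e in children)


rime_repeats = [(3000, 3250), (7750, 8000)]  # made-up coordinates
region = enclosing_bounds(rime_repeats)
region_length = region[1] - region[0] + 1
non_repeat = region_length - covered_length(rime_repeats)

print("region bounds:", region)
print("positions in region that are not repeat:", non_repeat)
```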

> The second example is nested transposable elements in procaryote 
> genomes, ie insertion of a transposable element within another one. Each 
> transposable element can have a similar structure including the 
> following sub features : two flanking Inverted Repeats, a gene and its 
> promoter and/or a promoter, functional on the other strand !

I won't try to draw the pictures for this one!  In both the current schema
and your proposal I think we have the problem that we have no way of
explicitly representing the relationship between the two flanking inverted
repeats.  Apart from that, however, I think that we can handle this case
just as well as the first.  You have to create quite a few features, but
I don't think there's any way to avoid that unless we want to come up with
some "exemplar" transposons and use them to classify the instances we
encounter.  The promoter/gene that's functional on the opposite strand
would be represented simply as reverse-strand features (i.e., we'd set
the is_reversed flag in their NALocations, but still use their parent_ids
to indicate their place in the nested repeat structure.)
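
As a rough Python sketch of that last point (all names, ids and the class
itself are hypothetical, and is_reversed is shown on the feature only for
brevity, where in the schema it would sit in the NALocation): strand and
nesting are recorded independently, so a reverse-strand promoter still
resolves to its place in the nested structure through parent_id.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Feature:
    feature_id: int
    name: str
    parent_id: Optional[int]
    is_reversed: bool  # strand flag; stands in for the NALocation column


def ancestry(feature_id: int, features: Dict[int, Feature]) -> List[str]:
    """Names from the given feature up to the top-level feature."""
    names: List[str] = []
    current: Optional[int] = feature_id
    while current is not None:
        feature = features[current]
        names.append(feature.name)
        current = feature.parent_id
    return names


features = {f.feature_id: f for f in [
    Feature(1, "outer transposable element", None, False),
    Feature(2, "inner transposable element", 1, False),
    Feature(3, "inner gene", 2, False),
    Feature(4, "inner promoter (opposite strand)", 2, True),
]}

print(ancestry(4, features), "reversed:", features[4].is_reversed)
```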

> So if there is no repeat feature, the flanking repeats will have to be 
> annotated part of the transposable element feature.
> Let me know what you think about these.

But shouldn't they be part of the transposable element feature?  I don't
know the details of this specific type of transposon, but are you trying
to make the distinction between: 1) the core transposon, i.e., the machinery
that enables that part of the genome (encompassing both the machinery and
perhaps some variable-sized flanking regions) to move around and 2) the
"transposed" element, i.e. the core machinery plus whatever flanking
regions happened to be carried along on the element's most recent trip
(the one that brought it to its current location)?

>>-Modified DoTS.ProteinProperty table to reference ProteinPropertyType
>> One question I have regarding these tables is how will the units be specified?
>> Should I make the "property_value" column a varchar2 column?  It may have had 
>> this type originally, and I might have changed it without considering the 
>> consequences.  One option would be to specify in the ProteinPropertyType table
>> what units are to be used, though this is clumsy if there is more than one
>> choice of units for a given property.
>>
> Whatever the unit they're in, they should all be numbers (some would be 
> integer) so we can go for the "number" data type but float or varchar 
> could also be fine!

Right, but the question is how does somebody querying the table know what
a mass of "25" means?  Are molecular masses always expressed in the same
units, no matter what?  My recollection is that you can sometimes have
some pretty big polypeptides, but I don't know what the convention is.
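
Here is a minimal Python sketch of the option mentioned earlier, where the
units are recorded once per property type. The "units" column, the class
layout and the conversion table are all assumptions for illustration, not the
current schema. The point is only that a bare "25" becomes interpretable once
the type row says which unit applies.

```python
from dataclasses import dataclass
from typing import Dict

# Conversion factors to daltons; the set of accepted units is an assumption.
TO_DALTONS: Dict[str, float] = {"Da": 1.0, "kDa": 1000.0}


@dataclass
class ProteinPropertyType:
    protein_property_type_id: int
    name: str
    units: str  # hypothetical column: one fixed unit per property type


@dataclass
class ProteinProperty:
    protein_property_type_id: int
    property_value: float  # numeric, as suggested in the thread


def mass_in_daltons(prop: ProteinProperty,
                    types: Dict[int, ProteinPropertyType]) -> float:
    """Interpret a stored mass using the unit declared on its property type."""
    unit = types[prop.protein_property_type_id].units
    return prop.property_value * TO_DALTONS[unit]


types = {1: ProteinPropertyType(1, "molecular mass", "kDa")}
stored = ProteinProperty(protein_property_type_id=1, property_value=25.0)

print(mass_in_daltons(stored, types))
```

This only works if each property type commits to a single unit, which is
exactly the clumsiness noted above when more than one choice of units exists.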

> I reckon ReplicationOriginFeature would make more sense

OK, I'll make this change.

Jonathan

-- 
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021
215-573-3115