[Gusdev-gusdev] Re: GUS 3.0 schema changes: ASCII art redux

Quoting Jonathan Crabtree <cra...@pc...>:

> 
> Arnaud-
> 
> > Here are two examples of transposable element annotations: one is from 
> > Tbrucei, the other is a common one in procaryote genomes.
> > 
> > The first one is the inclusion of an INGI transposon within an ORF, the 
> > RHS gene. The transposon includes two RIME flanking repeats and another
> > ORF.
> > So in GUS, the INGI transposon could be stored as a transposable element 
> > feature, attached to a RHS gene feature. The transposable element 
> > feature will have three sub features: a gene feature, tagged as a 
> > pseudo-gene, and two repeat features, whose repeat_type is RIME and which 
> > each have a given location.
> 
> So in the "current" schema (meaning that I'm assuming we have only a single
> repeat-related view, called RepeatRegionNAFeature, which is the NA equivalent
> of RepeatRegionAAFeature), the picture would look like this:
> 
>                            <DoTS::GenomicSequence>
>                            ^      ^      ^       ^
>                            |      |      |       |
>      <DoTS::GeneFeature (RHS)>    |      |       |
>                ^                  |      |       |
>                |                  |      |       |
>     <DoTS::TransposableElement (INGI)>   |       |
>     ^                ^                   |       |
>     |                |                   |       |
>     |  2 x <DoTS::RepeatRegionNAFeature (RIME)>  |
>     |                                            |
>     ------------------------<DoTS::GeneFeature (pseudo)>
> 
> -For each feature the leftmost arrow shows the parent_id, the rightmost
>   arrow shows the na_sequence_id.
> -All of the features will have a location specified in terms of the
>   genomic sequence (because that's what their na_sequence_id references.)
> -I have to create 2 RepeatRegionNAFeatures under my definition, because
>   the RIME repeats are not adjacent to one another.
> -Presumably the transposable element is contained in the coding region
>   of a single exon, so the parent feature could be an ExonFeature instead
>   of a GeneFeature.
> -Note that parent_id is typically used to indicate a part-whole
>   relationship, in the sense that the part *must* have a corresponding
>   whole (e.g. Exon to Gene).  In the above picture and our discussions
>   on this topic we've generalized its usage to also encompass the
>   concept that one feature "happens to be" part of another i.e.,
>   that its NALocation is strictly within the bounds of its parent's
>   NALocation, but that this need not be the case by definition.
> 
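The notes above can be sketched in code. This is only an illustration, not GUS code: the `Feature` class, the ids, and all coordinates are invented for the example, and containment is checked inclusively. It records each feature's parent_id and na_sequence_id as in the picture and tests whether a child's location "happens to be" within its parent's.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Feature:
    feature_id: int               # hypothetical id
    view: str                     # view name as drawn in the picture
    label: str
    na_sequence_id: int           # the genomic sequence the location refers to
    start: int                    # illustrative coordinates, not real data
    end: int
    parent_id: Optional[int] = None


def within_parent(child: Feature, by_id: Dict[int, Feature]) -> bool:
    """True if the child's location lies inside its parent's location."""
    if child.parent_id is None:
        return True
    parent = by_id[child.parent_id]
    return (child.na_sequence_id == parent.na_sequence_id
            and parent.start <= child.start
            and child.end <= parent.end)


GENOMIC = 1  # stands in for the <DoTS::GenomicSequence> row
rows = [
    Feature(10, "DoTS::GeneFeature", "RHS", GENOMIC, 1000, 9000),
    Feature(11, "DoTS::TransposableElement", "INGI", GENOMIC, 3000, 8000, 10),
    Feature(12, "DoTS::RepeatRegionNAFeature", "RIME", GENOMIC, 3000, 3250, 11),
    Feature(13, "DoTS::RepeatRegionNAFeature", "RIME", GENOMIC, 7750, 8000, 11),
    Feature(14, "DoTS::GeneFeature", "pseudo", GENOMIC, 3250, 7750, 11),
]
by_id = {f.feature_id: f for f in rows}

# Features whose location falls outside their parent's location.
violations = [f.feature_id for f in rows if not within_parent(f, by_id)]
```

With these made-up coordinates `violations` is empty; because the containment is not true by definition, a loader would have to run a check like this rather than assume it.
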
> And I believe your proposal is for something that looks more like this:
> 
>                            <DoTS::GenomicSequence>
>                            ^      ^      ^    ^  ^
>                            |      |      |    |  |
>      <DoTS::GeneFeature (RHS)>    |      |    |  |
>                ^                  |      |    |  |
>                |                  |      |    |  |
>     <DoTS::TransposableElement (INGI)>   |    |  |
>     ^                ^                   |    |  |
>     |                |                   |    |  |
>     |           <DoTS::RepeatRegionNAFeature> |  |
>     |                   ^                     |  |
>     |                   |                     |  |
>     |          2 x <DoTS::RepeatFeature (RIME)>  |
>     |                                            |
>     |                                            |
>     ------------------------<DoTS::GeneFeature (pseudo)>
> 

My proposal is this representation without the repeat region feature. I see the
repeat region feature as a way to cluster together a sequence, whatever that
sequence is (a single base or more), that is repeated X times; it would not be
used in this situation.

> In other words, the RepeatRegionNAFeature serves only to group the two RIME
> repeats (which aren't even immediately adjacent to one another.)  Is this
> what you had in mind? 

I don't think we need to group them with a repeat region feature, as the
transposable element would do it.
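
A small sketch of that point, with invented ids and a hypothetical helper `repeats_of`: if both RIME repeats carry the transposable element as their parent_id, they can be retrieved as a group through that parent alone, with no intermediate repeat region row.

```python
from collections import defaultdict

# (feature_id, view, label, parent_id); ids are made up for illustration.
rows = [
    (1, "DoTS::GeneFeature", "RHS", None),
    (2, "DoTS::TransposableElement", "INGI", 1),
    (3, "DoTS::RepeatFeature", "RIME", 2),
    (4, "DoTS::RepeatFeature", "RIME", 2),
    (5, "DoTS::GeneFeature", "pseudo", 2),
]

# Index the children of each feature by parent_id.
children = defaultdict(list)
for feature_id, view, label, parent_id in rows:
    if parent_id is not None:
        children[parent_id].append((feature_id, view, label))


def repeats_of(element_id, repeat_type):
    """Ids of repeat features of the given type directly under a feature."""
    return [feature_id
            for feature_id, view, label in children.get(element_id, [])
            if view == "DoTS::RepeatFeature" and label == repeat_type]


rime_ids = repeats_of(2, "RIME")
```
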

> Or did you mean to make the RepeatRegionNAFeature a
> child of the GeneFeature and then make the TransposableElement a child of
> the RepeatRegionNAFeature?  I'm just not clear on your definition of
> "repeat region".  Specifically, can a repeat region contain things that are
> not repeats,

Yes ! a gene for example !! A repeat region would be used to cluster tandemly
repeated genes. But this should be fine as long as a gene feature can be
attached to a repeat region.

> and can it contain more than one type of repeat?

I think we agree on only one type of repeat unit per region; if a region has
more, we would nest the repeat region features. We didn't come up with a repeat
region made of interlaced repeat units, which would require making the schema
more generic.

> And, if so, how
> does one assign bounds to the region in a non-arbitrary way?
> 
> > The second example is nested transposable elements in procaryote 
> > genomes, i.e., insertion of a transposable element within another one. Each 
> > transposable element can have a similar structure including the 
> > following sub features: two flanking Inverted Repeats, a gene and its 
> > promoter, and/or a promoter that is functional on the other strand!
> 
> I won't try to draw the pictures for this one!  In both the current schema
> and your proposal I think we have the problem that we have no way of
> explicitly representing the relationship between the two flanking inverted
> repeats. 

But we don't need to !?

> Apart from that, however, I think that we can handle this case
> just as well as the first.  You have to create quite a few features, but
> I don't think there's any way to avoid that unless we want to come up with
> some "exemplar" transposons and use them to classify the instances we
> encounter.  The promoter/gene that's functional on the opposite strand
> would be represented simply as reverse-strand features (i.e., we'd set
> the is_reversed flag in their NALocations, but still use their parent_ids
> to indicate their place in the nested repeat structure.)
> 
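To make the reverse-strand point concrete, here is a rough sketch. The classes, ids, and coordinates are all invented, and only the is_reversed and parent_id names come from the discussion. Strand lives in the location, while parent_id alone carries the feature's place in the nested structure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NALocation:
    start: int                    # illustrative coordinates only
    end: int
    is_reversed: bool = False     # True for reverse-strand features


@dataclass
class Feature:
    feature_id: int               # hypothetical id
    label: str
    location: NALocation
    parent_id: Optional[int] = None


def nesting_depth(feature: Feature, by_id: Dict[int, Feature]) -> int:
    """Number of parent_id links between a feature and its top-level ancestor."""
    depth = 0
    while feature.parent_id is not None:
        feature = by_id[feature.parent_id]
        depth += 1
    return depth


rows = [
    Feature(1, "outer transposable element", NALocation(100, 5000)),
    Feature(2, "inner transposable element", NALocation(1500, 3500), 1),
    Feature(3, "inverted repeat (left)", NALocation(1500, 1540), 2),
    Feature(4, "inverted repeat (right)", NALocation(3460, 3500), 2),
    Feature(5, "gene", NALocation(1600, 2800), 2),
    # Promoter functional on the opposite strand: same parent, reversed flag.
    Feature(6, "promoter", NALocation(3000, 3100, is_reversed=True), 2),
]
by_id = {f.feature_id: f for f in rows}

promoter = by_id[6]
promoter_depth = nesting_depth(promoter, by_id)
```

Here the promoter sits two levels below the outer element regardless of its strand; the strand only changes the flag on its location.
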
> > So if there is no repeat feature, the flanking repeats will have to be 
> > annotated as part of the transposable element feature.
> > Let me know what you think about these.
> 
> But shouldn't they be part of the transposable element feature?  I don't
> know the details of this specific type of transposon, but are you trying
> to make the distinction between: 1) the core transposon, i.e., the machinery
> that enables that part of the genome (encompassing both the machinery and
> perhaps some variable-sized flanking regions) to move around and 2) the
> "transposed" element, i.e. the core machinery plus whatever flanking
> regions happened to be carried along on the element's most recent trip
> (the one that brought it to its current location.)?
> 
I think we want to represent a transposable element in a given context, i.e., at
a given location, because the insertion may have consequences such as
(in)activating a gene or shifting the frame of a gene.

A core transposon should be represented as an entity on its own like genes are.

> 
> Jonathan
> 
> -- 
> Jonathan Crabtree
> Center for Bioinformatics, University of Pennsylvania
> 1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021
> 215-573-3115
> 
> 

Arnaud