#306 SO-compliant annotation of mobile element integration sites


Hi all,

I have a question and/or suggestion regarding the annotation of mobile
element integration sites. We are developing de novo transposable
element (TE) annotation software that outputs its results in GFF3 format.
In particular we determine and output terminal repeats (direct or
inverted), target site duplications, ORFs, and homology-based features
such as pHMM matches in the region between the terminal repeats.
On the basis of these annotations we use our feature graph processing
infrastructure to further enhance and improve the results.

In order to ensure full compliance with the SO in terms of relationship
compatibility, I am now wondering how to integrate all these data into a
feature graph. The main problem is the question of where to put the
target_site_duplications (TSD, SO:0000434) flanking the TE insertion.
Obviously they are not part of the integrated element itself, but they
are -- in my opinion -- still connected to the particular integration
site and should be connected to it in some way. For now we are
outputting the TSDs and the element annotation as children of a
repeat_region feature, e.g. in the case of LTR retrotransposons:

repeat_region (SO:0000657)
 -- target_site_duplication (SO:0000434)
 -- LTR_retrotransposon (SO:0000186)
     -- long_terminal_repeat (SO:0000286)
     -- ...

which is not really SO compliant yet.
I see that there is a derives_from relationship between the
target_site_duplication and the transposable_element, but in GFF3 only
part_of relationships are the basis for parent-child assignments, so the
TSDs would not be part of the connected component representing one
integrated element.
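
To illustrate the connected-component issue, here is a minimal Python sketch. It is not taken from our software: the sequence name, coordinates, IDs and the helper names `parse_attributes` and `components` are all made up for this example, and it assumes each feature has at most one Parent. It groups features by following Parent attributes only, so TSDs written without a Parent end up in components of their own.

```python
from collections import defaultdict

# Hypothetical GFF3 records; seqid, source, coordinates and IDs are invented.
GFF3 = """\
seq1\tdemo\tLTR_retrotransposon\t105\t900\t.\t+\t.\tID=te1
seq1\tdemo\tlong_terminal_repeat\t105\t250\t.\t+\t.\tID=ltr1;Parent=te1
seq1\tdemo\tlong_terminal_repeat\t755\t900\t.\t+\t.\tID=ltr2;Parent=te1
seq1\tdemo\ttarget_site_duplication\t100\t104\t.\t+\t.\tID=tsd1
seq1\tdemo\ttarget_site_duplication\t901\t905\t.\t+\t.\tID=tsd2
"""


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if pair)


def components(gff3_text):
    """Group feature IDs into connected components using Parent links only."""
    parent_of = {}
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        attrs = parse_attributes(line.split("\t")[8])
        # Assumption: every feature has an ID and at most one Parent.
        parent_of[attrs["ID"]] = attrs.get("Parent")

    def root(feature_id):
        # Walk up the Parent chain until a feature without a Parent is reached.
        while parent_of.get(feature_id):
            feature_id = parent_of[feature_id]
        return feature_id

    groups = defaultdict(list)
    for feature_id in parent_of:
        groups[root(feature_id)].append(feature_id)
    return dict(groups)


print(components(GFF3))
```

With these records the element and its two LTRs form one component, while `tsd1` and `tsd2` each form a separate one, which is the situation described above.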

Is there an alternative? I could not find any SO type which represents
an insertion site in a structural way, capturing both the inserted
element and its effect on the integration site via a part_of
relationship, e.g.:

transposable_element_integration_site (new type)
 -- target_site_duplication (SO:0000434)
 -- LTR_retrotransposon (SO:0000186)
     -- long_terminal_repeat (SO:0000286)
     -- ...

or, respectively,

transposable_element_integration_site (new type)
 -- target_site_duplication (SO:0000434)
 -- terminal_inverted_repeat_element
     -- terminal_inverted_repeat
     -- ...

Another question is how to handle matches, e.g. protein_match
(SO:0000349) or ORFs (SO:0000236) correctly. As far as I can see, the
only way to have information about internal functional or coding regions
attached to a transposon annotation is via the transposable_element_gene
(SO:0000111) type. However, it is not always possible to reconstruct
genes from such matches, particularly in degenerated old insertions.
Nevertheless we would like to store the matches with the predicted
elements to allow later postprocessing (e.g. filtering etc.) on the
basis of these matches. Is there a preferred way to handle this?
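
As a rough idea of the kind of postprocessing meant here, the following Python sketch keeps only those candidates that carry at least one protein_match child. It assumes the matches are attached to the element via Parent, which is exactly the relationship whose SO compliance is in question; the feature dictionaries, IDs and the function name `elements_with_matches` are hypothetical.

```python
def elements_with_matches(features, match_type="protein_match", min_matches=1):
    """Return IDs of features with at least `min_matches` children of `match_type`."""
    counts = {}
    for feature in features:
        parent = feature.get("Parent")
        if feature["type"] == match_type and parent:
            counts[parent] = counts.get(parent, 0) + 1
    return [
        feature["ID"]
        for feature in features
        if feature["type"] != match_type
        and counts.get(feature.get("ID"), 0) >= min_matches
    ]


# Hypothetical candidates: te1 has one match, te2 has none.
features = [
    {"ID": "te1", "type": "LTR_retrotransposon"},
    {"ID": "m1", "type": "protein_match", "Parent": "te1"},
    {"ID": "te2", "type": "LTR_retrotransposon"},
]
print(elements_with_matches(features))  # ['te1']
```

A filter like this only works if the matches travel together with the element, which is why keeping them as unrelated top-level features is unattractive for us.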

For both cases (TSDs and matches) the obvious way would be to keep them
as top-level features, which would lead to a need to combine these
individual features again in our iterative pipeline. This pipeline
delivers and processes one connected component from e.g. an input GFF3
file at a time, which we naturally would prefer to be one complete
integrated element with all associated information.

I am very much looking forward to your input, thanks in advance!

Best regards,

This was followed on the mailing list by:
Hello Sascha, I'm happy to hear that other folks are wrestling with how to represent transposable elements in SO-compatible GFF3. I agree that SO needs some tweaking to capture the structural biology of transposable elements more correctly. In many eukaryotic genomes, the majority of the sequence features are transposable elements, and not being able to communicate these features in SO-compliant GFF3 is a frustration.

It seems to me that it should be possible to recognize that a transposable element is a genome that is itself a component of a parent (host) genome. In turn, the transposable element can serve as a host for a separate transposable element insertion (i.e. an LTR retrotransposon inserted into another LTR retrotransposon). Something like this would allow transposable elements to have genes/ORFs or alignments that could be annotated as children of an appropriately defined transposable element genome.

Since target site duplications are derived from the host genome that the mobile element inserts into, it makes sense that they remain derives_from features that are part_of the host genome that was inserted into. In some situations the host genome would be the parent host (e.g. rice or maize), while in others the host in the derives_from relationship would be another transposable element that was inserted into.
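
One way to picture this nested-host idea is the small Python sketch below. It is only an illustration under my own assumptions: the class names `Genome` and `TargetSiteDuplication`, the field names, the element names and the coordinates are all invented, and nothing here corresponds to existing SO terms or relationships.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Genome:
    """A host genome or a transposable element treated as a genome of its own."""
    name: str
    host: Optional["Genome"] = None  # None for the top-level host genome

    def lineage(self):
        """Names from this genome outwards to the top-level host."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.host
        return names


@dataclass
class TargetSiteDuplication:
    start: int
    end: int
    part_of: Genome           # the host sequence that was duplicated
    inserted_element: Genome  # the element whose insertion produced the TSD


# Hypothetical nested insertion: element B sits inside element A, which sits in rice.
rice = Genome("rice")
outer = Genome("LTR_retrotransposon_A", host=rice)
inner = Genome("LTR_retrotransposon_B", host=outer)
tsd = TargetSiteDuplication(10, 14, part_of=outer, inserted_element=inner)

print(inner.lineage())   # ['LTR_retrotransposon_B', 'LTR_retrotransposon_A', 'rice']
print(tsd.part_of.name)  # LTR_retrotransposon_A
```

Here the TSD belongs to the element that was inserted into, not to the top-level host, matching the second situation described above.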

