This is an advanced section. Please make sure that you understand the content of [describing_TE_landscapes] and [describing_TE_sequences] before proceeding. Here we describe how to simulate special cases that are frequently encountered in genomes, such as 5' truncated non-LTR insertions or solo LTR insertions.
For many LTR families, solo LTR insertions have been observed.
There are two ways to simulate solo LTRs.
First, the solo LTR may be provided as an additional entry in the TE sequence file.
For example, assume the following TE sequence file:

    >gypsy
    AAAAATCTCTCGGTCAAAAA

If AAAAA is the LTR of the gypsy family, a separate entry for the solo LTR may be added to the TE sequence file:

    >gypsy
    AAAAATCTCTCGGTCAAAAA
    >gypsy_LTR
    AAAAA

The following definitions may then be used in the pgd-file:

    gypsy=$1
    gypsy_LTR=$2
Alternatively, the sequence of the solo LTR may be created in the header of the pgd-file.
For example, given the TE sequence file

    >gypsy
    AAAAATCTCTCGGTCAAAAA

we may provide the following header of the pgd-file:

    gypsy=$1
    gypsy_LTR=gypsy[6..$] # yields gypsy_LTR="AAAAA"

The DSL expression gypsy[6..$] removes the sequence from base 6 to the end of the element (i.e. TCTCTCGGTCAAAAA is removed), leaving only the 5' LTR.
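To make the coordinate convention concrete, the following Python sketch mimics the deletion operator on plain strings. It is not part of SimulaTE: the function name `delete_range` is hypothetical, and the semantics (1-based, inclusive coordinates, with `^` for the first base and `$` for the last base) are inferred from the examples on this page.

```python
def delete_range(seq, start, end):
    """Return seq with the 1-based, inclusive range start..end removed.

    '^' stands for the first base and '$' for the last base, mirroring the
    notation used in the pgd-file examples (assumed semantics).
    """
    first = 1 if start == "^" else int(start)
    last = len(seq) if end == "$" else int(end)
    if not 1 <= first <= last <= len(seq):
        raise ValueError(f"invalid range {start}..{end} for length {len(seq)}")
    # Keep everything before 'first' and everything after 'last'.
    return seq[:first - 1] + seq[last:]


gypsy = "AAAAATCTCTCGGTCAAAAA"
gypsy_ltr = delete_range(gypsy, 6, "$")
print(gypsy_ltr)  # AAAAA
```

Such a helper can be handy for checking the expected result of a deletion expression before running a simulation.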
Many non-LTR insertions have 5' truncated sequences.
We recommend defining the different 5' truncated TEs in the header of the pgd-file.
For example, given the following TE sequence file

    >I-element
    AAAAATTTTTCCCCCGGGGG

we may provide the following header of the pgd-file to generate multiple 5' truncated copies of the I-element:

    ielement=$1
    trunc1=ielement[^..5] # yields trunc1="TTTTTCCCCCGGGGG"
    trunc2=ielement[^..10] # yields trunc2="CCCCCGGGGG"
    trunc3=ielement[^..15] # yields trunc3="GGGGG"
This code generates three 5' truncated I-element insertions trunc1, trunc2, trunc3.
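When many truncated variants are needed, the header lines can be generated rather than typed by hand. The following Python sketch prints the header shown above; the function `truncation_header` and its parameters are hypothetical helpers for illustration and are not part of SimulaTE.

```python
def truncation_header(name, length, step, source="$1"):
    """Build pgd-file header lines for a series of 5' truncations.

    name   -- variable name for the full-length element
    length -- length of the full-length element in bases
    step   -- number of additional bases removed per truncated variant
    source -- reference to the entry in the TE sequence file
    """
    lines = [f"{name}={source}"]
    # Each variant removes bases 1..cut; cut never reaches the full length.
    for i, cut in enumerate(range(step, length, step), start=1):
        lines.append(f"trunc{i}={name}[^..{cut}]")
    return lines


print("\n".join(truncation_header("ielement", length=20, step=5)))
```

For the 20 bp I-element above with a step of 5, this prints the `ielement`, `trunc1`, `trunc2`, and `trunc3` definitions without the trailing comments.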
Many TIR insertions are internally truncated (e.g. the KP-element is an internally truncated P-element).
For example, given the following TE sequence file

    >P-element
    AACCTTGGGGTTCCAA

we may provide the following header of the pgd-file to generate multiple internally truncated copies of the P-element:

    pelement=$1
    deletion1=pelement[7..10] # yields deletion1="AACCTTTTCCAA"
    deletion2=pelement[5..12] # yields deletion2="AACCCCAA"
    deletion3=pelement[3..14] # yields deletion3="AAAA"
For diverged TE sequences we may distinguish several cases.
Let's assume that gypsy2 is a novel family that diverged from the gypsy family.
We recommend defining the diverged gypsy2 family in the header of the pgd-file.
For example:

    gypsy=$1
    gypsy2=gypsy+10%
This will generate the gypsy2 family with 10% base substitutions compared to the gypsy family. For more complex patterns of divergence (e.g. involving many indels), we recommend providing a separate entry for the family in the TE sequence file (see the solo LTR example above).
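The effect of a divergence such as `+10%` can be illustrated with a small Python sketch. How SimulaTE actually selects the substituted sites is not described here, so the sketch assumes that exactly `round(fraction * length)` distinct positions are hit and that each is replaced by a different base chosen uniformly at random; the function name `diverge` is hypothetical.

```python
import random


def diverge(seq, fraction, rng):
    """Return a copy of seq in which a given fraction of bases is substituted.

    Assumption: exactly round(fraction * len(seq)) distinct positions are hit
    and each is replaced by a different base chosen uniformly at random.
    Expects an upper-case sequence over the alphabet ACGT.
    """
    bases = list(seq)
    n_sites = round(fraction * len(bases))
    for pos in rng.sample(range(len(bases)), n_sites):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
gypsy = "AAAAATCTCTCGGTCAAAAA"
gypsy2 = diverge(gypsy, 0.10, rng)
mismatches = sum(a != b for a, b in zip(gypsy, gypsy2))
print(mismatches)  # 2 of 20 bases differ, i.e. 10%
```

Because only substitutions are introduced, the diverged sequence keeps the length of the original, which is why indel-rich divergence needs a separate entry in the TE sequence file.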
In other cases, insertions of the same family are diverged from each other. For example, Ine-1 is a highly diverged TE family of Drosophila melanogaster in which individual insertions are frequently diverged by more than 10% (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-2-r39).
We recommend defining the sequences of diverged family members directly in the pgd-file, as shown in the following example:

    ine1=$1
    2 ine1+5% ine1+5% * * * *
    6 ine1+5% * * * ine1+5% *
    9 * * * * * ine1+5%
Note that in this example five different sequences of ine1 are generated, each having 5% base substitutions relative to the original ine1. Hence randomly picked pairs of ine1 insertions (irrespective of the insertion position) will have a total divergence of approximately 10% (5% from each insertion).
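The pairwise figure can be checked with a short calculation. The Python sketch below assumes that substitution sites are chosen independently in each copy and that the replacement base is uniform over the three alternatives; these are modelling assumptions for illustration, not a description of SimulaTE internals, and the function name is hypothetical.

```python
def expected_pairwise_divergence(p):
    """Expected fraction of differing sites between two copies that each
    carry independent substitutions at a fraction p of their sites.

    Assumption: sites are hit independently in each copy and the replacement
    base is uniform over the three alternatives.
    """
    one_hit = 2 * p * (1 - p)    # site substituted in exactly one copy
    both_hit = p * p * (2 / 3)   # substituted in both copies, to different bases
    return one_hit + both_hit


print(round(expected_pairwise_divergence(0.05), 4))  # 0.0967
```

Under these assumptions the expected pairwise divergence is slightly below 10%, because a site substituted in both copies is counted only once, or not at all if both copies received the same base.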