SimulaTE Wiki

Brought to you by: rokofler

special_use_cases

Introduction
Special use cases

Introduction

This is an advanced section. Please make sure that you understand the content of [describing_TE_landscapes] and [describing_TE_sequences] before proceeding. Here we describe how to simulate special cases that are frequently encountered in genomes, such as 5' truncated non-LTR insertions or solo-LTR insertions.

Special use cases

solo-LTRs

For many LTR families, solo-LTR insertions have been observed.
There are two ways to simulate solo-LTRs:

provide the solo LTR as additional entry in the TE sequence file

The solo LTR may be provided as an additional entry in the TE sequence file.
For example, given the TE sequence file:

>gypsy
AAAAATCTCTCGGTCAAAAA

and assuming that AAAAA is the LTR of the gypsy family, a separate entry for the solo LTR may be added to the TE sequence file:

>gypsy
AAAAATCTCTCGGTCAAAAA
>gypsy_LTR
AAAAA

The following definitions may then be used in the pgd-file:

gypsy=$1
gypsy_LTR=$2
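
As a minimal sketch of this first approach, the following Python snippet builds the extended TE sequence file from a full-length element. The helper `with_solo_ltr` and its `ltr_len` parameter are hypothetical and not part of SimulaTE; the snippet assumes that the LTR is simply the first `ltr_len` bases of the entry.

```python
def with_solo_ltr(name, seq, ltr_len):
    """Return FASTA text holding the full element plus a solo-LTR entry.

    Assumption (not from SimulaTE): the LTR is the first `ltr_len` bases of `seq`.
    """
    if not 0 < ltr_len <= len(seq):
        raise ValueError("ltr_len must be between 1 and len(seq)")
    ltr = seq[:ltr_len]
    # The solo-LTR entry is written directly after the full-length entry,
    # so it becomes the second sequence ($2) in the file.
    return f">{name}\n{seq}\n>{name}_LTR\n{ltr}\n"


fasta_text = with_solo_ltr("gypsy", "AAAAATCTCTCGGTCAAAAA", 5)
print(fasta_text, end="")
```

Because the solo LTR is written as the second entry, it can be addressed as $2 in the pgd-file, matching the definitions above.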

define the solo LTR in the header of the pgd file

The sequence of the solo LTR may instead be derived from the full-length element in the header of the pgd-file.
For example, given the TE sequence file:

>gypsy
AAAAATCTCTCGGTCAAAAA

we may provide the following header in the pgd-file:

gypsy=$1
gypsy_LTR=gypsy[6..$]    # yields gypsy_LTR="AAAAA"

The DSL code gypsy[6..$] removes all bases from position 6 to the end of the sequence (i.e. TCTCTCGGTCAAAAA is removed), so that only the leading LTR remains.
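
To check the coordinates of such a deletion before running a simulation, the behaviour can be emulated in a few lines of Python. This is only a sketch of the semantics described above (1-based, inclusive coordinates, with `$` denoting the last base); the function `delete_range` is a hypothetical helper and not part of SimulaTE.

```python
from typing import Optional


def delete_range(seq: str, start: int, end: Optional[int] = None) -> str:
    """Remove bases start..end (1-based, inclusive) from seq.

    end=None plays the role of `$` (the last base) in the pgd DSL.
    """
    if end is None:
        end = len(seq)
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates out of range")
    # Keep everything before `start` and everything after `end`.
    return seq[:start - 1] + seq[end:]


gypsy = "AAAAATCTCTCGGTCAAAAA"
gypsy_ltr = delete_range(gypsy, 6)  # emulates gypsy[6..$]
print(gypsy_ltr)  # AAAAA
```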

5' truncated non-LTRs

Many non-LTR insertions have 5' truncated sequences.
We recommend defining the different 5' truncated TEs in the header of the pgd-file.

For example, given the following TE sequence file:

>I-element
AAAAATTTTTCCCCCGGGGG

we may provide the following header in the pgd-file to generate multiple 5' truncated copies of the I-element:

ielement=$1
trunc1=ielement[^..5]    # yields trunc1="TTTTTCCCCCGGGGG"
trunc2=ielement[^..10]   # yields trunc2="CCCCCGGGGG"
trunc3=ielement[^..15]   # yields trunc3="GGGGG"

Here ielement[^..N] removes the first N bases of the sequence, so this header defines three 5' truncated variants of the I-element: trunc1, trunc2 and trunc3.
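
When many truncated variants are needed, writing the header lines by hand becomes error-prone. The following Python sketch generates the lines together with the sequence each one is expected to yield. It uses only the `[^..N]` syntax shown above; the function name `truncation_lines` and the `truncN` naming scheme are assumptions made for illustration.

```python
def truncation_lines(alias, seq, cut_points):
    """Return (pgd_line, expected_sequence) pairs for 5' truncations.

    Each cut point N produces `alias[^..N]`, i.e. bases 1..N are removed.
    """
    pairs = []
    for i, cut in enumerate(cut_points, start=1):
        if not 1 <= cut < len(seq):
            raise ValueError(f"cut point {cut} would remove the whole sequence")
        pairs.append((f"trunc{i}={alias}[^..{cut}]", seq[cut:]))
    return pairs


pairs = truncation_lines("ielement", "AAAAATTTTTCCCCCGGGGG", [5, 10, 15])
for line, expected in pairs:
    print(f'{line}    # yields "{expected}"')
```

The printed lines can be pasted into the header of the pgd-file after the definition ielement=$1.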

internally truncated TIRs

Many TIR insertions are internally truncated (e.g. the KP-element is an internally truncated P-element).

For example, given the following TE sequence file:

>P-element
AACCTTGGGGTTCCAA

we may provide the following header in the pgd-file to generate multiple internally truncated copies of the P-element:

pelement=$1
deletion1=pelement[7..10]     # yields deletion1="AACCTTTTCCAA"
deletion2=pelement[5..12]     # yields deletion2="AACCCCAA"
deletion3=pelement[3..14]     # yields deletion3="AAAA"
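
Internal deletions of TIR elements usually need to leave both ends intact, so it can be convenient to derive the deletion coordinates from the number of terminal bases to keep. The Python sketch below does this for the example above. The helper `internal_deletion` and its `keep` parameter are hypothetical; coordinates are assumed to be 1-based and inclusive, as in the examples.

```python
def internal_deletion(alias, seq, keep):
    """Delete the interior of seq, keeping `keep` bases at each end.

    Returns (pgd_expression, resulting_sequence).
    """
    start = keep + 1        # first deleted base (1-based)
    end = len(seq) - keep   # last deleted base (1-based, inclusive)
    if keep < 1 or start > end:
        raise ValueError("keep leaves nothing to delete")
    expression = f"{alias}[{start}..{end}]"
    return expression, seq[:keep] + seq[end:]


pelement = "AACCTTGGGGTTCCAA"
results = [internal_deletion("pelement", pelement, k) for k in (6, 4, 2)]
for expression, sequence in results:
    print(f'{expression}    # yields "{sequence}"')
```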

diverged/evolved TE families

We distinguish two cases:

a diverged family

Let's assume that gypsy2 is a novel family that diverged from the gypsy family.
We recommend defining the diverged gypsy2 family in the header of the pgd-file.

For example:

gypsy=$1  
gypsy2=gypsy+10%

This generates the gypsy2 family with 10% base substitutions relative to the gypsy family. For more complex patterns of divergence (e.g. involving many indels), we recommend providing a separate entry for the family in the TE sequence file (see the solo-LTR section above for an example of an additional entry).
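
To illustrate what a 10% substitution divergence means at the sequence level, the following Python sketch replaces a fixed fraction of sites with a different base. It is not SimulaTE's implementation: choosing exactly round(fraction * length) distinct sites, and drawing the new base uniformly from the other three, are assumptions made here for illustration. The function `substitute` and the random 1000 bp test sequence are likewise hypothetical.

```python
import random


def substitute(seq, fraction, rng):
    """Replace round(fraction * len(seq)) randomly chosen bases with a different base."""
    n_sites = round(fraction * len(seq))
    sites = rng.sample(range(len(seq)), n_sites)  # distinct positions
    bases = list(seq)
    for i in sites:
        # Always pick a base that differs from the current one.
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)


rng = random.Random(42)
gypsy = "".join(rng.choice("ACGT") for _ in range(1000))
gypsy2 = substitute(gypsy, 0.10, rng)

n_diff = sum(a != b for a, b in zip(gypsy, gypsy2))
print(n_diff / len(gypsy))  # 0.1
```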

diverged sequences of a given family

For example, Ine-1 is a highly diverged TE family of Drosophila melanogaster in which individual insertions are frequently diverged by more than 10% (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-2-r39).

We recommend defining the sequences of diverged family members in the pgd-file, as shown in the following example:

ine1=$1
2 ine1+5% ine1+5% * * * *
6 ine1+5% * * * ine1+5% *
9 * * * * * ine1+5%

Note that in this example five different sequences of ine1 are generated, each having 5% base substitutions relative to the original ine1. Hence randomly picked pairs of ine1 insertions (irrespective of the insertion position) will have a total divergence of approximately 10% (5% from each insertion).
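
The pairwise divergence can be checked with a small simulation. The Python sketch below derives five copies from one ancestral sequence, each with 5% substitutions, and computes the divergence of all ten pairs. It rests on the same assumptions as any such emulation: the `substitute` helper, the random 10000 bp ancestor and the uniform choice of replacement bases are hypothetical and are not taken from SimulaTE.

```python
import itertools
import random


def substitute(seq, fraction, rng):
    """Replace round(fraction * len(seq)) randomly chosen bases with a different base."""
    sites = rng.sample(range(len(seq)), round(fraction * len(seq)))
    bases = list(seq)
    for i in sites:
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)


def divergence(a, b):
    """Fraction of aligned positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(1)
ine1 = "".join(rng.choice("ACGT") for _ in range(10000))

# Five independently diverged copies, as in the pgd example above.
copies = [substitute(ine1, 0.05, rng) for _ in range(5)]

pairwise = [divergence(a, b) for a, b in itertools.combinations(copies, 2)]
mean_pairwise = sum(pairwise) / len(pairwise)
print(round(mean_pairwise, 3))
```

Under these assumptions the pairwise values come out close to, and slightly below, 0.10. The small shortfall arises because a few sites are substituted in both copies, and some of those end up with the same base.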

Wiki: Home
Wiki: describing_TE_landscapes
Wiki: describing_TE_sequences