describing_TE_sequences

Robert Kofler

Introduction

We developed a simple domain specific language for describing arbitrarily complex TE landscapes [describing_TE_landscapes], i.e. the positions and sequences of TEs within a population. An important component of this TE landscape is the sequence of each insertion. TE sequences may be provided directly in the header of the pgd-file or loaded from a fasta-file (see [describing_TE_landscapes]).
Starting from this set of initial TE sequences (e.g. the consensus sequences of TE families), our domain specific language allows the definition of arbitrarily complex derived sequences having, for example, large deletions (e.g. the KP element is a truncated P-element), complex nested insertions, and sequence divergence.

Within the pgd-file, derived sequences may either be defined in the header, as in the following example:

hobo="TTT"
roo="CCCC"
nesthobo=hobo+{2:roo}

where we define the sequence nesthobo as a hobo insertion having a nested roo insertion at the second position.

Alternatively, derived sequences may be defined directly at the insertion site, as in the following example of a pgd-file:

hobo="TTT"
roo="CCCC"
9 * * * * * hobo+{2:roo}

In most cases it makes no difference which of these two strategies is used. The choice only matters when a sequence divergence is provided: a diverged TE sequence defined in the header is mutated only once, so all insertions referring to it share the same mutations, whereas each TE defined at the insertion site is mutated separately. For more details on defining the sequences see [describing_TE_landscapes].

A domain specific language for defining insertion sequences

In the following examples we use a chassis consisting of the sequence "123456789". Usually a DNA sequence has to be provided, but this numerical sequence makes it much easier to explain our domain specific language. All examples shown have been tested and most are also implemented as unit-tests, so their proper execution is ensured.

Strand

The strand is specified by a plus or minus sign following the TE name.
Given the following pgd-file:

chasis="123456789"
h="TTT"
h_plus=h+
h_minus=h-
2 h_plus
4 h_minus

We obtain the sequence:

12TTT34AAA56789

If no strand is provided, the plus strand is assumed by default.
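The strand rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the actual implementation; the names reverse_complement and insert_all are hypothetical, and positions are assumed to count the chassis bases to the left of the insertion site.

```python
# Illustrative sketch only; function names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def insert_all(chassis, insertions):
    """Insert (position, sequence, strand) tuples into the chassis.

    A position is the number of chassis bases to the left of the site
    (an assumption of this sketch). Working from right to left keeps
    the positions of the remaining insertions valid.
    """
    result = chassis
    for pos, seq, strand in sorted(insertions, reverse=True):
        te = seq if strand == "+" else reverse_complement(seq)
        result = result[:pos] + te + result[pos:]
    return result


# h_plus at position 2, h_minus at position 4
print(insert_all("123456789", [(2, "TTT", "+"), (4, "TTT", "-")]))
# prints 12TTT34AAA56789
```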

TSD

basics

The target site duplication (TSD) must be provided after the strand as a length in base pairs, for example 3bp.

Given the following pgd-file

chasis="123456789"
h="TTT"
h_plus=h+1bp
h_minus=h-2bp
2 h_plus
7 h_minus

we obtain

12TTT234567AAA6789

By default the TSD of the parent TE is used. If a TE has no parent, a TSD of zero is used.

Note that the bases to the left of the insertion site are used for the TSD. In other words, when choosing the insertion site, pick the site at the rightmost end of the TSD.
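A minimal sketch of the TSD rule, assuming that the duplicated bases are copied directly behind the inserted TE. The function name insert_with_tsd is hypothetical and the TE sequences are passed in already oriented; this is not the actual implementation.

```python
# Illustrative sketch only; insert_with_tsd is a hypothetical helper.
def insert_with_tsd(chassis, insertions):
    """Insert (position, te_sequence, tsd_length) tuples into the chassis.

    The tsd_length bases to the left of the insertion site are
    duplicated to the right of the TE. Insertions are applied from
    right to left so that earlier positions stay valid.
    """
    result = chassis
    for pos, te, tsd in sorted(insertions, reverse=True):
        if tsd > pos:
            raise ValueError("TSD is longer than the sequence left of the site")
        duplicated = result[pos - tsd:pos]  # bases left of the site
        result = result[:pos] + te + duplicated + result[pos:]
    return result


# h_plus=h+1bp at position 2, h_minus=h-2bp at position 7
print(insert_with_tsd("123456789", [(2, "TTT", 1), (7, "AAA", 2)]))
# prints 12TTT234567AAA6789
```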

per default the TSD of the parent is used

For example with the following pgd-file:

chasis="123456789"
h="TTT"+2bp
h_plus=h+
h_minus=h-
2 h_plus
7 h_minus

we obtain the sequence:

12TTT1234567AAA6789

The parent is h with a TSD of 2bp. The two children h_plus and h_minus inherit this TSD of 2bp. If a different TSD should be used for a child, just provide a new TSD (e.g. specify h_plus=h+3bp).

deletions/truncations

basics

Deletions or truncations within TEs may be specified using square brackets that enclose the regions to be deleted. Multiple regions may be specified.

For example, given the pgd-file:

chasis="123456789"
hobo="AATTCCGG"
d=hobo[3..6]
5 d

we obtain

12345AAGG6789

The syntax hobo[3..6] specifies that the bases 3 to 6 should be removed from hobo.

multiple deletions

Multiple deletions may be specified. They need to be separated by a comma (,).

Given the pgd-file:

chasis="123456789"
hobo="AATCCGG"
d=hobo[3..3,6..7]
5 d

we obtain

12345AACC6789

Note: overlapping deletions (e.g. [4..6,4..12]) do not cause an error. Every base is deleted only once.
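The deletion semantics, including the tolerance for overlapping regions, can be sketched as follows. The helper delete_regions is hypothetical and only mirrors the behaviour described above: coordinates are 1-based and inclusive, and collecting the positions in a set means that a base covered by several regions is still removed only once.

```python
# Illustrative sketch only; delete_regions is a hypothetical helper.
def delete_regions(seq, regions):
    """Remove 1-based, inclusive (start, end) regions from seq."""
    doomed = set()
    for start, end in regions:
        doomed.update(range(start, end + 1))
    return "".join(
        base for i, base in enumerate(seq, start=1) if i not in doomed
    )


print(delete_regions("AATTCCGG", [(3, 6)]))         # prints AAGG
print(delete_regions("AATCCGG", [(3, 3), (6, 7)]))  # prints AACC
print(delete_regions("AATTCCGG", [(3, 5), (4, 6)]))  # overlapping, prints AAGG
```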

deletions and strand information

Deletions are introduced before reverse complementing the TE. The rationale is that many TEs have distinct truncations that always occur at the same position. For example, an important variant of the P-element (2907bp length) is the KP-element, in which the bases 808-2560 of the P-element are deleted. The KP-element can be generated with the syntax kp=pelement[808..2560]+, or its reverse complement with kp=pelement[808..2560]-.

As another example, given:

chasis="123456789"
hobo="ATCG"
h_p=hobo[2..3]+
h_m=hobo[2..3]-
2 h_p
5 h_m

we obtain:

12AG345CT6789
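The order of operations (truncate first, reverse complement second) can be made explicit with a small sketch. The helper names are hypothetical. The second sequence, AAACCG, is not from the examples above; it was chosen here only because it gives a different result when the two steps are swapped.

```python
# Illustrative sketch only; helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def truncate(seq, start, end):
    """Delete the 1-based, inclusive region start..end."""
    return seq[:start - 1] + seq[end:]


def derive(seq, start, end, strand):
    """Truncate first, then orient the result according to the strand."""
    truncated = truncate(seq, start, end)
    return truncated if strand == "+" else reverse_complement(truncated)


print(derive("ATCG", 2, 3, "+"))  # prints AG
print(derive("ATCG", 2, 3, "-"))  # prints CT

# The order matters: swapping the two steps gives a different sequence.
print(derive("AAACCG", 1, 2, "-"))                    # prints CGGT
print(truncate(reverse_complement("AAACCG"), 1, 2))   # prints GTTT
```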

special position symbols: ^ | $

The following special symbols are supported

  • ^ beginning of the sequence
  • $ end of the sequence
  • | middle of the sequence

For example, given the pgd-file:

chasis="123456789"
hobo="AATTCCGG"
d=hobo[^..2,7..$]
2 d

we obtain

12TTCC3456789
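One way to picture how the special symbols resolve to coordinates in a deletion is sketched below. All names are hypothetical. The mapping of | to half the sequence length (integer division) is an assumption of this sketch, since the exact definition of the middle is not stated above.

```python
# Illustrative sketch only; names and the handling of "|" are assumptions.
def resolve(token, length):
    """Translate a coordinate token of a deletion into a 1-based position."""
    if token == "^":
        return 1            # beginning of the sequence
    if token == "$":
        return length       # end of the sequence
    if token == "|":
        return length // 2  # middle; exact rounding is an assumption
    return int(token)


def apply_deletions(seq, spec):
    """Apply a deletion spec such as '^..2,7..$' to seq."""
    doomed = set()
    for part in spec.split(","):
        start, end = (resolve(t.strip(), len(seq)) for t in part.split(".."))
        doomed.update(range(start, end + 1))
    return "".join(b for i, b in enumerate(seq, start=1) if i not in doomed)


print(apply_deletions("AATTCCGG", "^..2,7..$"))  # prints TTCC
```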

diverged insertions

basics

It is possible to introduce random mutations (base substitutions) into a sequence by simply providing the percentage of divergence after the strand.

For example, given a pgd-file:

chasis="123456789"
hobo="AAAAAAAAA"
d=hobo+50%
5 d

we obtain, for example (the substitutions are random):

12345AGACCAAAG6789
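A rough sketch of how such random base substitutions could be introduced. The function diverge is hypothetical, and the choice to mutate a fixed number of sites (sequence length times the percentage, rounded down) is an assumption; the actual tool may instead draw every base independently. Because the substitutions are random, the output will not reproduce the sequence shown above.

```python
# Illustrative sketch only; diverge and its sampling scheme are assumptions.
import random


def diverge(seq, percent, rng):
    """Substitute int(len(seq) * percent / 100) randomly chosen bases."""
    n_sites = int(len(seq) * percent / 100)
    bases = list(seq)
    for i in rng.sample(range(len(bases)), n_sites):
        # a substitution always changes the base
        alternatives = [b for b in "ACGT" if b != bases[i].upper()]
        bases[i] = rng.choice(alternatives)
    return "".join(bases)


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
mutated = diverge("AAAAAAAAAA", 50, rng)
print(mutated)
```

Calling diverge once and reusing the result corresponds to a sequence defined in the header, whereas calling it anew for every insertion corresponds to a definition at the insertion site.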

combining TSD and sequence divergence

It is possible to provide both a TSD and a sequence divergence:

For example given the pgd-file:

chasis="123456789"
hobo="AAAAAAAAA"
d=hobo-50%3bp
5 d

we obtain:

12345TTCTAATGT3456789

nested insertions

basics

Nested insertions must be provided in curly brackets after the strand (+ or -).

Given the pgd-file:

chasis="123456789"
hobo="AACC"
roo="TTT"
d=hobo+{2:roo}
2 d

we obtain:

12AATTTCC3456789

Here roo will be inserted into hobo at position 2. The position always refers to the parent sequence!
Note that the special symbols ^ | $ are also permitted for the position of nested insertions.

nested insertions and strand of the parent sequence

Nested insertions are integrated into the parent sequence after reverse complementing the parent. Note that this behaviour differs from internal truncations: truncated regions are removed before reverse complementing the parent sequence. The rationale is that truncated TEs are specific derivatives of a TE family that may occur at high abundance within genomes, while nested insertions are unique events that mostly occur only once (usually, a nested insertion inactivates a TE). Hence, we aimed to make the description of nested insertions as intuitive and easy as possible, and we argue that inserting after reverse complementing is easier to understand, especially with multiple deeply nested insertions.

For example given the pgd-file:

chasis="123456789"
hobo="AACC"
roo="TTT"
d_plus=hobo+{1:roo}
d_minus=hobo-{1:roo}
2 d_plus
5 d_minus

we obtain:

12ATTTACC345GTTTGTT6789
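The rule that the parent is oriented first and the child is inserted afterwards can be sketched as follows. The helper nest is hypothetical; the position counts the bases of the already oriented parent to the left of the nested insertion.

```python
# Illustrative sketch only; nest is a hypothetical helper.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def nest(parent, strand, pos, child):
    """Orient the parent first, then insert the child at pos."""
    oriented = parent if strand == "+" else reverse_complement(parent)
    return oriented[:pos] + child + oriented[pos:]


print(nest("AACC", "+", 1, "TTT"))  # prints ATTTACC
print(nest("AACC", "-", 1, "TTT"))  # prints GTTTGTT
```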

nested insertions and strand of the child sequence

The strand of a nested insertion may of course also be specified.

Given the pgd-file:

chasis="123456789"
hobo="AACC"
roo="TTT"
d_plus=hobo+{1:roo+}
d_minus=hobo+{1:roo-}
2 d_plus
5 d_minus

we obtain:

12ATTTACC345AAAAACC6789

a nested child insertion may also generate a TSD

Nested insertions may also produce a TSD.

Given the pgd-file:

chasis="123456789"
hobo="atcg"+1bp
roo="TTT"+2bp
d=hobo+{2:roo+}
5 d

we obtain:

12345atTTTatcg56789

Note that the insertion of roo resulted in a duplication of the hobo sequence "at", while the insertion of hobo resulted in the duplication of "5" in the chassis.
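The two levels of duplication can be traced with a short sketch in which the same hypothetical helper is applied twice: first to place roo into hobo, then to place the resulting sequence into the chassis. It assumes, as in the basic TSD examples, that the duplicated bases follow the inserted sequence.

```python
# Illustrative sketch only; insert is a hypothetical helper.
def insert(target, pos, te, tsd=0):
    """Insert te at pos and duplicate the tsd bases left of the site."""
    if tsd > pos:
        raise ValueError("TSD is longer than the sequence left of the site")
    return target[:pos] + te + target[pos - tsd:pos] + target[pos:]


# roo (TSD 2bp) goes into hobo at position 2 and duplicates "at"
nested = insert("atcg", 2, "TTT", tsd=2)
print(nested)  # prints atTTTatcg

# the nested hobo (TSD 1bp) goes into the chassis at position 5
print(insert("123456789", 5, nested, tsd=1))  # prints 12345atTTTatcg56789
```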

nested insertions and TSDs

The TSD of a TE that carries nested insertions may be provided by adding, say, 3bp after the closing curly bracket of the nested insertions.

Given a pgd-file:

chasis="123456789"
hobo="atcg"
roo="TTTT"
d=hobo+{2:roo+}3bp
5 d

we obtain:

12345atTTTTcg3456789

nested insertions and sequence divergence

Nested insertions and sequence divergence may be combined. Note that random mutations are only introduced into the parent sequence, not into the nested child insertions. This allows more fine-grained control and avoids multiple rounds of mutation for nested sequences.

Given the pgd-file:

chasis="123456789"
hobo="AAAAAAAAAA"
pelement="CCCCCCCCCC"
d=hobo+{5:pelement+}50%
5 d

we obtain:

12345CAACACCCCCCCCCCGGAAA6789

nested insertions and truncations

Nested insertions and truncations may be combined. Given the pgd-file:

chasis="123456789"
hobo="TTtttAA"
pelement="CCCC"
d=hobo[3..5]+{3:pelement+}
5 d

we obtain

12345TTACCCCA6789

Note: first the truncated sequence is generated (TTtttAA -> TTAA), and second the child TE is inserted into the parent sequence. Thus, when computing the insertion position of the child TE (position 3 of hobo), the truncated regions are ignored.
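The two steps of this note can be written out as a short sketch (hypothetical helper names, plus strand only): the parent is truncated first, and the child position then refers to the truncated sequence.

```python
# Illustrative sketch only; helper names are hypothetical.
def truncate(seq, start, end):
    """Delete the 1-based, inclusive region start..end."""
    return seq[:start - 1] + seq[end:]


def nest(parent, pos, child):
    """Insert child after the first pos bases of parent."""
    return parent[:pos] + child + parent[pos:]


truncated = truncate("TTtttAA", 3, 5)
print(truncated)                   # prints TTAA
print(nest(truncated, 3, "CCCC"))  # prints TTACCCCA
```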

multiple nested insertions

Several nested insertions may be specified; they must be separated by a comma (,).

Given the pgd-file:

chasis="123456789"
hobo="atcg"
roo="TTT"
d=hobo+{1:roo+,3:roo-}
5 d

we obtain:

12345aTTTtcAAAg6789

deeply nested insertions

Our domain specific language also allows recursively nested insertions to be specified (up to an arbitrary depth).

For example, given the pgd-file:

chasis="123456789"
hobo="atcg"
roo="TTTT"
pelement="CCC"
d=hobo+{2:roo+{3:pelement+}}
5 d

we obtain:

12345atTTTCCCTcg6789

Here we have a pelement inserted into a roo at position 3 (of roo). This roo is in turn inserted into hobo (at position 2 of hobo).
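Recursive nesting maps naturally onto a small recursive function. In the sketch below each TE is a hypothetical (sequence, strand, children) tuple, where children is a list of (position, child) pairs; this data layout and the function build are assumptions made for illustration and ignore TSDs, truncations and divergence. Children are inserted from right to left so that all positions keep referring to the parent sequence.

```python
# Illustrative sketch only; the tuple layout and build are assumptions.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build(node):
    """Build the sequence of a (sequence, strand, children) node.

    children is a list of (position, child_node) pairs; positions refer
    to the oriented parent sequence.
    """
    seq, strand, children = node
    result = seq if strand == "+" else reverse_complement(seq)
    for pos, child in sorted(children, key=lambda c: c[0], reverse=True):
        result = result[:pos] + build(child) + result[pos:]
    return result


# d=hobo+{2:roo+{3:pelement+}}
pelement = ("CCC", "+", [])
roo = ("TTTT", "+", [(3, pelement)])
hobo = ("atcg", "+", [(2, roo)])
print(build(hobo))  # prints atTTTCCCTcg

# d=hobo+{1:roo+,3:roo-} from the section on multiple nested insertions
multi = ("atcg", "+", [(1, ("TTT", "+", [])), (3, ("TTT", "-", []))])
print(build(multi))  # prints aTTTtcAAAg
```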

Arbitrarily complex scenarios

In the final example we demonstrate that deletions, nested insertions, sequence divergence and deeply nested insertions may be combined arbitrarily.

For example given the pgd-file:

chasis="123456789"
hobo="aattccgg"+2bp
roo="TTTT"-1bp
pelement="CCCC"
d=hobo[3..4]+{1:roo[3..$],3:pelement+{2:roo-2bp}}3bp
1 roo
5 d
8 hobo

we obtain:

1AAAA12345aAAaacCCTTTTCCCCcgg345678aattccgg789

Note that the characters 0123456789 and atcg (instead of uppercase ATCG) have only been used for illustrative purposes. Only the characters ATCG are allowed for real data.


Related

Wiki: Home
Wiki: TheClassic_SanMiguel_TELandscape
Wiki: Walkthrough
Wiki: describing_TE_landscapes
Wiki: special_use_cases
