We developed a simple domain-specific language that allows arbitrarily complex TE landscapes to be described [describing_TE_landscapes], i.e. the positions and sequences of TEs within a population. An important component of such a TE landscape is the sequence of each insertion. TE sequences may be provided directly in the header of the pgd-file or loaded from a fasta-file (see [describing_TE_landscapes]).
Based on this set of initial TE sequences (e.g. the consensus sequences of TE families), our domain-specific language also allows arbitrarily complex derived sequences to be defined, for example sequences with large deletions (e.g. the KP element is a truncated P-element), complex nested insertions, and sequence divergence.
Within a pgd-file, derived sequences may either be defined in the header, as in the following example:
hobo="TTT"
roo="CCCC"
nesthobo=hobo+{2:roo}
where we define the sequence nesthobo as a hobo insertion carrying a nested roo insertion at position 2.
Alternatively, derived sequences may be defined directly at the insertion site, as in the following example of a pgd-file:
hobo="TTT"
roo="CCCC"
9 * * * * * hobo+{2:roo}
In most cases it makes no difference which of these two strategies is used. The choice only matters when a sequence divergence is provided: a diverged TE sequence defined in the header is mutated only once, whereas each TE defined at the insertion site is mutated separately. For more details on defining the sequences see [describing_TE_landscapes].
In the following examples we use a chassis (declared with the keyword chasis) consisting of the sequence "123456789". Real data of course require a DNA sequence, but this numerical sequence makes it much easier to explain our domain-specific language. All examples shown have been tested, and most are also implemented as unit tests, so their proper execution is ensured.
The strand may be specified by a plus or minus sign following the TE name.
Given the following pgd-file:
chasis="123456789"
h="TTT"
h_plus=h+
h_minus=h-
2 h_plus
4 h_minus
We obtain the sequence:
12TTT34AAA56789
If no strand is provided, the plus strand is assumed by default.
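The strand example above can be reproduced with a few lines of Python. This is only an illustrative sketch, not the tool's implementation; the helper names reverse_complement and insert_all are hypothetical, and positions are assumed to count the chassis bases to the left of the insertion site.

```python
# Illustrative sketch only; helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def insert_all(chassis: str, insertions: list[tuple[int, str]]) -> str:
    """Insert each (position, sequence) pair into the chassis.

    A position is the number of chassis bases to the left of the site.
    """
    out = chassis
    # Work from right to left so that earlier insertions do not shift
    # the coordinates of the remaining ones.
    for pos, seq in sorted(insertions, reverse=True):
        out = out[:pos] + seq + out[pos:]
    return out


h = "TTT"
h_plus = h                       # h+ : sequence as given
h_minus = reverse_complement(h)  # h- : reverse complement
result = insert_all("123456789", [(2, h_plus), (4, h_minus)])
print(result)  # 12TTT34AAA56789
```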
The target site duplication (TSD) must be provided after the strand, in the form of e.g. 3bp.
Given the following pgd-file:
chasis="123456789"
h="TTT"
h_plus=h+1bp
h_minus=h-2bp
2 h_plus
7 h_minus
we obtain
12TTT234567AAA6789
By default the TSD of the parent TE is used. If a TE has no parent, a TSD of zero is used.
Note that the bases to the left of the insertion site are used for the TSD. In other words, when choosing the insertion site, pick the site at the rightmost end of the TSD.
For example, with the following pgd-file:
chasis="123456789"
h="TTT"+2bp
h_plus=h+
h_minus=h-
2 h_plus
7 h_minus
we obtain the sequence:
12TTT1234567AAA6789
The parent is h with a TSD of 2bp; the two children h_plus and h_minus inherit this TSD. If a different TSD should be used for a child, just provide a new TSD (e.g. specify h_plus=h+3bp).
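The TSD rule (the bases to the left of the insertion site are repeated after the TE) can be sketched as follows. The function insert_with_tsd is a hypothetical helper written for illustration, and TSD inheritance from the parent is not modelled; the TSD length is simply passed in for each insertion.

```python
# Illustrative sketch only; insert_with_tsd is a hypothetical helper.
def insert_with_tsd(chassis: str, insertions: list[tuple[int, str, int]]) -> str:
    """Insert (position, sequence, tsd_length) triples into the chassis.

    The tsd_length chassis bases to the left of the insertion site are
    duplicated directly after the inserted sequence.
    """
    out = chassis
    # Right to left, so positions keep referring to the original chassis.
    for pos, seq, tsd in sorted(insertions, reverse=True):
        duplicated = chassis[max(0, pos - tsd):pos]
        out = out[:pos] + seq + duplicated + out[pos:]
    return out


# h+1bp at position 2 and h-2bp at position 7 (AAA is the reverse complement of TTT)
print(insert_with_tsd("123456789", [(2, "TTT", 1), (7, "AAA", 2)]))
# 12TTT234567AAA6789

# Both children carrying the 2bp TSD of their parent
print(insert_with_tsd("123456789", [(2, "TTT", 2), (7, "AAA", 2)]))
# 12TTT1234567AAA6789
```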
Deletions or truncations within TEs may be specified using square brackets that list the regions to be deleted. Multiple regions may be specified.
For example, given the pgd-file:
chasis="123456789"
hobo="AATTCCGG"
d=hobo[3..6]
5 d
we obtain
12345AAGG6789
The syntax hobo[3..6] specifies that bases 3 to 6 (inclusive) should be removed from hobo.
Multiple deletions may be specified; they need to be separated by a comma (,).
Given the pgd-file:
chasis="123456789"
hobo="AATCCGG"
d=hobo[3..3,6..7]
5 d
we obtain
12345AACC6789
Note that overlapping deletions (e.g. [4..6,4..12]) do not cause an error. Every base is deleted only once.
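One way to picture the deletion semantics, including overlapping regions, is to collect all doomed positions in a set and then drop them. The following is a minimal sketch under that assumption; delete_regions is a hypothetical helper, and regions are 1-based and inclusive as in the examples above.

```python
# Illustrative sketch only; delete_regions is a hypothetical helper.
def delete_regions(seq: str, regions: list[tuple[int, int]]) -> str:
    """Remove the given 1-based, inclusive regions from seq."""
    doomed = set()
    for start, end in regions:
        # A set makes overlapping regions harmless: each base is
        # marked (and later removed) only once.
        doomed.update(range(start, end + 1))
    return "".join(base for i, base in enumerate(seq, start=1) if i not in doomed)


print(delete_regions("AATTCCGG", [(3, 6)]))           # AAGG
print(delete_regions("AATCCGG", [(3, 3), (6, 7)]))    # AACC
print(delete_regions("AATTCCGG", [(4, 6), (4, 12)]))  # AAT
```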
Deletions are introduced before reverse complementing the TE. The rationale is that many TEs have distinct truncations that always occur at the same position. For example, an important variant of the P-element (2907bp in length) is the KP-element, in which bases 808-2560 of the P-element are deleted. The KP-element can be generated with the syntax kp=pelement[808..2560]+, or its reverse complement with kp=pelement[808..2560]-.
As another example, given:
chasis="123456789"
hobo="ATCG"
h_p=hobo[2..3]+
h_m=hobo[2..3]-
2 h_p
5 h_m
we obtain:
12AG345CT6789
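The order of operations (delete first, then reverse complement) can be sketched as below. The helper names are hypothetical, and the second, asymmetric sequence is an extra illustration that is not taken from the examples above; it shows that the order matters whenever the deleted region is not symmetric.

```python
# Illustrative sketch only; helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def delete_regions(seq: str, regions: list[tuple[int, int]]) -> str:
    """Remove 1-based, inclusive regions from seq."""
    doomed = {i for start, end in regions for i in range(start, end + 1)}
    return "".join(b for i, b in enumerate(seq, start=1) if i not in doomed)


def derive(seq: str, regions: list[tuple[int, int]], strand: str) -> str:
    """Truncate on the original coordinates, then orient the result."""
    truncated = delete_regions(seq, regions)
    return truncated if strand == "+" else reverse_complement(truncated)


print(derive("ATCG", [(2, 3)], "+"))  # AG
print(derive("ATCG", [(2, 3)], "-"))  # CT

# Asymmetric deletion: deleting before orienting differs from the reverse order.
print(derive("AATTCCGG", [(1, 2)], "-"))                         # CCGGAA
print(delete_regions(reverse_complement("AATTCCGG"), [(1, 2)]))  # GGAATT
```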
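Special symbols may be used in place of numeric positions. As the following example shows, ^ refers to the start and $ to the end of the sequence.
For example, given the pgd-file: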
chasis="123456789"
hobo="AATTCCGG"
d=hobo[^..2,7..$]
2 d
we obtain
12TTCC3456789
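The resolution of ^ and $ can be sketched with a small parser for the text between the square brackets. This is a hypothetical illustration, not the tool's parser: it assumes, as the example above suggests, that ^ resolves to position 1 and $ to the sequence length, and it does not handle any other special symbol.

```python
# Illustrative sketch only; parse_regions and delete_regions are hypothetical helpers.
def parse_regions(spec: str, length: int) -> list[tuple[int, int]]:
    """Turn a spec such as '^..2,7..$' into 1-based, inclusive regions."""
    def resolve(token: str) -> int:
        if token == "^":   # assumed: first base of the sequence
            return 1
        if token == "$":   # assumed: last base of the sequence
            return length
        return int(token)

    regions = []
    for part in spec.split(","):
        start, end = part.split("..")
        regions.append((resolve(start), resolve(end)))
    return regions


def delete_regions(seq: str, regions: list[tuple[int, int]]) -> str:
    doomed = {i for start, end in regions for i in range(start, end + 1)}
    return "".join(b for i, b in enumerate(seq, start=1) if i not in doomed)


hobo = "AATTCCGG"
regions = parse_regions("^..2,7..$", len(hobo))
print(regions)                        # [(1, 2), (7, 8)]
print(delete_regions(hobo, regions))  # TTCC
```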
It is possible to introduce random mutations (base substitutions) into a sequence by providing the percentage of divergence after the strand.
For example, given the pgd-file:
chasis="123456789"
hobo="AAAAAAAAA"
d=hobo+50%
5 d
we obtain, for example (the substitutions are random):
12345AGACCAAAG6789
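How the random substitutions are drawn is not specified here. Purely as an assumption, the sketch below substitutes each base independently with a probability equal to the divergence; the function diverge and this sampling scheme are hypothetical and may differ from what the tool does. The sequence length never changes, because only substitutions are introduced.

```python
import random


# Illustrative sketch only; the per-base sampling scheme is an assumption.
def diverge(seq: str, percent: float, rng: random.Random) -> str:
    """Substitute each base with probability percent/100."""
    out = []
    for base in seq:
        if rng.random() < percent / 100:
            # Pick one of the three other bases so that the base really changes.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(42)  # fixed seed, so that repeated runs give the same result
mutated = diverge("AAAAAAAAA", 50, rng)
print("12345" + mutated + "6789")
```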
It is possible to provide both a sequence divergence and a TSD. In the following example the divergence precedes the TSD.
Given the pgd-file:
chasis="123456789"
hobo="AAAAAAAAA"
d=hobo-50%3bp
5 d
we obtain:
12345TTCTAATGT3456789
Nested insertions must be provided in curly brackets after the strand (+ or -).
Given the pgd-file:
chasis="123456789"
hobo="AACC"
roo="TTT"
d=hobo+{2:roo}
2 d
we obtain:
12AATTTCC3456789
Here roo will be inserted into hobo at position 2. The position always refers to the parent sequence!
Note that the special symbols ^ | $ are also permitted for the position of nested insertions.
Nested insertions are integrated into the parent sequence after reverse complementing the parent. Note that this behaviour differs from internal truncations: truncated regions are removed before reverse complementing the parent sequence. The rationale is that truncated TEs are specific derivatives of a TE family that may occur at high abundance within genomes, whereas nested insertions are unique events that mostly occur only once (a nested insertion usually inactivates a TE). Hence, we aimed to make the description of nested insertions as intuitive and easy as possible, and we argue that inserting after reverse complementing is easier to understand, especially with multiple deeply nested insertions.
For example, given the pgd-file:
chasis="123456789"
hobo="AACC"
roo="TTT"
d_plus=hobo+{1:roo}
d_minus=hobo-{1:roo}
2 d_plus
5 d_minus
we obtain:
12ATTTACC345GTTTGTT6789
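The rule that the parent is oriented first and the child inserted afterwards can be sketched as follows. The helper nest is hypothetical and only covers a single, plus-stranded child without a TSD.

```python
# Illustrative sketch only; nest is a hypothetical helper.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def nest(parent: str, strand: str, pos: int, child: str) -> str:
    """Orient the parent first, then insert the child after base pos."""
    oriented = parent if strand == "+" else reverse_complement(parent)
    return oriented[:pos] + child + oriented[pos:]


d_plus = nest("AACC", "+", 1, "TTT")   # hobo+{1:roo}
d_minus = nest("AACC", "-", 1, "TTT")  # hobo-{1:roo}
print(d_plus)   # ATTTACC
print(d_minus)  # GTTTGTT

chassis = "123456789"
print(chassis[:2] + d_plus + chassis[2:5] + d_minus + chassis[5:])
# 12ATTTACC345GTTTGTT6789
```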
The strand of a nested insertion may of course also be specified.
Given the pgd-file:
chasis="123456789"
hobo="AACC"
roo="TTT"
d_plus=hobo+{1:roo+}
d_minus=hobo+{1:roo-}
2 d_plus
5 d_minus
we obtain:
12ATTTACC345AAAAACC6789
Nested insertions may also produce a TSD.
Given the pgd-file:
chasis="123456789"
hobo="atcg"+1bp
roo="TTT"+2bp
d=hobo+{2:roo+}
5 d
we obtain:
12345atTTTatcg56789
Note that the insertion of roo resulted in a duplication of the hobo sequence "at", while the insertion of hobo resulted in a duplication of the base "5" of the chassis.
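The two levels of duplication can be traced with one small function that is applied twice: first to insert roo into hobo, then to insert the result into the chassis. The helper insert_one is hypothetical and handles a single plus-stranded insertion only.

```python
# Illustrative sketch only; insert_one is a hypothetical helper.
def insert_one(target: str, pos: int, seq: str, tsd: int) -> str:
    """Insert seq after base pos of target and duplicate the tsd bases left of the site."""
    duplicated = target[max(0, pos - tsd):pos]
    return target[:pos] + seq + duplicated + target[pos:]


# roo (TSD of 2bp) is nested into hobo at position 2: "at" is duplicated
d = insert_one("atcg", 2, "TTT", 2)
print(d)  # atTTTatcg

# d inherits the 1bp TSD of hobo and is inserted into the chassis: "5" is duplicated
print(insert_one("123456789", 5, d, 1))  # 12345atTTTatcg56789
```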
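The TSD of a TE that carries nested insertions may be provided by adding e.g. 3bp after the closing curly bracket of the nested insertions. In the following example, the three chassis bases 345 are duplicated.
Given the pgd-file: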
chasis="123456789"
hobo="atcg"
roo="TTTT"
d=hobo+{2:roo+}3bp
5 d
we obtain:
12345atTTTTcg3456789
Nested insertions and sequence divergence may be combined. Note that random mutations are only introduced into the parent sequence, not into the nested child insertions. This allows more fine-grained control and avoids multiple rounds of mutation for nested sequences.
Given the pgd-file:
chasis="123456789"
hobo="AAAAAAAAAA"
pelement="CCCCCCCCCC"
d=hobo+{5:pelement+}50%
5 d
we obtain:
12345CAACACCCCCCCCCCGGAAA6789
Nested insertions and truncations may be combined. Given the pgd-file:
chasis="123456789"
hobo="TTtttAA"
pelement="CCCC"
d=hobo[3..5]+{3:pelement+}
5 d
we obtain
12345TTACCCCA6789
Note: first the truncated sequence is generated (TTtttAA -> TTAA), and second the child TE is inserted into the parent sequence. Thus the insertion position of the child TE (position 3 of hobo) refers to the truncated sequence, i.e. the deleted bases are not counted.
Several nested insertions may be specified; they must be separated by a comma (,).
Given the pgd-file:
chasis="123456789"
hobo="atcg"
roo="TTT"
d=hobo+{1:roo+,3:roo-}
5 d
we obtain:
12345aTTTtcAAAg6789
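Because every position refers to the parent sequence, several children can be placed by inserting them from right to left. The sketch below assumes exactly that; nest_many is a hypothetical helper and ignores TSDs.

```python
# Illustrative sketch only; nest_many is a hypothetical helper.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def nest_many(parent: str, children: list[tuple[int, str, str]]) -> str:
    """Insert (position, sequence, strand) children; positions refer to the parent."""
    out = parent
    # Right to left, so an inserted child never shifts the position of another one.
    for pos, seq, strand in sorted(children, reverse=True):
        oriented = seq if strand == "+" else reverse_complement(seq)
        out = out[:pos] + oriented + out[pos:]
    return out


# hobo+{1:roo+,3:roo-}
d = nest_many("atcg", [(1, "TTT", "+"), (3, "TTT", "-")])
print(d)  # aTTTtcAAAg
```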
Our domain-specific language also allows recursively nested insertions to be specified (up to an arbitrary depth).
For example, given the pgd-file:
chasis="123456789"
hobo="atcg"
roo="TTTT"
pelement="CCC"
d=hobo+{2:roo+{3:pelement+}}
5 d
we obtain:
12345atTTTCCCTcg6789
Here we have a pelement inserted into a roo at position 3 (of roo). This roo is in turn inserted into hobo (at position 2 of hobo).
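Recursive nesting maps naturally onto a small tree structure that is resolved from the innermost child outwards. The class TE and the function build below are hypothetical and deliberately minimal: only plus-stranded insertions without TSDs, deletions or divergence are covered.

```python
from dataclasses import dataclass, field


# Illustrative sketch only; TE and build are hypothetical names.
@dataclass
class TE:
    seq: str
    # (position, TE) pairs; each position refers to this TE's own sequence
    children: list = field(default_factory=list)


def build(te: TE) -> str:
    """Resolve all nested insertions of te recursively."""
    out = te.seq
    # Right to left, so positions keep referring to the unmodified parent.
    for pos, child in sorted(te.children, key=lambda pc: pc[0], reverse=True):
        out = out[:pos] + build(child) + out[pos:]
    return out


pelement = TE("CCC")
roo = TE("TTTT", [(3, pelement)])  # roo+{3:pelement+}
hobo = TE("atcg", [(2, roo)])      # hobo+{2:roo+{3:pelement+}}

chassis = "123456789"
result = chassis[:5] + build(hobo) + chassis[5:]
print(result)  # 12345atTTTCCCTcg6789
```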
In the final example we demonstrate that deletions, nested insertions, sequence divergence and deeply nested insertions may be combined in arbitrary ways.
For example, given the pgd-file:
chasis="123456789"
hobo="aattccgg"+2bp
roo="TTTT"-1bp
pelement="CCCC"
d=hobo[3..4]+{1:roo[3..$],3:pelement+{2:roo-2bp}}3bp
1 roo
5 d
8 hobo
we obtain:
1AAAA12345aAAaacCCTTTTCCCCcgg345678aattccgg789
Note that the characters 0123456789 and atcg (instead of uppercase ATCG) have only been used for illustrative purposes. Only the characters ATCG are allowed for real data.
Wiki: Home
Wiki: TheClassic_SanMiguel_TELandscape
Wiki: Walkthrough
Wiki: describing_TE_landscapes
Wiki: special_use_cases