Manna Wiki

Brought to you by: filwierz, rokofler, sarahsignor

validation_fixed

Authors:

Attachments

val1.mhp.png (264018 bytes)

Introduction
Validation

Introduction

Here we validate the entire pipeline for comparing sequences of piRNA clusters (or other regions of interest), i.e. we first simulate DNA sequences with known TE insertions, than annotate repeats (TEs) in these sequences with RepeatMasker, align the resulting repeat-annotation with Manna, and finally compare the observed vs the expected TE landscape.

In more detail:
1) We first simulate for a particular region (e.g piRNA cluster 42AB) a population sample of 5 individuals. That is we will generate 5 fasta sequences with TE insertions. Here we simulate 2 fixed insertions for 123 TE family found in Drosophila. These sequences are simulated with our previous tool SimulaTE
This validation has four stepshttps://sourceforge.net/p/simulates/wiki/Home/
2) We run Repeatmasker to annotate the TEs in these 5 sequences.
3) We perform a Manna alignment with the Repeat annotations.
4) We compare the observed with the expected results. In particular we check if all 246 TE insertions are fixed and if the order of the TE sequences is correct.

Validation

Data

the chasis, ie. a raw sequence into which TEs will be inserted by SimulaTE. The chasis has a length of 540kb. https://sourceforge.net/projects/manna/files/validation/chasis.txt
the TE sequences https://sourceforge.net/projects/manna/files/validation/teseq.fasta
the pgd (population genome definition file) for SimulaTE https://sourceforge.net/projects/manna/files/validation/val1-fix/fix.pgd

In addition to Manna it is alos necessary to download SimulaTE.

The population genome definition

To simulate sequences with TE insertions with SimulaTE we need a pgd (population genome definition file)
Here we specify that the sequences should have 2 fixed insertions for each of the 123 TE families. For example:

....
118=$118     # HEL
119=$119     # TC3
120=$120     # Beagle2
121=$121     # Q
122=$122     # OSV
123=$123     # DME542581
1148 27 27 27 27 27
2492 107 107 107 107 107
5147 29 29 29 29 29
5655 112 112 112 112 112
...

Note that the position of the TE insertion is random within the chasis (540kb). For more details on the pgd file see the Wiki of SimulaTE https://sourceforge.net/p/simulates/wiki/Home/

Simulating 5 chromosomes with fixed TE insertions

Using the pgd file, the chasis and the TE sequences we can now generate a multifasta file containing 5 entries. Each entry will represent the sequence of a single individuals(strain) from our popualtion sample.

python ~/dev/simulates/build-population-genome.py --chassis chasis.txt --pgd fix.pgd --te-seqs teseq.fasta --pgd fix.pgd --output fix.fasta

RepeatMask

We can use RepeatMasker to identify TE sequences in all 5 sequences at the same time

RepeatMasker --frag 2000000 -pa 5 -no_is -s -nolow -dir . -lib teseq.fasta fix.fasta

Only the .out file is of interest to us https://sourceforge.net/projects/manna/files/validation/val1-fix/val1-fix.fasta.out
An example:

...
 9809    0.0  0.0  0.0  hg1       1682568 1683631    (5431) + AF541951     Unspecified      1   1064    (0)  257  
 5283    0.0  0.0  0.0  hg1       1684094 1684657    (4405) + HELITRON1_DM Unspecified      1    564    (0)  258  
26677    0.0  0.0  0.0  hg1       1685434 1688273     (789) + G5A          Unspecified      1   2840    (0)  259  
51265    0.0  0.0  0.0  hg2          1149    6609 (1682453) + DMDM11       Unspecified      1   5461    (0)  260  
30111    0.0  0.0  0.0  hg2          7954   11166 (1677896) + R1-2         Unspecified      1   3213    (0)  261  
19487    0.0  0.0  0.0  hg2         13822   15942 (1673120) + DMPOGOR11    Unspecified      1   2121    (0)  262  
45260    0.0  0.0  0.0  hg2         16451   21405 (1667657) + GYPSY8       Unspecified      1   4955    (0)  263  
66255    0.0  0.0  0.0  hg2         26650   34060 (1655002) + DME9736      Unspecified      1   7411    (0)  264  
...

Note that we can infer the query sequence from column 5 of the RepeatMasker output. In our case different query sequences refer to different individuals of a population, e.g. hg1 would be the first fly, hg2 the second fly, etc

Manna alignment

We can now directly use the RepeatMasker output with Manna, in order to generate a progressive iterative alignment of the repeat annotations of all 5 samples (hg1, hg2, hg3, hg4, hg5).

python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm fix.fasta.out > val1.manna

This yields the following file https://sourceforge.net/projects/manna/files/validation/val1-fix/val1.manna

For example:

...
DMPOGOR11       2121.0  0.0     DMPOGOR11       2121.0  0.0     DMPOGOR11       2121.0  0.0     DMPOGOR11       2121.0  0.0     DMPOGOR11       2121.0  0.0
GYPSY8  4955.0  0.0     GYPSY8  4955.0  0.0     GYPSY8  4955.0  0.0     GYPSY8  4955.0  0.0     GYPSY8  4955.0  0.0
DME9736 7411.0  0.0     DME9736 7411.0  0.0     DME9736 7411.0  0.0     DME9736 7411.0  0.0     DME9736 7411.0  0.0
DME487856       8556.0  0.0     DME487856       8556.0  0.0     DME487856       8556.0  0.0     DME487856       8556.0  0.0     DME487856       8556.0  0.0
GYPSY4  6852.0  0.0     GYPSY4  6852.0  0.0     GYPSY4  6852.0  0.0     GYPSY4  6852.0  0.0     GYPSY4  6852.0  0.0
INVADER3        5484.0  0.0     INVADER3        5484.0  0.0     INVADER3        5484.0  0.0     INVADER3        5484.0  0.0     INVADER3        5484.0  0.0
FROGGER 2483.0  0.0     FROGGER 2483.0  0.0     FROGGER 2483.0  0.0     FROGGER 2483.0  0.0     FROGGER 2483.0  0.0
DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0
...

A small problem with DOC2 and DOC4

We simulated a DOC4 insertion with 0% divergence (col 2 in RepeatMasker output). We realized that DOC2 has a small region that has a high similarity to DOC4 (23.1% divergence). Hence RepeatMasker annotates overlapping DOC2 and DOC4 insertions

387   23.1  6.5  2.7  hg1        622046  622152 (1066910) + DOC2         Unspecified    269    379 (4410)   93 *
25999    0.0  0.0  0.0  hg1        622061  624844 (1064218) + DOC4         Unspecified      1   2784    (0)   94

This does not lead to a problem with the Manna alignment. Here the example for the 5 strains.

DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1
DOC4    2784.0  0.0     DOC4    2784.0  0.0     DOC4    2784.0  0.0     DOC4    2784.0  0.0     DOC4    2784.0  0.0

It would however result in overestimating the abundance of DOC2 in the piRNA cluster (or region of interest). In our simple example we would estimate that the simulated DNA sequence contains three DOC2 insertions instead of the simulated two. For example:

cat val1.manna|grep "DOC2"
DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0
DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1    DOC2    107.0   23.1
DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0 DOC2    4789.0  0.0

To avoid this problem we may exlcude sequences with a high divergence or very short fragments of TEs. In these validations we use a maximum divergence of 5% (0% was simulated) and a minimum length of 100bp.

Results - expected vs observed TE landscape

To validate our approach we will now compare the expected (the pgd-file) with the observed (manna) TE composition in the population.

python ~/dev/manna/validation/manna-vs-pgd-mhp.py --min-len 100 --max-div 5 --manna val1.manna --pgd fix.pgd > val1.mhp

An example from the resulting file:

...
50086 DMZAM 5 expected
50086 DMZAM 5 observed
50420 DIVER2 5 expected
50420 DIVER2 5 observed
...

Which has the position of the TE (col 1) the family (col 2) the population count of the TE (col 3) and a flag indicating whether the count refers to the observed (Manna) or expected (pgd) value.

Finally this file can be visualized with R

R --vanilla --args val1.mhp < ~/dev/manna/validation/manhatten.R

Here is the output file

Conclusion

We simulated 246 fixed TE insertions in a population sample of 5 (expected). We simulated 2 fixed insertions for 123 TE families found in Drosophila.
The output file (above) shows that population frequency of all TE insertions was correctly estimated by our appraoch (i.e. RepeatMasking + Manna).
Furthermore also the order of the TE insertions was correctly estimated (the Python script manna-vs-pgd-mhp.py would not find the observed values if the order is incorrect).

Wiki: Home
Wiki: validation_fix_seg
Wiki: validation_linearlandscape
Wiki: validation_mixfamfreq