Here we validate the entire pipeline for comparing sequences of piRNA clusters (or other regions of interest), i.e. we first simulate DNA sequences with known TE insertions, than annotate repeats (TEs) in these sequences with RepeatMasker, align the resulting repeat-annotation with Manna, and finally compare the observed vs the expected TE landscape.
In more detail:
1) We first simulate for a particular region (e.g piRNA cluster 42AB) a population sample of 5 individuals. That is we will generate 5 fasta sequences with TE insertions. Here we simulate 2 fixed insertions for 123 TE family found in Drosophila. These sequences are simulated with our previous tool SimulaTE
This validation has four stepshttps://sourceforge.net/p/simulates/wiki/Home/
2) We run Repeatmasker to annotate the TEs in these 5 sequences.
3) We perform a Manna alignment with the Repeat annotations.
4) We compare the observed with the expected results. In particular we check if all 246 TE insertions are fixed and if the order of the TE sequences is correct.
In addition to Manna it is alos necessary to download SimulaTE.
To simulate sequences with TE insertions with SimulaTE we need a pgd (population genome definition file)
Here we specify that the sequences should have 2 fixed insertions for each of the 123 TE families. For example:
....
118=$118 # HEL
119=$119 # TC3
120=$120 # Beagle2
121=$121 # Q
122=$122 # OSV
123=$123 # DME542581
1148 27 27 27 27 27
2492 107 107 107 107 107
5147 29 29 29 29 29
5655 112 112 112 112 112
...
Note that the position of the TE insertion is random within the chasis (540kb). For more details on the pgd file see the Wiki of SimulaTE https://sourceforge.net/p/simulates/wiki/Home/
Using the pgd file, the chasis and the TE sequences we can now generate a multifasta file containing 5 entries. Each entry will represent the sequence of a single individuals(strain) from our popualtion sample.
python ~/dev/simulates/build-population-genome.py --chassis chasis.txt --pgd fix.pgd --te-seqs teseq.fasta --pgd fix.pgd --output fix.fasta
We can use RepeatMasker to identify TE sequences in all 5 sequences at the same time
RepeatMasker --frag 2000000 -pa 5 -no_is -s -nolow -dir . -lib teseq.fasta fix.fasta
Only the .out file is of interest to us https://sourceforge.net/projects/manna/files/validation/val1-fix/val1-fix.fasta.out
An example:
...
9809 0.0 0.0 0.0 hg1 1682568 1683631 (5431) + AF541951 Unspecified 1 1064 (0) 257
5283 0.0 0.0 0.0 hg1 1684094 1684657 (4405) + HELITRON1_DM Unspecified 1 564 (0) 258
26677 0.0 0.0 0.0 hg1 1685434 1688273 (789) + G5A Unspecified 1 2840 (0) 259
51265 0.0 0.0 0.0 hg2 1149 6609 (1682453) + DMDM11 Unspecified 1 5461 (0) 260
30111 0.0 0.0 0.0 hg2 7954 11166 (1677896) + R1-2 Unspecified 1 3213 (0) 261
19487 0.0 0.0 0.0 hg2 13822 15942 (1673120) + DMPOGOR11 Unspecified 1 2121 (0) 262
45260 0.0 0.0 0.0 hg2 16451 21405 (1667657) + GYPSY8 Unspecified 1 4955 (0) 263
66255 0.0 0.0 0.0 hg2 26650 34060 (1655002) + DME9736 Unspecified 1 7411 (0) 264
...
Note that we can infer the query sequence from column 5 of the RepeatMasker output. In our case different query sequences refer to different individuals of a population, e.g. hg1 would be the first fly, hg2 the second fly, etc
We can now directly use the RepeatMasker output with Manna, in order to generate a progressive iterative alignment of the repeat annotations of all 5 samples (hg1, hg2, hg3, hg4, hg5).
python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm fix.fasta.out > val1.manna
This yields the following file https://sourceforge.net/projects/manna/files/validation/val1-fix/val1.manna
For example:
...
DMPOGOR11 2121.0 0.0 DMPOGOR11 2121.0 0.0 DMPOGOR11 2121.0 0.0 DMPOGOR11 2121.0 0.0 DMPOGOR11 2121.0 0.0
GYPSY8 4955.0 0.0 GYPSY8 4955.0 0.0 GYPSY8 4955.0 0.0 GYPSY8 4955.0 0.0 GYPSY8 4955.0 0.0
DME9736 7411.0 0.0 DME9736 7411.0 0.0 DME9736 7411.0 0.0 DME9736 7411.0 0.0 DME9736 7411.0 0.0
DME487856 8556.0 0.0 DME487856 8556.0 0.0 DME487856 8556.0 0.0 DME487856 8556.0 0.0 DME487856 8556.0 0.0
GYPSY4 6852.0 0.0 GYPSY4 6852.0 0.0 GYPSY4 6852.0 0.0 GYPSY4 6852.0 0.0 GYPSY4 6852.0 0.0
INVADER3 5484.0 0.0 INVADER3 5484.0 0.0 INVADER3 5484.0 0.0 INVADER3 5484.0 0.0 INVADER3 5484.0 0.0
FROGGER 2483.0 0.0 FROGGER 2483.0 0.0 FROGGER 2483.0 0.0 FROGGER 2483.0 0.0 FROGGER 2483.0 0.0
DMRER1DM 5356.0 0.0 DMRER1DM 5356.0 0.0 DMRER1DM 5356.0 0.0 DMRER1DM 5356.0 0.0 DMRER1DM 5356.0 0.0
...
We simulated a DOC4 insertion with 0% divergence (col 2 in RepeatMasker output). We realized that DOC2 has a small region that has a high similarity to DOC4 (23.1% divergence). Hence RepeatMasker annotates overlapping DOC2 and DOC4 insertions
387 23.1 6.5 2.7 hg1 622046 622152 (1066910) + DOC2 Unspecified 269 379 (4410) 93 *
25999 0.0 0.0 0.0 hg1 622061 624844 (1064218) + DOC4 Unspecified 1 2784 (0) 94
This does not lead to a problem with the Manna alignment. Here the example for the 5 strains.
DOC2 107.0 23.1 DOC2 107.0 23.1 DOC2 107.0 23.1 DOC2 107.0 23.1 DOC2 107.0 23.1
DOC4 2784.0 0.0 DOC4 2784.0 0.0 DOC4 2784.0 0.0 DOC4 2784.0 0.0 DOC4 2784.0 0.0
It would however result in overestimating the abundance of DOC2 in the piRNA cluster (or region of interest). In our simple example we would estimate that the simulated DNA sequence contains three DOC2 insertions instead of the simulated two. For example:
cat val1.manna|grep "DOC2"
DOC2 4789.0 0.0 DOC2 4789.0 0.0 DOC2 4789.0 0.0 DOC2 4789.0 0.0 DOC2 4789.0 0.0
DOC2 107.0 23.1 DOC2 107.0 23.1 DOC2 107.0 23.1 DOC2 107.0 23.1 DOC2 107.0 23.1
DOC2 4789.0 0.0 DOC2 4789.0 0.0 DOC2 4789.0 0.0 DOC2 4789.0 0.0 DOC2 4789.0 0.0
To avoid this problem we may exlcude sequences with a high divergence or very short fragments of TEs. In these validations we use a maximum divergence of 5% (0% was simulated) and a minimum length of 100bp.
To validate our approach we will now compare the expected (the pgd-file) with the observed (manna) TE composition in the population.
python ~/dev/manna/validation/manna-vs-pgd-mhp.py --min-len 100 --max-div 5 --manna val1.manna --pgd fix.pgd > val1.mhp
An example from the resulting file:
...
50086 DMZAM 5 expected
50086 DMZAM 5 observed
50420 DIVER2 5 expected
50420 DIVER2 5 observed
...
Which has the position of the TE (col 1) the family (col 2) the population count of the TE (col 3) and a flag indicating whether the count refers to the observed (Manna) or expected (pgd) value.
Finally this file can be visualized with R
R --vanilla --args val1.mhp < ~/dev/manna/validation/manhatten.R
Here is the output file
We simulated 246 fixed TE insertions in a population sample of 5 (expected). We simulated 2 fixed insertions for 123 TE families found in Drosophila.
The output file (above) shows that population frequency of all TE insertions was correctly estimated by our appraoch (i.e. RepeatMasking + Manna).
Furthermore also the order of the TE insertions was correctly estimated (the Python script manna-vs-pgd-mhp.py would not find the observed values if the order is incorrect).
Wiki: Home
Wiki: validation_fix_seg
Wiki: validation_linearlandscape
Wiki: validation_mixfamfreq