To generate more realistic haplotypes for the population samples, we simulate a neutral coalescent with msms.
In addition to the segregating SNPs, we introduce a number of fixed insertions at random positions.
We used the following command to perform neutral coalescent simulations with msms (https://www.mabs.at/ewing/msms/index.shtml):
java -jar msms.jar 5 1 -t 50
We use only the haplotype part of the output:
11100101010101010011001001001010000100000000001100000011010001011000011100011101000100101001101010
11011111000001110000001000110011000001110000010100001101000001011001011111011000110100101010100000
00000000101010001100010100000100111010001100100010010000101110100000000000100010000001010100010101
11100101000001010000001010000010000100000011000100100011000001011000111100011100001110101001101010
11100101000001010000101000000010000100000000000101000011000001011110111100011100000110101001101010
Note that for our purposes 1 indicates the presence of a TE and 0 the absence.
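The 0/1 coding above can be read into a small presence/absence matrix. The following is a minimal sketch, not part of msms or of our scripts: the function names parse_haplotypes and site_counts are hypothetical, and we assume that every haplotype is a line consisting only of the characters 0 and 1 (all other lines are skipped).

```python
from typing import List


def parse_haplotypes(lines: List[str]) -> List[List[int]]:
    """Convert lines made up only of '0' and '1' into lists of integers.

    Hypothetical helper: any line containing other characters is skipped.
    """
    haplotypes = []
    for line in lines:
        line = line.strip()
        if line and set(line) <= {"0", "1"}:
            haplotypes.append([int(c) for c in line])
    if len({len(h) for h in haplotypes}) > 1:
        raise ValueError("haplotypes differ in length")
    return haplotypes


def site_counts(haplotypes: List[List[int]]) -> List[int]:
    """Number of haplotypes carrying the insertion (state 1) at each site."""
    return [sum(column) for column in zip(*haplotypes)]


# Toy input with three haplotypes and four segregating sites
toy = ["segsites: 4", "1100", "1010", "0001"]
print(site_counts(parse_haplotypes(toy)))  # [2, 1, 1, 1]
```

Keeping the full matrix rather than only the per-site counts is what preserves the haplotype information in the later steps.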
To use the haplotypes simulated by msms with SimulaTE, we need to translate the haplotypes into a pgd file. Furthermore, we may introduce a variable number of fixed insertions. Note that this approach retains the haplotype information (and not just the allele frequency).
The pgd-template can be found here https://sourceforge.net/projects/manna/files/validation/template.pgd
The other files are available in the folder https://sourceforge.net/projects/manna/files/validation/val3-msms/
python pgdg-msms.py --msms states.txt --fixed 100 --template template.pgd > coal.pgd
We have 100 fixed TE insertions.
The resulting coal.pgd may, for example, look like this:
....
424002 * * 57 * *
424321 73 73 73 73 73
428975 24 * * 24 24
430151 * * 2 * *
432074 * * 40 * *
434002 99 99 99 99 99
438997 21 21 21 21 21
443945 12 * * 12 12
444237 * 48 * * *
450511 25 25 25 25 25
451394 101 101 101 101 101
452356 120 120 120 120 120
455739 11 11 * 11 11
456264 * * 101 * *
462011 79 79 79 79 79
464137 * * 35 * *
468605 78 78 * 78 78
477569 * * * 98 *
...
Note the presence of both, fixed and segregating insertions. The segregating insertions follow the haplotypes simulated with msms.
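To check which rows of such an excerpt are fixed and which are segregating, a few lines of Python suffice. This is a minimal sketch under the assumption that each site row has the layout shown above (position first, then one entry per haplotype, with a star marking an absent insertion); the function name classify_pgd_row is hypothetical, and header lines of a real pgd file are not handled.

```python
def classify_pgd_row(line):
    """Return (position, kind, frequency) for one site row.

    Assumes the layout of the excerpt above: the first field is the position,
    the remaining fields are one entry per haplotype, '*' meaning absent.
    """
    fields = line.split()
    position, states = int(fields[0]), fields[1:]
    present = sum(1 for s in states if s != "*")
    if present == len(states):
        kind = "fixed"
    elif present == 0:
        kind = "absent"
    else:
        kind = "segregating"
    return position, kind, present / len(states)


for row in ["424321 73 73 73 73 73", "428975 24 * * 24 24", "430151 * * 2 * *"]:
    print(classify_pgd_row(row))
# (424321, 'fixed', 1.0)
# (428975, 'segregating', 0.6)
# (430151, 'segregating', 0.2)
```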
Next we generate the fasta sequences for the 5 haplotypes from the pgd file:
python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --te-seqs ../teseq.fasta --pgd coal.pgd --output coal.fasta
We run RepeatMasker
RepeatMasker --frag 2000000 -pa 5 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta coal.fasta &
and perform a multiple alignment with Manna
python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm coal.fasta.out > coal.manna
In this complex scenario it is difficult to analyse the output automatically, e.g. with a script. In fact, we would probably need an alignment algorithm to compare the observed and the expected output. We therefore opted to show both alignments next to each other, which allows an intuitive comparison of the expected and the observed alignment.
Each row is a TE insertion. The first 6 columns are the expected alignment (PGD; with stars indicating absent insertions; the very first column is the position in the chassis). The last five columns are the observed alignment (Manna; with dashes indicating the absence of TEs). The second row indicates the sample ID. Note that the ordering of the samples changes during a progressive alignment (the most closely related samples are aligned first). To enhance visibility, we suggest copying these results into Excel or Google Sheets.
EXPECTED (PGD) OBSERVED (manna)
SampleID hg1 hg2 hg3 hg4 hg5 hg4 hg5 hg1 hg2 hg3
3022 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4
4151 * * G3 * * - - - - G3
7184 FW3 FW3 FW3 FW3 FW3 FW3 FW3 FW3 FW3 FW3
9552 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297
9598 DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA
9924 IVK IVK IVK IVK IVK IVK IVK IVK IVK IVK
10976 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11
11817 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653
11980 * INE1 * * * - - - - FROGGER
17742 * * FROGGER * * - - - - DMDM11
18360 * INVADER * * * - - - INE1 -
21731 * ACCORD * * * - - - INVADER -
22874 * * DMDM11 * * - - - ACCORD -
23214 * INVADER2 * * * - - - INVADER2 -
25366 ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR
25535 * * GTWIN * * - - - - GTWIN
25988 DME487856 DME487856 * DME487856 DME487856 DME487856 DME487856 DME487856 DME487856 -
26227 GYPSY5 * * * * - - GYPSY5 - -
26595 STALKER3 STALKER3 * STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 -
30233 DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A
30875 DMGYPF1A * * DMGYPF1A DMGYPF1A - - - DMU89994 -
31608 * DMU89994 * * * DMGYPF1A DMGYPF1A DMGYPF1A - -
34516 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049
41129 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5
44026 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684
44457 * * AF418572 * * - - - - AF418572
47122 ROVER ROVER ROVER ROVER ROVER ROVER ROVER ROVER ROVER ROVER
47916 JUAN JUAN * JUAN JUAN JUAN JUAN JUAN JUAN -
49098 AF418572 AF418572 * AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 -
49717 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653
53328 TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR
53788 SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER
65165 ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR
69574 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2
74728 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463
78299 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
78465 Tinker Tinker * Tinker Tinker Tinker Tinker Tinker Tinker -
84662 * INVADER4 * * * - - - INVADER4 -
86937 McCLINTOCK McCLINTOCK * McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK -
88009 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176
89407 * * BAGGINS * * - - - - BAGGINS
91526 IVK IVK IVK IVK IVK IVK IVK IVK IVK IVK
92638 * * GYPSY3 * * - - - - GYPSY3
94029 IVK IVK IVK IVK IVK IVK IVK IVK IVK IVK
94696 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2
97831 BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD
100036 * GYPSY12 * * * - - - - DME278684
102727 * * DME278684 * * - - - GYPSY12 -
103807 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8
104038 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3
105946 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3
108834 * * DMBLPP * * - - - - DMBLPP
109415 DMBARI1 * * * * - - DMBARI1 - -
111704 S2 S2 S2 S2 S2 S2 S2 S2 S2 S2
112092 OPUS OPUS OPUS OPUS OPUS OPUS OPUS OPUS OPUS OPUS
115183 * GYPSY4 * * * - - - GYPSY4 -
119099 * DOC5 * * * - - - DOC5 -
123241 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10
133622 * G6_DM * * * - - - - DMTNFB
137370 * * * * DMLINEJA - - - G6_DM -
144341 * GYPSY8 * * * - - - GYPSY8 -
145077 * * DMTNFB * * - DMLINEJA - - -
145727 * * * McCLINTOCK * McCLINTOCK - - - -
150146 RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C
154053 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3
154627 DMW1DOC DMW1DOC * DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC -
159086 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1
164031 DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA
168879 * * DOC3 * * - - - - DOC3
169092 S2 * * S2 S2 S2 S2 S2 - -
173398 TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR
173768 LOOPER1_DM * * * * - - LOOPER1_DM - -
176426 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2
180767 * GYPSY7 * * * - - - GYPSY7 -
181012 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298
181956 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1
183960 FB FB FB FB FB FB FB FB FB FB
186230 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
191147 DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP
191740 * G4_DM * * * - - - - DMIS176
193790 * INVADER5 * * * - - - G4_DM -
196046 * * DMIS176 * * - - - INVADER5 -
197919 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2
198496 STALKER3 STALKER3 * STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 -
201078 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736
201731 DME487856 * * DME487856 DME487856 - - - - G5_DM
202402 * * G5_DM * * - - - GYPSY8 -
203348 * GYPSY8 * * * DME487856 DME487856 DME487856 - -
214859 * * * DMIS176 * DMIS176 - - - -
216305 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1
216313 * * AF222049 * * - - - - AF222049
223151 DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM
223522 TC3 TC3 * TC3 TC3 TC3 TC3 TC3 TC3 -
225542 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1
233734 Beagle Beagle * Beagle Beagle Beagle Beagle Beagle Beagle -
241049 DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A
242756 QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO
243894 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3
244296 DMZAM DMZAM * DMZAM DMZAM DMZAM DMZAM DMZAM DMZAM -
247346 * * TABOR * * - - - - TABOR
249407 Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker
249569 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
249819 * * * * DMRTMGD1 - DMRTMGD1 - - -
252516 DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO
253816 * * TRANSIB3 * * - - - - TRANSIB3
255576 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11
258801 * * ACCORD * * - - - - ACCORD
261353 * * * * HOPPER2 - - - - STALKER2
262707 LOOPER1_DM * * * * - - - DMTN1731 -
266616 * DMTN1731 * * * - - LOOPER1_DM - -
267398 * * STALKER2 * * - HOPPER2 - - -
269657 DME010298 DME010298 * DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 -
271462 McCLINTOCK McCLINTOCK * McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK -
287721 * GYPSY7 * * * - - - GYPSY7 -
289621 * * DMIS297 * * - - - - DMIS297
296869 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298
300271 * M14653 * * * - - - M14653 -
300501 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7
301318 HEL HEL HEL HEL HEL HEL HEL HEL HEL HEL
303431 * * TRANSIB1 * * - - - - TRANSIB1
305874 * * * G6_DM G6_DM G6_DM G6_DM - - -
306279 DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP
312690 DMTHB1 DMTHB1 * DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 -
315761 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5
318747 * * DMIFACA * * - - - - DMIFACA
323746 TC1 TC1 TC1 TC1 TC1 TC1 TC1 TC1 TC1 TC1
325137 * * DMCOPIA * * - - - - DMCOPIA
325231 S2 S2 S2 S2 S2 S2 S2 S2 S2 S2
328443 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12
330136 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951
330876 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049
342517 DME9736 * * * * - - DME9736 - -
346889 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11
351166 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1
357098 * * * RT1C RT1C RT1C RT1C - - -
362583 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6
363111 * GYPSY9 * * * - - - GYPSY9 -
375153 * * * S2 * S2 - - - -
376951 DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG
379081 JUAN JUAN * JUAN JUAN JUAN JUAN JUAN JUAN -
379151 G5A G5A G5A G5A G5A G5A G5A G5A G5A G5A
382935 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2
388126 McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK -
398825 McCLINTOCK McCLINTOCK * McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK
404944 GYPSY11 * * * * - - GYPSY11 - -
406993 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3
407458 ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR
408777 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4
411977 * * INVADER2 * * - - - - INVADER2
417455 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
420126 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463
420518 * * DMRER1DM * * - - - - DMRER1DM
427586 FW3 * * * * - - - - TC1-2
428634 HOPPER2 * * * * - - FW3 - -
429999 * * TC1-2 * * - - HOPPER2 - -
433192 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11
433959 DMCOPIA * * DMCOPIA DMCOPIA DMCOPIA DMCOPIA DMCOPIA - -
433979 GYPSY11 * * GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 - -
442398 * * * INVADER2 * INVADER2 - - - -
445002 DMW1DOC DMW1DOC * DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC -
447801 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4
448097 * * G2 * * - - - - G2
454439 PPI251 * * * * - - PPI251 - -
455258 RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C
456628 * * TC1-2 * * - - - - TC1-2
460457 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9
466042 * * * DMIFACA * DMIFACA - - - -
466224 DME010298 DME010298 * DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 -
467218 Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker
468969 DME9736 DME9736 * DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 -
471716 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1
471976 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951
473252 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11
475050 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6
475591 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4
476267 Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker
476914 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4
478084 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5
481997 * * GYPSY9 * * - - - - GYPSY9
483554 DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA
489278 BS4 BS4 * BS4 BS4 BS4 BS4 BS4 BS4 -
489390 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11
490351 FROGGER * * FROGGER FROGGER FROGGER FROGGER FROGGER - -
492299 G7 G7 G7 G7 G7 G7 G7 G7 G7 G7
493417 JUAN JUAN JUAN JUAN JUAN JUAN JUAN JUAN JUAN JUAN
500584 * * IVK * * - - - - IVK
506326 FW2 FW2 * FW2 FW2 FW2 FW2 FW2 FW2 -
508802 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5
508805 GYPSY8 GYPSY8 * GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 -
511708 JOCKEY2 * * JOCKEY2 JOCKEY2 JOCKEY2 JOCKEY2 JOCKEY2 - -
512188 G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM
512631 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3
516072 * * 1360 * * - - - - 1360
520001 * * * * ROOA_LTR - ROOA_LTR - - -
524762 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049
529348 412 412 412 412 412 412 412 412 412 412
530434 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572
533360 RT1B RT1B RT1B RT1B RT1B RT1B RT1B RT1B RT1B RT1B
The observed and the expected alignment are remarkably similar. Only the ordering of neighbouring segregating insertions is not always accurately reproduced, but this is expected.
To see why this is expected, consider the two DNA sequences 'ATG' and 'ACG'. If we perform a pairwise alignment of these two sequences using a low gap penalty, we may get two equally valid alignments:
Alignment1:
A-TG
AC-G
Alignment2:
AT-G
A-CG
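The tie between the two alignments can be verified by scoring them column by column. The sketch below uses an arbitrary scoring scheme chosen only for illustration (match 2, mismatch -3, gap -1, i.e. gaps are cheap relative to mismatches); these values and the function name alignment_score are assumptions and not the parameters used by Manna.

```python
def alignment_score(a, b, match=2, mismatch=-3, gap=-1):
    """Sum of per-column scores for two aligned strings of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap       # one of the two sequences has a gap here
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("A-TG", "AC-G"))  # 2 (Alignment1)
print(alignment_score("AT-G", "A-CG"))  # 2 (Alignment2)
print(alignment_score("ATG", "ACG"))    # 1 (ungapped, one mismatch)
```

Both gapped alignments receive the same score and both beat the ungapped one, so the aligner has no basis for preferring one ordering of the two unmatched bases over the other. The same holds for two neighbouring TE insertions that are present in different samples.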
However, the information that matters for analysing TEs in piRNA clusters (or other repetitive regions), namely the population frequency of the different TE insertions, is accurately reproduced.
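One way to check this without relying on the row order is to reduce each alignment to a multiset of (TE family, set of carrier samples) and compare the two multisets. The following is a minimal sketch of that idea, not the procedure we used (we compared the tables by eye); the function name insertion_profile is hypothetical, and the rows are six neighbouring rows copied from the table above (positions 11980 to 23214).

```python
from collections import Counter


def insertion_profile(sample_ids, rows, absent):
    """Count insertions by (TE family, carrier samples), ignoring row order."""
    profile = Counter()
    for row in rows:
        families = {v for v in row if v != absent}
        if len(families) != 1:
            raise ValueError(f"expected one TE family per row, got {sorted(families)}")
        carriers = frozenset(s for s, v in zip(sample_ids, row) if v != absent)
        profile[(families.pop(), carriers)] += 1
    return profile


expected = insertion_profile(
    ["hg1", "hg2", "hg3", "hg4", "hg5"],
    [r.split() for r in ["* INE1 * * *", "* * FROGGER * *", "* INVADER * * *",
                         "* ACCORD * * *", "* * DMDM11 * *", "* INVADER2 * * *"]],
    absent="*",
)
observed = insertion_profile(
    ["hg4", "hg5", "hg1", "hg2", "hg3"],
    [r.split() for r in ["- - - - FROGGER", "- - - - DMDM11", "- - - INE1 -",
                         "- - - INVADER -", "- - - ACCORD -", "- - - INVADER2 -"]],
    absent="-",
)
print(expected == observed)  # True
```

Although the six rows appear in a different order in the two alignments, every insertion is found in the same sample, and hence at the same population frequency, in both.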
In these simulations all TE insertions were on the plus strand. In real samples, however, TEs may be on either strand. Manna takes the strand of TEs into account and will therefore never align a P-element on the plus strand with a P-element on the minus strand. Furthermore, Manna will only align TE fragments that overlap; it will thus not align a 5'-fragment of a TE (say the first 1000 bp of the P-element) with a 3'-fragment of the same TE (say the last 1000 bp of the P-element; the P-element has a length of 2907 bp).
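The two rules can be written down as a simple predicate. This is an illustrative sketch only and not Manna's actual code: the class TEFragment, the function may_align and the family label PPI251 used for the P-element are hypothetical choices, coordinates are assumed to be 1-based and inclusive within the TE sequence, and we assume that an overlap of a single base is sufficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TEFragment:
    family: str
    strand: str  # "+" or "-"
    start: int   # first covered position within the TE (1-based, inclusive)
    end: int     # last covered position within the TE (inclusive)


def may_align(a: TEFragment, b: TEFragment) -> bool:
    """True if both fragments share family and strand and overlap within the TE."""
    same_te = a.family == b.family and a.strand == b.strand
    overlap = a.start <= b.end and b.start <= a.end
    return same_te and overlap


five_prime = TEFragment("PPI251", "+", 1, 1000)      # first 1000 bp
three_prime = TEFragment("PPI251", "+", 1908, 2907)  # last 1000 bp
full_minus = TEFragment("PPI251", "-", 1, 2907)      # full length, minus strand
full_plus = TEFragment("PPI251", "+", 1, 2907)       # full length, plus strand

print(may_align(five_prime, three_prime))  # False: no overlap
print(may_align(full_plus, full_minus))    # False: different strands
print(may_align(five_prime, full_plus))    # True
```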