To generate more realistic haplotypes for the population samples, we simulate a neutral coalescent with msms.
In addition to the segregating SNPs, we introduce a number of fixed insertions at random positions.
We used the following command to perform neutral coalescent simulations with msms (https://www.mabs.at/ewing/msms/index.shtml):
java -jar msms.jar 5 1 -t 50
We only use the haplotype part of the output:
11100101010101010011001001001010000100000000001100000011010001011000011100011101000100101001101010
11011111000001110000001000110011000001110000010100001101000001011001011111011000110100101010100000
00000000101010001100010100000100111010001100100010010000101110100000000000100010000001010100010101
11100101000001010000001010000010000100000011000100100011000001011000111100011100001110101001101010
11100101000001010000101000000010000100000000000101000011000001011110111100011100000110101001101010
Note that for our purposes 1 indicates the presence of a TE and 0 the absence.
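The 0/1 encoding can be read directly into a presence/absence matrix. The following is a minimal sketch, assuming the haplotype lines have already been cut out of the msms output; the function names `parse_haplotypes` and `carrier_counts` are hypothetical and not part of msms, SimulaTE or Manna.

```python
def parse_haplotypes(lines):
    """Convert msms haplotype strings into rows of 0/1 integers.

    Each row is one haplotype, each column one segregating site
    (1 = TE present, 0 = TE absent).
    """
    haps = [[int(c) for c in line.strip()] for line in lines if line.strip()]
    if len({len(h) for h in haps}) > 1:
        raise ValueError("haplotypes differ in length")
    return haps


def carrier_counts(haps):
    """Number of haplotypes carrying the insertion at each site."""
    return [sum(column) for column in zip(*haps)]


# Toy input with three haplotypes and four sites (not the msms output above)
toy = ["1101", "1000", "0101"]
haps = parse_haplotypes(toy)
print(carrier_counts(haps))  # [2, 2, 0, 2]
```

Dividing each count by the number of haplotypes gives the population frequency of the insertion at that site.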
To use the haplotypes simulated by msms with SimulaTE, we need to translate the haplotypes into a pgd file. Furthermore, we may introduce a variable number of fixed insertions. Note that this approach retains the haplotype information (and not just the allele frequency).
The pgd template can be found here: https://sourceforge.net/projects/manna/files/validation/template.pgd
The other files are available in the folder https://sourceforge.net/projects/manna/files/validation/val3-msms/
python pgdg-msms.py --msms states.txt --fixed 100 --template template.pgd > coal.pgd
We have 100 fixed TE insertions.
The coal.pgd may now for example look like
....
424002 * * 57 * *
424321 73 73 73 73 73
428975 24 * * 24 24
430151 * * 2 * *
432074 * * 40 * *
434002 99 99 99 99 99
438997 21 21 21 21 21
443945 12 * * 12 12
444237 * 48 * * *
450511 25 25 25 25 25
451394 101 101 101 101 101
452356 120 120 120 120 120
455739 11 11 * 11 11
456264 * * 101 * *
462011 79 79 79 79 79
464137 * * 35 * *
468605 78 78 * 78 78
477569 * * * 98 *
...
Note the presence of both fixed and segregating insertions. The segregating insertions follow the haplotypes simulated with msms.
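How fixed and segregating insertions can be interleaved may be easier to see in code. The sketch below is not the actual pgdg-msms.py; it is a simplified illustration under several assumptions: positions and TE IDs are drawn at random, `build_rows`, `genome_len` and `n_families` are hypothetical names, and only the insertion rows (position followed by one cell per haplotype) are produced, not a complete pgd file.

```python
import random


def build_rows(haps, n_fixed, genome_len, n_families, seed=0):
    """Mix segregating sites (from a 0/1 haplotype matrix) with fixed sites.

    Returns a list of (position, cells), where cells holds one entry per
    haplotype: a TE ID (as a string) or '*' for an absent insertion.
    """
    rng = random.Random(seed)
    n_samples = len(haps)
    n_seg = len(haps[0]) if haps else 0
    total = n_seg + n_fixed
    # Distinct random positions, sorted along the chassis
    positions = sorted(rng.sample(range(1, genome_len + 1), total))
    # Randomly choose which of the sorted positions hold fixed insertions
    fixed_idx = set(rng.sample(range(total), n_fixed))
    rows = []
    seg_site = 0
    for i, pos in enumerate(positions):
        te_id = str(rng.randint(1, n_families))  # hypothetical TE ID
        if i in fixed_idx:
            cells = [te_id] * n_samples
        else:
            # The column of the haplotype matrix decides who carries the TE
            cells = [te_id if hap[seg_site] == 1 else "*" for hap in haps]
            seg_site += 1
        rows.append((pos, cells))
    return rows


# Three haplotypes, two segregating sites, three fixed insertions
for pos, cells in build_rows([[1, 0], [0, 0], [1, 1]], 3, 1000, 5, seed=1):
    print(pos, " ".join(cells))
```

Because each segregating row is filled from one column of the haplotype matrix, the linkage between sites is preserved, which is the point made above about retaining haplotype information.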
Next, we generate the FASTA sequences for the 5 haplotypes from the pgd file:
python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --te-seqs ../teseq.fasta --pgd coal.pgd --output coal.fasta
We run RepeatMasker
RepeatMasker --frag 2000000 -pa 5 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta coal.fasta &
and perform a multiple alignment with Manna
python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm coal.fasta.out > coal.manna
In this complex scenario it is difficult to analyse the output automatically, e.g. with a script. In fact, we would probably need an alignment algorithm to compare the observed and the expected output. We therefore opted to show both alignments next to each other, which allows an intuitive comparison of the expected and the observed alignment.
Each row is a TE insertion. The first 6 columns are the expected alignment (PGD; stars indicate absent insertions; the very first column is the position in the chassis). The last five columns are the observed alignment (Manna; dashes indicate the absence of TEs). The second row gives the sample IDs. Note that the ordering of the samples changes during a progressive alignment (the most closely related samples are aligned first). To improve readability, we suggest copying these results into Excel or Google Sheets.
EXPECTED (PGD) OBSERVED (manna)
SampleID hg1 hg2 hg3 hg4 hg5 hg4 hg5 hg1 hg2 hg3
3022 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4
4151 * * G3 * * - - - - G3
7184 FW3 FW3 FW3 FW3 FW3 FW3 FW3 FW3 FW3 FW3
9552 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297 DMIS297
9598 DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA
9924 IVK IVK IVK IVK IVK IVK IVK IVK IVK IVK
10976 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11
11817 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653
11980 * INE1 * * * - - - - FROGGER
17742 * * FROGGER * * - - - - DMDM11
18360 * INVADER * * * - - - INE1 -
21731 * ACCORD * * * - - - INVADER -
22874 * * DMDM11 * * - - - ACCORD -
23214 * INVADER2 * * * - - - INVADER2 -
25366 ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR
25535 * * GTWIN * * - - - - GTWIN
25988 DME487856 DME487856 * DME487856 DME487856 DME487856 DME487856 DME487856 DME487856 -
26227 GYPSY5 * * * * - - GYPSY5 - -
26595 STALKER3 STALKER3 * STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 -
30233 DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A
30875 DMGYPF1A * * DMGYPF1A DMGYPF1A - - - DMU89994 -
31608 * DMU89994 * * * DMGYPF1A DMGYPF1A DMGYPF1A - -
34516 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049
41129 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5
44026 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684 DME278684
44457 * * AF418572 * * - - - - AF418572
47122 ROVER ROVER ROVER ROVER ROVER ROVER ROVER ROVER ROVER ROVER
47916 JUAN JUAN * JUAN JUAN JUAN JUAN JUAN JUAN -
49098 AF418572 AF418572 * AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 -
49717 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653 M14653
53328 TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR
53788 SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER SPRINGER
65165 ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR
69574 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2 DIVER2
74728 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463
78299 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
78465 Tinker Tinker * Tinker Tinker Tinker Tinker Tinker Tinker -
84662 * INVADER4 * * * - - - INVADER4 -
86937 McCLINTOCK McCLINTOCK * McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK -
88009 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176 DMIS176
89407 * * BAGGINS * * - - - - BAGGINS
91526 IVK IVK IVK IVK IVK IVK IVK IVK IVK IVK
92638 * * GYPSY3 * * - - - - GYPSY3
94029 IVK IVK IVK IVK IVK IVK IVK IVK IVK IVK
94696 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2 STALKER2
97831 BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD BLOOD
100036 * GYPSY12 * * * - - - - DME278684
102727 * * DME278684 * * - - - GYPSY12 -
103807 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8
104038 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3
105946 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3 DOC3
108834 * * DMBLPP * * - - - - DMBLPP
109415 DMBARI1 * * * * - - DMBARI1 - -
111704 S2 S2 S2 S2 S2 S2 S2 S2 S2 S2
112092 OPUS OPUS OPUS OPUS OPUS OPUS OPUS OPUS OPUS OPUS
115183 * GYPSY4 * * * - - - GYPSY4 -
119099 * DOC5 * * * - - - DOC5 -
123241 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10 GYPSY10
133622 * G6_DM * * * - - - - DMTNFB
137370 * * * * DMLINEJA - - - G6_DM -
144341 * GYPSY8 * * * - - - GYPSY8 -
145077 * * DMTNFB * * - DMLINEJA - - -
145727 * * * McCLINTOCK * McCLINTOCK - - - -
150146 RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C
154053 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3
154627 DMW1DOC DMW1DOC * DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC -
159086 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1 TRANSIB1
164031 DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA
168879 * * DOC3 * * - - - - DOC3
169092 S2 * * S2 S2 S2 S2 S2 - -
173398 TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR TABOR
173768 LOOPER1_DM * * * * - - LOOPER1_DM - -
176426 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2
180767 * GYPSY7 * * * - - - GYPSY7 -
181012 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298
181956 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1
183960 FB FB FB FB FB FB FB FB FB FB
186230 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
191147 DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP
191740 * G4_DM * * * - - - - DMIS176
193790 * INVADER5 * * * - - - G4_DM -
196046 * * DMIS176 * * - - - INVADER5 -
197919 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2 INVADER2
198496 STALKER3 STALKER3 * STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 STALKER3 -
201078 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 DME9736
201731 DME487856 * * DME487856 DME487856 - - - - G5_DM
202402 * * G5_DM * * - - - GYPSY8 -
203348 * GYPSY8 * * * DME487856 DME487856 DME487856 - -
214859 * * * DMIS176 * DMIS176 - - - -
216305 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1
216313 * * AF222049 * * - - - - AF222049
223151 DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM DMRER1DM
223522 TC3 TC3 * TC3 TC3 TC3 TC3 TC3 TC3 -
225542 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1 DMHFL1
233734 Beagle Beagle * Beagle Beagle Beagle Beagle Beagle Beagle -
241049 DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A DMCR1A
242756 QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO QUASIMODO
243894 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3 TC3
244296 DMZAM DMZAM * DMZAM DMZAM DMZAM DMZAM DMZAM DMZAM -
247346 * * TABOR * * - - - - TABOR
249407 Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker
249569 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
249819 * * * * DMRTMGD1 - DMRTMGD1 - - -
252516 DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO DM_ROO
253816 * * TRANSIB3 * * - - - - TRANSIB3
255576 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11
258801 * * ACCORD * * - - - - ACCORD
261353 * * * * HOPPER2 - - - - STALKER2
262707 LOOPER1_DM * * * * - - - DMTN1731 -
266616 * DMTN1731 * * * - - LOOPER1_DM - -
267398 * * STALKER2 * * - HOPPER2 - - -
269657 DME010298 DME010298 * DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 -
271462 McCLINTOCK McCLINTOCK * McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK -
287721 * GYPSY7 * * * - - - GYPSY7 -
289621 * * DMIS297 * * - - - - DMIS297
296869 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 DME010298
300271 * M14653 * * * - - - M14653 -
300501 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7 GYPSY7
301318 HEL HEL HEL HEL HEL HEL HEL HEL HEL HEL
303431 * * TRANSIB1 * * - - - - TRANSIB1
305874 * * * G6_DM G6_DM G6_DM G6_DM - - -
306279 DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP DMBLPP
312690 DMTHB1 DMTHB1 * DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 -
315761 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5 INVADER5
318747 * * DMIFACA * * - - - - DMIFACA
323746 TC1 TC1 TC1 TC1 TC1 TC1 TC1 TC1 TC1 TC1
325137 * * DMCOPIA * * - - - - DMCOPIA
325231 S2 S2 S2 S2 S2 S2 S2 S2 S2 S2
328443 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12 GYPSY12
330136 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951
330876 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049
342517 DME9736 * * * * - - DME9736 - -
346889 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11 DMDM11
351166 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1 DMRTMGD1
357098 * * * RT1C RT1C RT1C RT1C - - -
362583 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6 INVADER6
363111 * GYPSY9 * * * - - - GYPSY9 -
375153 * * * S2 * S2 - - - -
376951 DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG DMREPG
379081 JUAN JUAN * JUAN JUAN JUAN JUAN JUAN JUAN -
379151 G5A G5A G5A G5A G5A G5A G5A G5A G5A G5A
382935 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2 HOPPER2
388126 McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK -
398825 McCLINTOCK McCLINTOCK * McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK McCLINTOCK
404944 GYPSY11 * * * * - - GYPSY11 - -
406993 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3 INVADER3
407458 ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR ROOA_LTR
408777 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4
411977 * * INVADER2 * * - - - - INVADER2
417455 HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM HELITRON1_DM
420126 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463 DM33463
420518 * * DMRER1DM * * - - - - DMRER1DM
427586 FW3 * * * * - - - - TC1-2
428634 HOPPER2 * * * * - - FW3 - -
429999 * * TC1-2 * * - - HOPPER2 - -
433192 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11
433959 DMCOPIA * * DMCOPIA DMCOPIA DMCOPIA DMCOPIA DMCOPIA - -
433979 GYPSY11 * * GYPSY11 GYPSY11 GYPSY11 GYPSY11 GYPSY11 - -
442398 * * * INVADER2 * INVADER2 - - - -
445002 DMW1DOC DMW1DOC * DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC DMW1DOC -
447801 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4 GYPSY4
448097 * * G2 * * - - - - G2
454439 PPI251 * * * * - - PPI251 - -
455258 RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C RT1C
456628 * * TC1-2 * * - - - - TC1-2
460457 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9 GYPSY9
466042 * * * DMIFACA * DMIFACA - - - -
466224 DME010298 DME010298 * DME010298 DME010298 DME010298 DME010298 DME010298 DME010298 -
467218 Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker
468969 DME9736 DME9736 * DME9736 DME9736 DME9736 DME9736 DME9736 DME9736 -
471716 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1 DMTHB1
471976 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951 AF541951
473252 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11
475050 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6 GYPSY6
475591 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4 INVADER4
476267 Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker Tinker
476914 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4 TRANSIB4
478084 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5 DOC5
481997 * * GYPSY9 * * - - - - GYPSY9
483554 DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA DMLINEJA
489278 BS4 BS4 * BS4 BS4 BS4 BS4 BS4 BS4 -
489390 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11 DMPOGOR11
490351 FROGGER * * FROGGER FROGGER FROGGER FROGGER FROGGER - -
492299 G7 G7 G7 G7 G7 G7 G7 G7 G7 G7
493417 JUAN JUAN JUAN JUAN JUAN JUAN JUAN JUAN JUAN JUAN
500584 * * IVK * * - - - - IVK
506326 FW2 FW2 * FW2 FW2 FW2 FW2 FW2 FW2 -
508802 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5 GYPSY5
508805 GYPSY8 GYPSY8 * GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 GYPSY8 -
511708 JOCKEY2 * * JOCKEY2 JOCKEY2 JOCKEY2 JOCKEY2 JOCKEY2 - -
512188 G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM G6_DM
512631 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3 TRANSIB3
516072 * * 1360 * * - - - - 1360
520001 * * * * ROOA_LTR - ROOA_LTR - - -
524762 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049 AF222049
529348 412 412 412 412 412 412 412 412 412 412
530434 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572 AF418572
533360 RT1B RT1B RT1B RT1B RT1B RT1B RT1B RT1B RT1B RT1B
The observed and the expected alignments are remarkably similar. Only the ordering of the segregating insertions is not always reproduced accurately, but this is expected.
To see why this is expected, consider the two DNA sequences 'ATG' and 'ACG'. If we perform a pairwise alignment of these two sequences with a low gap penalty, we may get two equally valid alignments:
Alignment1:
A-TG
AC-G
Alignment2:
AT-G
A-CG
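That the two alignments are equally good can be checked with a simple column-wise score. The sketch below uses illustrative scoring values (match +1, mismatch -1, gap 0); these are assumptions chosen to mimic a low gap penalty and are not the parameters used by Manna.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=0):
    """Sum of column scores for two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap       # column containing a gap
        elif x == y:
            total += match     # identical bases
        else:
            total += mismatch  # substitution
    return total


print(alignment_score("A-TG", "AC-G"))  # 2
print(alignment_score("AT-G", "A-CG"))  # 2
print(alignment_score("ATG", "ACG"))    # 1 (ungapped, one mismatch)
```

Under these values both gapped alignments score higher than the ungapped one and tie with each other, so the order of the T and the C columns is arbitrary. The same ambiguity applies to neighbouring segregating TE insertions.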
However, the information that matters for analysing TEs in piRNA clusters (or other repetitive regions), namely the population frequency of the different TE insertions, is accurately reproduced.
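An order-independent comparison of this kind can be sketched in a few lines. The code below assumes rows with 11 whitespace-separated fields as in the table above (position, five expected cells, five observed cells) and the two sample orders given in its header; the function names are hypothetical. It records, for each insertion, the TE family and the set of samples carrying it; the size of that set is the population frequency.

```python
from collections import Counter

EXPECTED_IDS = ["hg1", "hg2", "hg3", "hg4", "hg5"]
OBSERVED_IDS = ["hg4", "hg5", "hg1", "hg2", "hg3"]  # order after alignment


def carriers(cells, sample_ids, absent):
    """Return (family, frozenset of carrier sample IDs), or None if empty."""
    present = [(c, s) for c, s in zip(cells, sample_ids) if c != absent]
    if not present:
        return None
    if len({c for c, _ in present}) != 1:
        raise ValueError("row mixes several TE families")
    return present[0][0], frozenset(s for _, s in present)


def insertion_spectra(lines):
    """Count insertions by (family, carriers), ignoring row order."""
    expected, observed = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 11:
            raise ValueError("expected 11 fields per row")
        e = carriers(fields[1:6], EXPECTED_IDS, "*")
        o = carriers(fields[6:11], OBSERVED_IDS, "-")
        if e is not None:
            expected[e] += 1
        if o is not None:
            observed[o] += 1
    return expected, observed


# Five consecutive rows from the table above, where the row order differs
excerpt = [
    "11980 * INE1 * * * - - - - FROGGER",
    "17742 * * FROGGER * * - - - - DMDM11",
    "18360 * INVADER * * * - - - INE1 -",
    "21731 * ACCORD * * * - - - INVADER -",
    "22874 * * DMDM11 * * - - - ACCORD -",
]
expected, observed = insertion_spectra(excerpt)
print(expected == observed)  # True
```

For this excerpt the row order differs between the two alignments, yet every insertion is found in the same sample on both sides.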
In these simulations all TE insertions were on the plus strand. In real samples, however, TEs may be on either strand. Manna takes the strand of TEs into account; it will therefore never align a P-element on the plus strand with a P-element on the minus strand. Furthermore, Manna only considers overlapping sequences of TEs; it will thus not align a 5'-fragment of a TE (say the first 1000 bp of the P-element) with a 3'-fragment of the same TE (say the last 1000 bp of the P-element; the P-element has a length of 2907 bp).
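The two conditions described above can be written as a small predicate. This is only a sketch of the rule as stated in the text, not Manna's implementation: the `TEHit` record and `may_align` function are hypothetical, coordinates are 1-based inclusive positions within the TE consensus, and the requirement that both hits belong to the same family is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TEHit:
    family: str    # TE family name
    strand: str    # '+' or '-'
    te_start: int  # first covered position in the TE consensus (1-based)
    te_end: int    # last covered position in the TE consensus (inclusive)


def may_align(a, b):
    """True if two TE hits could be placed in the same alignment column."""
    same_family = a.family == b.family
    same_strand = a.strand == b.strand
    # Intervals overlap if each starts before the other ends
    overlapping = a.te_start <= b.te_end and b.te_start <= a.te_end
    return same_family and same_strand and overlapping


full_plus = TEHit("P-element", "+", 1, 2907)
full_minus = TEHit("P-element", "-", 1, 2907)
five_prime = TEHit("P-element", "+", 1, 1000)
three_prime = TEHit("P-element", "+", 1908, 2907)

print(may_align(full_plus, full_minus))    # False: opposite strands
print(may_align(five_prime, three_prime))  # False: no shared sequence
print(may_align(full_plus, five_prime))    # True
```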