Please read [validation_fixed] first as this provides an introduction to our validation strategy.
Here we simulate a population of 5 individuals. In total, 246 TE insertions are segregating in the population, 2 for each of the 123 TE families found in D. melanogaster. The positions of the TE insertions are random. In contrast to the other validations, here the population frequency is assigned randomly per TE family, i.e. both insertions of one family have the same population frequency. With this approach we know exactly the expected population frequency for each family, which allows us to validate the manna alignment independently of the order of the TE insertions.
The population frequency of a family is the number of individuals carrying the insertion. It ranges from 1 to 5, where 5 means the insertion is fixed.
All files from this validation can be found here: https://sourceforge.net/projects/manna/files/validation/val-mixpopfreq/
The following is an excerpt of the resulting pgd file.
...
430580 57 57 * * *
432634 122 * * * *
437183 115 115 115 115 115
440327 43 * * * 43
443861 * 49 * 49 49
443945 104 * * * *
447797 * * 34 * *
448679 115 115 115 115 115
...
Every family, e.g. 115, has exactly 2 insertions. Both insertions of each family have exactly the same population frequency (i.e. fixed in the case of 115).
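These two properties can be checked directly on a pgd file. The following is a minimal sketch and not part of manna. It assumes whitespace-separated columns with the position first, one column per individual, and `*` marking an absent insertion, as in the excerpt above. The function name `family_frequencies` is hypothetical.

```python
from collections import defaultdict

def family_frequencies(pgd_lines, absent="*"):
    """Map each family ID to the population frequencies of its insertions."""
    freqs = defaultdict(list)
    for line in pgd_lines:
        fields = line.split()
        # Skip blank lines and elision markers such as "..."
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        carriers = [f for f in fields[1:] if f != absent]
        if not carriers:
            continue
        # Assumption: a single site carries insertions of one family only
        family = carriers[0]
        freqs[family].append(len(carriers))
    return dict(freqs)

excerpt = """\
437183 115 115 115 115 115
443861 * 49 * 49 49
448679 115 115 115 115 115
""".splitlines()

for family, f in family_frequencies(excerpt).items():
    print(family, f)
```

On the complete pgd file, every family should map to a list of exactly two identical values.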
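Next, the population genome is built from the pgd file, the TEs are annotated with RepeatMasker, and the manna alignment is generated from the RepeatMasker output:
python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --pgd mixfamfreq.pgd --te-seqs ../teseq.fasta --output mixfamfreq.fasta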
RepeatMasker --frag 2000000 -pa 1 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta mixfamfreq.fasta
python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm mixfamfreq.fasta.out > mixfamfreq.manna
First, we need a file containing the expected population frequency for each TE family. It is called 'familyfrequencies.txt' and is provided in the folder mentioned above.
Here is a small sample of this file:
...
G5A 109 5
GTWIN 52 1
TRANSIB1 72 3
Beagle 47 3
DMHFL1 19 2
GYPSY10 114 1
FROGGER 83 3
...
For example, the two G5A insertions should have a frequency of 5 (i.e. fixed) while the two GTWIN insertions should have a frequency of 1 (i.e. singletons).
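To illustrate how this file can be read, here is a small sketch. It assumes three whitespace-separated columns: the family name, a numeric family ID (the meaning of the second column is an assumption, it is not stated here), and the expected population frequency. The function name `read_family_frequencies` is hypothetical and not part of manna.

```python
def read_family_frequencies(lines, n_individuals=5):
    """Return {family name: expected population frequency}."""
    expected = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and "..." markers
        # Assumption: column 2 is a numeric family ID and is not needed here
        name, _family_id, freq = fields
        freq = int(freq)
        if not 1 <= freq <= n_individuals:
            raise ValueError(f"{name}: frequency {freq} outside 1..{n_individuals}")
        expected[name] = freq
    return expected

sample = ["G5A 109 5", "GTWIN 52 1", "TRANSIB1 72 3"]
expected = read_family_frequencies(sample)
fixed = [name for name, f in expected.items() if f == 5]
singletons = [name for name, f in expected.items() if f == 1]
print("fixed:", fixed)
print("singletons:", singletons)
```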
Next, we compare the observed and the expected population frequency for each TE insertion:
python ~/dev/manna/validation/manna-vs-fampopfreq.py --manna mixfamfreq.manna --max-div 5 --min-len 100 --fampopfreq familyfrequencies.txt > mixfamfreq.expobs
The resulting output file has, for example, the following structure:
...
GYPSY5 4 4
GYPSY5 4 3
DMRER1DM 4 4
DMRER1DM 4 4
1360 1 1
1360 1 1
...
For each family (col1), the expected (col2) and observed (col3) population frequencies are shown for its two insertions. Note that for the second insertion of GYPSY5 the expected and observed frequencies differ. This will of course lead to an additional (third) GYPSY5 insertion with a population frequency of 4-3=1. However, only the two simulated insertions are considered for this validation.
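The comparison can be summarised by counting, for each expected frequency, how many insertions were recovered with a different observed frequency. This is a minimal sketch assuming the three-column layout described above. The function name `tally` is hypothetical and not part of manna.

```python
from collections import Counter

def tally(expobs_lines):
    """Count total and mismatching insertions per expected frequency."""
    total, wrong = Counter(), Counter()
    for line in expobs_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and "..." markers
        _family, exp, obs = fields
        exp, obs = int(exp), int(obs)
        total[exp] += 1
        if exp != obs:
            wrong[exp] += 1
    return total, wrong

sample = [
    "GYPSY5 4 4",
    "GYPSY5 4 3",
    "DMRER1DM 4 4",
    "DMRER1DM 4 4",
    "1360 1 1",
    "1360 1 1",
]
total, wrong = tally(sample)
for exp in sorted(total):
    print(f"expected {exp}: {total[exp]} insertions, {wrong[exp]} mismatches")
```

For the six sample lines this reports one mismatch among the four insertions with an expected frequency of 4, and none among the two singletons.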
Finally, we can visualize the expected (x-axis) and the observed (y-axis) population frequency for all 246 simulated TE insertions with a jitter plot.
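Because both frequencies are small integers, many insertions fall on the same point, which is why jitter is useful. The sketch below shows one way to jitter (expected, observed) pairs before plotting. It is an illustration only and not the script used to produce the plot; the function name `jitter`, the jitter width, and the seed are assumptions.

```python
import random

def jitter(pairs, width=0.2, seed=0):
    """Add uniform noise in [-width, width] to both coordinates of each pair."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [
        (x + rng.uniform(-width, width), y + rng.uniform(-width, width))
        for x, y in pairs
    ]

# (expected, observed) pairs taken from the sample output above
pairs = [(4, 4), (4, 3), (4, 4), (4, 4), (1, 1), (1, 1)]
points = jitter(pairs)
for (x, y), (jx, jy) in zip(pairs, points):
    print(f"({x}, {y}) -> ({jx:.2f}, {jy:.2f})")
```

The jittered points can then be passed to any scatter-plot routine, with the expected frequency on the x-axis and the observed frequency on the y-axis.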
Even in this scenario, which is difficult for an MSA (no natural population structure), the population frequencies of the individual TE insertions are estimated quite accurately. For 16 out of 246 insertions (6.5%) the observed population frequency was not identical to the expected one. The population frequency was correct for all singletons and all fixed insertions.