validation_mixfamfreq

Robert Kofler
Attachments
val-expobs.png (194292 bytes)

Introduction

Please read [validation_fixed] first as this provides an introduction to our validation strategy.
Here we simulate a population of 5 individuals. In total, 246 TE insertions are segregating in the population, 2 for each of the 123 TE families found in D. melanogaster. The positions of the TE insertions are random. In contrast to the other validations, the population frequency is here drawn at random per TE family, i.e. both insertions of one family have the same population frequency. With this approach we know the exact expected population frequency for each family, which allows us to validate the manna alignment independently of the order of the TE insertions.
The population frequency of the families ranges from 1 to 5, where 1 is a singleton and 5 is fixed (present in all 5 individuals).
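
The simulation design described above can be sketched in a few lines of Python. This is only an illustration, not the script used for this validation: the function name, the genome length and the seed are hypothetical, and the row layout (position followed by one column per individual, with '*' for absence) is assumed from the pgd excerpt shown further down.

```python
import random

def simulate_pgd(n_families=123, n_individuals=5, genome_len=1_000_000, seed=0):
    """Return sorted rows of (position, [family ID or '*' per individual])."""
    rng = random.Random(seed)
    # Two distinct random positions per family (genome_len is a made-up value).
    positions = rng.sample(range(1, genome_len + 1), 2 * n_families)
    rows = []
    for fam in range(1, n_families + 1):
        # One random population frequency per family, shared by both insertions.
        freq = rng.randint(1, n_individuals)
        for k in (0, 1):
            pos = positions[2 * (fam - 1) + k]
            # The carriers are chosen independently for each insertion.
            carriers = set(rng.sample(range(n_individuals), freq))
            row = [str(fam) if i in carriers else "*" for i in range(n_individuals)]
            rows.append((pos, row))
    rows.sort()
    return rows

rows = simulate_pgd()
for pos, row in rows[:3]:
    print(pos, *row)
```

With the default arguments this yields 246 rows, two per family, and both rows of a family carry the same number of non-'*' entries.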

Material and Methods

Data

All files from this validation can be found here: https://sourceforge.net/projects/manna/files/validation/val-mixpopfreq/

pgd

The following is an excerpt of the resulting pgd file.

...
430580 57 57 * * *
432634 122 * * * *
437183 115 115 115 115 115
440327 43 * * * 43
443861 * 49 * 49 49
443945 104 * * * *
447797 * * 34 * *
448679 115 115 115 115 115
...

Every family (e.g. 115) has exactly 2 insertions. Both insertions of a family have exactly the same population frequency (i.e. fixed in the case of 115).
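
This property can be checked directly on a pgd file. The following is a minimal sketch, assuming that each data row consists of a position followed by one column per individual, holding either the family ID or '*'; the function name is hypothetical and the input is the excerpt from above.

```python
from collections import defaultdict

def family_frequencies(pgd_lines):
    """Map family ID -> list of population frequencies, one per insertion."""
    freqs = defaultdict(list)
    for line in pgd_lines:
        fields = line.split()
        # Skip blank lines and the '...' placeholders of the excerpt.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        carriers = [f for f in fields[1:] if f != "*"]
        if carriers:
            # Frequency = number of individuals carrying the insertion.
            freqs[carriers[0]].append(len(carriers))
    return dict(freqs)

excerpt = """\
...
430580 57 57 * * *
432634 122 * * * *
437183 115 115 115 115 115
440327 43 * * * 43
443861 * 49 * 49 49
443945 104 * * * *
447797 * * 34 * *
448679 115 115 115 115 115
...
"""

freqs = family_frequencies(excerpt.splitlines())
for family, values in freqs.items():
    # Only families with both insertions inside the excerpt can be compared.
    if len(values) == 2:
        print(family, values, "consistent" if values[0] == values[1] else "MISMATCH")
```

In the excerpt only family 115 has both of its insertions visible, so it is the only one reported. On the complete file every family should appear with two identical values.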

population genome

python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --pgd mixfamfreq.pgd --te-seqs ../teseq.fasta --output mixfamfreq.fasta

RepeatMasking

RepeatMasker --frag 2000000 -pa 1 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta mixfamfreq.fasta

Manna alignment

python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm mixfamfreq2.fasta.out > mixfamfreq2.manna

Expected vs observed

the expected population frequency

First, we need a file containing the expected population frequency for each TE family. It is called 'familyfrequencies.txt' and is provided in the folder mentioned above.
Here is a small sample of this file:

...
G5A     109     5
GTWIN   52      1
TRANSIB1        72      3
Beagle  47      3
DMHFL1  19      2
GYPSY10 114     1
FROGGER 83      3
...

For example, the two G5A insertions should have a frequency of 5 (i.e. fixed) while the two GTWIN insertions should have a frequency of 1 (i.e. singletons).
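
A minimal sketch of reading this file is given below. It assumes three whitespace-separated columns: the family name, a numeric family ID (presumably the ID used in the pgd file) and the expected population frequency. The interpretation of the second column and the function name are assumptions and are not stated on this page.

```python
def read_family_frequencies(lines):
    """Map family name -> (family ID, expected population frequency)."""
    expected = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and '...' placeholders
        name, fam_id, freq = fields
        # Column meaning is assumed: name, numeric family ID, frequency.
        expected[name] = (int(fam_id), int(freq))
    return expected

sample = """\
G5A     109     5
GTWIN   52      1
TRANSIB1        72      3
Beagle  47      3
DMHFL1  19      2
GYPSY10 114     1
FROGGER 83      3
"""

expected = read_family_frequencies(sample.splitlines())
fixed = sorted(n for n, (_, f) in expected.items() if f == 5)
singletons = sorted(n for n, (_, f) in expected.items() if f == 1)
print("fixed:", fixed)
print("singletons:", singletons)
```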

relate expected and observed frequency

Next, we compare the observed and the expected population frequency for each TE insertion:

python ~/dev/manna/validation/manna-vs-fampopfreq.py --manna mixfamfreq.manna --max-div 5 --min-len 100 --fampopfreq familyfrequencies.txt > mixfamfreq.expobs

The resulting output file has, for example, the following structure:

...
GYPSY5  4       4
GYPSY5  4       3
DMRER1DM        4       4
DMRER1DM        4       4
1360    1       1
1360    1       1
...

For each family (col1) the expected (col2) and the observed (col3) population frequency are shown for its two insertions. Note that for the second insertion of GYPSY5 the expected and observed frequencies differ. This will of course lead to an additional (third) GYPSY5 insertion with the population frequency 4-3=1. However, only the two simulated insertions are considered for this validation.
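
The comparison can be summarized by counting the insertions for which the two values disagree. The following sketch assumes the three-column layout shown above; the function name is hypothetical and the input is just the sample, not the full result file.

```python
def summarize_expobs(lines):
    """Return (number of insertions, list of mismatching insertions)."""
    total = 0
    mismatches = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and '...' placeholders
        family, exp, obs = fields[0], int(fields[1]), int(fields[2])
        total += 1
        if exp != obs:
            # exp - obs is the frequency left over for a spurious extra insertion.
            mismatches.append((family, exp, obs, exp - obs))
    return total, mismatches

sample = """\
GYPSY5  4       4
GYPSY5  4       3
DMRER1DM        4       4
DMRER1DM        4       4
1360    1       1
1360    1       1
"""

total, mismatches = summarize_expobs(sample.splitlines())
print(f"{len(mismatches)} of {total} insertions differ")
for family, exp, obs, rest in mismatches:
    print(family, "expected", exp, "observed", obs, "remainder", rest)
```

For the sample this reports one mismatch out of six insertions, namely the second GYPSY5 insertion with a remainder of 1.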

visualize results

Finally, we visualize the expected (x-axis) and the observed (y-axis) population frequency for all 246 simulated TE insertions with a jitter plot (attachment val-expobs.png).
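
A jitter plot adds a small random offset to each point so that the many insertions sharing the same integer frequencies do not fall exactly on top of each other. The sketch below is one possible way to draw such a plot and is not the code that produced the attached figure: the function names, the jitter amount and the output file name are hypothetical, and the plotting part assumes that matplotlib is installed.

```python
import random

def jitter(values, amount=0.2, seed=42):
    """Add a uniform random offset in [-amount, amount] to every value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

def plot_expobs(pairs, out_png="expobs-jitter.png"):
    """Scatter expected (x) against observed (y) frequencies with jitter."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt
    # Different seeds so that x and y offsets are not identical.
    x = jitter([e for e, _ in pairs], seed=1)
    y = jitter([o for _, o in pairs], seed=2)
    plt.figure()
    plt.scatter(x, y, s=10, alpha=0.5)
    plt.xlabel("expected population frequency")
    plt.ylabel("observed population frequency")
    plt.savefig(out_png)
    plt.close()

# (expected, observed) pairs taken from the sample of the .expobs file
pairs = [(4, 4), (4, 3), (4, 4), (4, 4), (1, 1), (1, 1)]
jittered = jitter([e for e, _ in pairs])
# plot_expobs(pairs)  # uncomment to write the PNG (requires matplotlib)
```

Correctly estimated insertions cluster around the diagonal, while mismatches such as the second GYPSY5 insertion show up as points off the diagonal.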

Conclusion

Even in this difficult scenario for an MSA (no natural population structure), the population frequencies of the individual TE insertions are estimated quite accurately. For 16 out of 246 insertions the observed population frequency was not identical to the expected one. The population frequency was correct for all singletons and all fixed insertions.


Related

Wiki: Home
Wiki: validation_fixed
