Please read [validation_fixed] first, as it provides an introduction to our validation strategy.
Here we simulate a population of 5 individuals. In total, 246 TE insertions are segregating in the population, 2 for each of the 123 TE families found in D. melanogaster. The position and the population frequency of each TE insertion are random. The population frequency ranges from 1 to 5, where 5 means the insertion is fixed. One of the 5 individuals acts as a backbone, i.e. it carries each of the 246 TE insertions. The other 4 individuals will have fewer than 246 TE insertions.
The "backbone" individual establishes an unambiguous order for the TEs and thus allows us to validate the population frequency as well as the order of the TE insertions.
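The design described above can be illustrated with a minimal Python sketch. It is not the script used to build the actual validation data; the function name, the genome length, and the seed are hypothetical, and it only assumes what the text states: 2 insertions for each of 123 families, random positions, a random population frequency between 1 and 5, and a first individual that acts as the backbone.

```python
import random

def simulate_insertions(n_families=123, per_family=2, n_individuals=5,
                        genome_length=1_000_000, seed=0):
    """Return (position, carriers) rows sorted by position.

    genome_length and seed are illustrative assumptions, not values
    taken from the validation data.
    """
    rng = random.Random(seed)
    n_sites = n_families * per_family
    # Distinct random positions, sorted along the chromosome
    positions = sorted(rng.sample(range(1, genome_length + 1), n_sites))
    # Each family ID appears per_family times, in random order
    te_ids = [fam for fam in range(1, n_families + 1) for _ in range(per_family)]
    rng.shuffle(te_ids)
    rows = []
    for pos, te in zip(positions, te_ids):
        freq = rng.randint(1, n_individuals)  # 1..5, where 5 means fixed
        # The first individual (backbone) always carries the insertion
        carriers = [str(te)] * freq + ["*"] * (n_individuals - freq)
        rows.append((pos, carriers))
    return rows

rows = simulate_insertions()
```

Because the frequency is at least 1 and carriers are filled from the first column, the first individual carries all 246 insertions and therefore fixes their order.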
All files are available here https://sourceforge.net/projects/manna/files/validation/val-linland/
We use a pgd file that specifies the population described in the Introduction (i.e. N=5, with 246 insertions; one individual is the backbone). Here is a short excerpt from the file:
...
465155 37 37 37 * *
467326 85 85 85 85 85
467834 123 123 123 * *
470274 10 10 10 10 10
475920 98 98 98 98 *
478463 15 15 * * *
479903 7 7 * * *
...
Note that the first column specifies the position in the chassis. The next five columns specify the 5 individuals in the population. An integer indicates the ID of a TE (e.g. 37=DM_ROO) and a star indicates the absence of the TE in that individual. For more background on the pgd file see https://sourceforge.net/projects/simulates/
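As an illustration of how such rows can be read, the following Python sketch parses the excerpt lines and derives the expected population frequency as the number of individuals carrying the insertion. The function names are hypothetical, and the sketch assumes only the whitespace-separated layout shown in the excerpt; a real pgd file may contain further line types that are not handled here.

```python
def parse_pgd_line(line):
    """Split one insertion row into (position, genotypes).

    A genotype is the TE ID as an int, or None where the row has a star.
    """
    fields = line.split()
    position = int(fields[0])
    genotypes = [None if f == "*" else int(f) for f in fields[1:]]
    return position, genotypes

def population_frequency(genotypes):
    """Count the individuals that carry the insertion."""
    return sum(g is not None for g in genotypes)

example = """465155 37 37 37 * *
467326 85 85 85 85 85
478463 15 15 * * *"""

for line in example.splitlines():
    pos, gts = parse_pgd_line(line)
    print(pos, population_frequency(gts))
# prints:
# 465155 3
# 467326 5
# 478463 2
```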
python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --pgd linearlandscape.pgd --te-seqs ../teseq.fasta --output linearlandscape.fasta
RepeatMasker --frag 2000000 -pa 1 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta linearlandscape.fasta
Here we use an option mainly developed to speed up validation (--quick-rm). Usually it is necessary to provide, for each aligned annotation, a sample ID and the annotation in a separate file. Since RepeatMasker provides the sample IDs in the output file, it is not necessary to have the cluster annotations in a separate file. We utilize this feature with the --quick-rm option. However, this option can of course also be used with any data.
python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm linearlandscape.fasta.out > linearlandscape.manna
python ~/dev/manna/validation/manna-vs-pgd-mhp.py --min-len 100 --max-div 5 --manna linearlandscape.manna --pgd linearlandscape.pgd > linearlandscape.mhp
# next use R to visualize the result
R --vanilla --args linearlandscape.mhp < ~/dev/manna/validation/manhatten.R
This gives the following figure
Black circles are the expected population frequency and red crosses are the observed TE population frequencies.
Both the population frequency (overlap between black circle and red cross) and the order of the TEs are correctly reproduced by our approach. The correct order is verified by the script manna-vs-pgd-mhp.py. If the order were incorrect, TE insertions would no longer be reported, beginning at the first TE that is out of order.
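The behaviour of stopping at the first out-of-order TE can be pictured with the short Python sketch below. It is not the implementation of manna-vs-pgd-mhp.py, which is not shown here; the function name and the toy ID lists are hypothetical, and the sketch assumes that both the expected and the observed insertions are available as lists of TE IDs in chromosomal order.

```python
def in_order_prefix(expected, observed):
    """Return the observed TE IDs up to (excluding) the first mismatch."""
    matched = []
    for exp, obs in zip(expected, observed):
        if exp != obs:
            break  # first TE out of order: nothing after it is reported
        matched.append(obs)
    return matched

expected = [37, 85, 123, 10, 98]   # order defined by the backbone individual
observed = [37, 85, 10, 123, 98]   # hypothetical result with two TEs swapped

print(in_order_prefix(expected, observed))
# prints: [37, 85]
```

With this kind of check, a complete set of reported insertions implies that the observed order matches the backbone order.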