Here we simulate a scenario in which fixed and segregating insertions alternate. For each of the 123 TE families we simulate one fixed insertion (count of 5, i.e. present in all samples) and one segregating insertion (count between 1 and 4).
In this validation we will be brief and mostly show only the commands. For a more comprehensive explanation of our validation approach, please read [validation_fixed] first.
All files of this validation are available at https://sourceforge.net/projects/manna/files/validation/val2-fixseg
We use a pgd file that specifies alternating fixed and segregating TE insertions (fixseg.pgd, see availability above).
An example from the file:
...
46491 52 52 52 52 52
47653 73 * 73 73 73
48844 42 42 42 42 42
50089 * 12 12 12 *
51455 122 122 122 122 122
52164 * * 75 * *
...
Note: A star indicates that a TE (the integer is the ID of the TE) is missing in a given sample/strain. For example, the insertion of TE 75 is segregating and occurs in only a single individual, whereas the insertion of TE 122 is fixed, occurring in all individuals. For more details see the documentation of SimulaTE.
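To make the notation concrete, here is a minimal Python sketch that reads lines like those in the excerpt above and reports how many samples carry each insertion. It assumes whitespace-separated lines with the position in the first column and one entry per sample after it. It only handles the simplified lines shown here, not the full pgd syntax of SimulaTE. The function name `insertion_frequency` is hypothetical and not part of SimulaTE or Manna.

```python
def insertion_frequency(pgd_line):
    """Return (position, te_id, count, n_samples) for one excerpt-style line.

    Assumption: first field is the insertion position, the remaining
    fields are one entry per sample, and '*' marks an absent insertion.
    """
    fields = pgd_line.split()
    position = int(fields[0])
    calls = fields[1:]
    present = [c for c in calls if c != "*"]
    te_id = present[0] if present else None
    return position, te_id, len(present), len(calls)


excerpt = [
    "46491 52 52 52 52 52",
    "47653 73 * 73 73 73",
    "50089 * 12 12 12 *",
    "52164 * * 75 * *",
]

for line in excerpt:
    pos, te, count, n = insertion_frequency(line)
    status = "fixed" if count == n else "segregating"
    print(pos, te, f"{count}/{n}", status)
```

With the excerpt lines, TE 52 is reported as fixed (5/5) and TE 75 as segregating (1/5), matching the description in the note.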
python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --pgd fixseg.pgd --te-seqs ../teseq.fasta --output fixseg.fasta
RepeatMasker --frag 2000000 -pa 1 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta fixseg.fasta
python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm fixseg.fasta.out > fixseq.manna
An example from the Manna alignment with segregating insertions.
...
- - - INVADER6 4885.0 0.0 - - - INVADER6 4885.0 0.0 - - -
HELITRON1_DM 564.0 0.0 HELITRON1_DM 564.0 0.0 HELITRON1_DM 564.0 0.0 HELITRON1_DM 564.0 0.0 HELITRON1_DM 564.0 0.0
- - - - - - - - - INVADER3 5484.0 0.0 - - -
BS4 754.0 0.0 BS4 754.0 0.0 BS4 754.0 0.0 BS4 754.0 0.0 BS4 754.0 0.0
RT1C 5443.0 0.0 RT1C 5443.0 0.0 - - - - - - RT1C 5443.0 0.0
TABOR 7345.0 0.0 TABOR 7345.0 0.0 TABOR 7345.0 0.0 TABOR 7345.0 0.0 TABOR 7345.0 0.0
- - - DMRER1DM 5356.0 0.0 DMRER1DM 5356.0 0.0 - - - - - -
DME278684 5108.0 0.0 DME278684 5108.0 0.0 DME278684 5108.0 0.0 DME278684 5108.0 0.0 DME278684 5108.0 0.0
DM_ROO 9092.0 0.0 DM_ROO 9092.0 0.0 DM_ROO 9092.0 0.0 - - - - - -
...
In this example INVADER6 is segregating and occurs in only two samples/strains, whereas HELITRON1_DM is fixed. Note that this is the intermediate output; more details (e.g. the strand of the TE insertions and the position in the query) can be obtained via --output-detail.
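The rows of the intermediate output can be read in the same spirit. The following is a rough Python sketch, assuming that each sample occupies three whitespace-separated fields (the TE family followed by two numeric values) and that "- - -" marks a sample without the insertion, as in the excerpt above. The meaning of the two numeric fields is not interpreted here, and the function name `samples_with_insertion` is hypothetical and not part of Manna.

```python
def samples_with_insertion(manna_row, fields_per_sample=3):
    """Return (family, sample_indices, n_samples) for one excerpt-style row.

    Assumption: the row is a flat, whitespace-separated list in which
    every sample contributes `fields_per_sample` tokens and an absent
    insertion is written as '-' in each of its fields.
    """
    tokens = manna_row.split()
    if len(tokens) % fields_per_sample != 0:
        raise ValueError("row length is not a multiple of fields_per_sample")
    # Group the flat token list into one chunk per sample.
    chunks = [tokens[i:i + fields_per_sample]
              for i in range(0, len(tokens), fields_per_sample)]
    # Samples are numbered from 1, left to right.
    present = [idx for idx, chunk in enumerate(chunks, start=1)
               if chunk[0] != "-"]
    family = next((chunk[0] for chunk in chunks if chunk[0] != "-"), None)
    return family, present, len(chunks)


row = "- - - INVADER6 4885.0 0.0 - - - INVADER6 4885.0 0.0 - - -"
family, present, n = samples_with_insertion(row)
print(family, present, f"{len(present)}/{n}")
```

For the INVADER6 row this lists samples 2 and 4, i.e. the insertion is present in two of the five samples.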
python ~/dev/manna/validation/manna-vs-pgd-mhp.py --min-len 100 --max-div 5 --manna fixseq.manna --pgd fixseg.pgd > fixseq.mhp
R --vanilla --args fixseq.mhp < ~/dev/manna/validation/manhatten.R
The population frequency of all insertions, i.e. the segregating and fixed insertions of all 123 TE families, is correctly estimated. The order of the insertions was also correctly inferred.