validation_fix_seg

Robert Kofler
Attachments
fixseg.mhp.png (497930 bytes)

Introduction

Here we simulate a scenario in which fixed and segregating insertions alternate. For each of the 123 TE families we simulate one fixed insertion (count of 5, i.e. present in all samples) and one segregating insertion (count between 1 and 4).
In this validation we will be brief and mostly show only the commands. For a more comprehensive explanation of our validation approach, please read [validation_fixed] first.

Material and Methods

Data

All files of this validation are available at https://sourceforge.net/projects/manna/files/validation/val2-fixseg

pgd

We use a pgd file that specifies alternating fixed and segregating TE insertions (fixseg.pgd, see availability above).
An example from the file:

...
46491 52 52 52 52 52
47653 73 * 73 73 73
48844 42 42 42 42 42
50089 * 12 12 12 *
51455 122 122 122 122 122
52164 * * 75 * *
...

Note: A star indicates that a TE is missing in a given sample/strain; the integer number is the ID of a TE. For example, the insertion of the TE '75' is segregating and occurs in only a single individual, whereas the insertion of the TE '122' is fixed and occurs in all individuals. For more details see the documentation of SimulaTE.
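
To make the notation concrete, the following minimal sketch classifies each insertion line of the excerpt above as fixed or segregating. It is not part of SimulaTE or Manna. It assumes that every insertion line consists of a position followed by one whitespace-separated column per sample, and it ignores any other content a pgd file may contain. The helper names (parse_pgd_line, classify, summarize_pgd) are hypothetical.

```python
# Sketch only: assumes each insertion line is "<position> <id|*> <id|*> ...",
# with one column per sample, as in the excerpt above.
PGD_EXCERPT = """\
46491 52 52 52 52 52
47653 73 * 73 73 73
48844 42 42 42 42 42
50089 * 12 12 12 *
51455 122 122 122 122 122
52164 * * 75 * *
"""


def parse_pgd_line(line):
    """Return (position, per-sample presence list) for one insertion line."""
    fields = line.split()
    position = int(fields[0])
    # A star means the TE is missing in that sample.
    presence = [field != "*" for field in fields[1:]]
    return position, presence


def classify(presence):
    """Label an insertion as fixed, segregating, or absent."""
    count = sum(presence)
    if count == len(presence):
        return "fixed"
    return "segregating" if count > 0 else "absent"


def summarize_pgd(text):
    """Hypothetical helper: list (position, count, label) per insertion line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        position, presence = parse_pgd_line(line)
        rows.append((position, sum(presence), classify(presence)))
    return rows


for position, count, label in summarize_pgd(PGD_EXCERPT):
    print(position, count, label)
```

For the six lines of the excerpt, the labels alternate between fixed and segregating, which is the pattern this validation is designed around.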

Population genome

python ~/dev/simulates/build-population-genome.py --chassis ../chasis.txt --pgd fixseg.pgd --te-seqs ../teseq.fasta --output fixseg.fasta     

RepeatMasking

RepeatMasker --frag 2000000 -pa 1 -no_is -s -nolow -dir . -lib ../teseq-clean-ml100noS4.fasta fixseg.fasta  

Manna alignment

python ~/dev/manna/cluster-msa.py --clusters "" --sample-IDs "" --quick-rm fixseg.fasta.out > fixseq.manna

An example from the Manna alignment with segregating insertions.

...
-       -       -       INVADER6        4885.0  0.0     -       -       -       INVADER6        4885.0  0.0     -       -       -
HELITRON1_DM    564.0   0.0     HELITRON1_DM    564.0   0.0     HELITRON1_DM    564.0   0.0     HELITRON1_DM    564.0   0.0     HELITRON1_DM    564.0   0.0
-       -       -       -       -       -       -       -       -       INVADER3        5484.0  0.0     -       -       -
BS4     754.0   0.0     BS4     754.0   0.0     BS4     754.0   0.0     BS4     754.0   0.0     BS4     754.0   0.0
RT1C    5443.0  0.0     RT1C    5443.0  0.0     -       -       -       -       -       -       RT1C    5443.0  0.0
TABOR   7345.0  0.0     TABOR   7345.0  0.0     TABOR   7345.0  0.0     TABOR   7345.0  0.0     TABOR   7345.0  0.0
-       -       -       DMRER1DM        5356.0  0.0     DMRER1DM        5356.0  0.0     -       -       -       -       -       -
DME278684       5108.0  0.0     DME278684       5108.0  0.0     DME278684       5108.0  0.0     DME278684       5108.0  0.0     DME278684       5108.0  0.0
DM_ROO  9092.0  0.0     DM_ROO  9092.0  0.0     DM_ROO  9092.0  0.0     -       -       -       -       -       -

...

In this example INVADER6 is segregating and occurs in only two samples/strains, whereas HELITRON1_DM is fixed. Note that this is the intermediate output; more details (e.g. the strand of the TE insertions and the position in the query) can be obtained via --output-detail.
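
As an illustration of how such a row can be read, the sketch below splits a row into per-sample groups and counts the samples that carry the insertion. It is not part of Manna. It assumes three whitespace-separated columns per sample with the family name first and "-" marking a missing insertion, as in the excerpt; the two numeric columns are not interpreted here. The names parse_manna_row and population_count are hypothetical.

```python
# Sketch only: assumes three whitespace-separated columns per sample and
# "-" as the placeholder for a missing insertion, as in the excerpt above.
COLUMNS_PER_SAMPLE = 3

EXAMPLE_ROWS = [
    "- - - INVADER6 4885.0 0.0 - - - INVADER6 4885.0 0.0 - - -",
    "HELITRON1_DM 564.0 0.0 " * 5,
]


def parse_manna_row(line):
    """Return the TE family per sample, or None where the insertion is missing."""
    fields = line.split()
    if len(fields) % COLUMNS_PER_SAMPLE != 0:
        raise ValueError("row does not contain three columns per sample")
    families = []
    for start in range(0, len(fields), COLUMNS_PER_SAMPLE):
        name = fields[start]  # first column of each group is the family name
        families.append(None if name == "-" else name)
    return families


def population_count(families):
    """Number of samples in which the insertion is present."""
    return sum(family is not None for family in families)


for row in EXAMPLE_ROWS:
    families = parse_manna_row(row)
    print(families, population_count(families))
```

With the two rows above, INVADER6 is counted in two of five samples and HELITRON1_DM in all five, matching the description of the excerpt.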

Expected vs observed and Manhattan plot

python ~/dev/manna/validation/manna-vs-pgd-mhp.py --min-len 100 --max-div 5 --manna fixseq.manna --pgd fixseg.pgd > fixseg.mhp
R --vanilla --args fixseg.mhp < ~/dev/manna/validation/manhatten.R
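
The idea behind an expected-vs-observed comparison can be sketched as follows. This is not the logic of manna-vs-pgd-mhp.py, which we do not reproduce here. The sketch assumes that expected (pgd) and observed (Manna) insertions are already available as per-sample presence lists in the same order, and it uses made-up toy data rather than results of this validation. The function name compare_presence is hypothetical.

```python
# Sketch only: compares presence patterns row by row. Assumes both inputs
# list the insertions in the same order with one boolean per sample.
def compare_presence(expected, observed):
    """Return (index, expected_freq, observed_freq, pattern_matches) per insertion."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed differ in number of insertions")
    report = []
    for index, (exp, obs) in enumerate(zip(expected, observed)):
        expected_freq = sum(exp) / len(exp)
        observed_freq = sum(obs) / len(obs)
        report.append((index, expected_freq, observed_freq, exp == obs))
    return report


# Toy data (not results from this validation): one fixed and one
# segregating insertion across five samples.
toy_expected = [
    [True, True, True, True, True],
    [False, True, True, True, False],
]
toy_observed = [
    [True, True, True, True, True],
    [False, True, True, True, False],
]

for index, exp_freq, obs_freq, matches in compare_presence(toy_expected, toy_observed):
    print(index, exp_freq, obs_freq, matches)
```

Comparing the full presence pattern, rather than only the frequency, also catches cases where the count is right but the insertion is assigned to the wrong samples.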

Results

[Figure: fixseg.mhp.png (see Attachments)]

Conclusion

The population frequency of all insertions, i.e. the segregating and fixed insertions of all 123 TE families, is correctly estimated. The order of the insertions was also correctly inferred.


Related

Wiki: Home
Wiki: validation_fixed
