Here, we validate SimulaTE. We generate a complex TE landscape with SimulaTE and test whether the observed TE landscape (using PoPoolationTE2) agrees with the simulated one.
We simulate a TE landscape for a population of N=100 (haploid individuals) and a genome of size 1MB. The simulated TE insertions have random i) position ii) strand iii) family and iv) population frequency (within a range of 0.1 and 0.9).
Subsequently, we simulate Illumina paired-end reads with a read length of 100 and an inner distance of about 100. Next we identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TE insertions (expected) agree with the TEs identified by PoPoolationTE2 (observed).
Download the following two files:
First we generate a pgd-file (population genome description; see manual). The resulting pgd-file specifies a TE landscape with 1000 TE insertions having random family, population frequency (between 0.1 and 0.9), position and strand.
python define-landscape_random-insertions-freq-range.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --N 100 --insert-count 1000 --min-distance 700 --min-freq 0.1 --max-freq 0.9 --output landscape01-09.pgd
An example output file can be found here: https://sourceforge.net/projects/simulates/files/validation_pop2/landscape01-09.pgd/download
Based on the pgd file we create the population genome:
python build-population-genome.py --pgd landscape01-09.pgd --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --output landscape01-09.pg
First we need to decide how many paired end reads should be simulated.
If we target a coverage of 200 than we need 3328390 paired end reads (read length 100)
# first, estimate the length of the haploid genomes with samtools faidx
samtools faidx landscape01-09.pg
# second compute the length of the average haploid genome
cat landscape01-09.pg.fai |awk '{l+=$2}END{print l/NR}'
# which gives: 3.32839e+06
# thus the required number of reads is 3328390*200/200 (coverage 200; and paired end length 200)
# => 3328390
Next we simulate the Illumina paired-end reads
python read_pool-seq_illumina-PE.py --pg landscape01-09.pg --read-length 100 --inner-distance 100 --std-dev 20 --reads 3328390 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
PoPoolationTE2 can be obtained here https://sourceforge.net/projects/popoolation-te2/
For details on TE identification with PoPoolationTE2 see: https://sourceforge.net/p/popoolation-te2/wiki/Home/
Mapping
cat teseq-clean-ml100noS4.fasta chasis1M.fasta > chasis_te.fasta
bwa index chasis_te.fasta
bwa bwasw -t 6 chasis_te.fasta reads_1.fastq|samtools view -Sb - > reads_1.bam
bwa bwasw -t 6 chasis_te.fasta reads_2.fastq|samtools view -Sb - > reads_2.bam
java -jar popte2.jar se2pe --fastq1 reads_1.fastq --fastq2 reads_2.fastq --bam1 reads_1.bam --bam2 reads_2.bam --sort --output reads.sort.bam
Identifying TEs with PoPoolationTE2
java -jar popte2.jar ppileup --bam reads.sort.bam --map-qual 15 --hier tehier-ml100noS4.txt --output pp.gz
java -jar popte2.jar identifySignatures --ppileup pp.gz --mode separate --min-count 2 --output te.signatures
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar updatestrand --bam reads.sort.bam --signature te.signatures --output testrand.signatures --hier tehier-ml100noS4.txt --map-qual 15 --max-disagreement 0.4
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar frequency --ppileup pp.gz --signature testrand.signatures --output te.freqsignatures
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar filterSignatures --input te.freqsignatures --output tefiltered.freqsignatures --max-otherte-count 2 --max-structvar-count 2
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar pairupsignatures --signature tefiltered.freqsignatures --ref-genome chasis_te.fasta --hier tehier-ml100noS4.txt --min-distance -200 --max-distance 300 --output tes.filtered.txt
In order to compare the expected and the observed TE landscape we require a summary of the pgd-file.
Download
compute the expected TE landscape
python statistics_TEfrequencies-pooled.py --pgd landscape01-09.pgd --hier tehier-ml100noS4.txt > expected.txt
the first ten lines of expected.txt
856 HMS-Beagle + 0.46
1640 gypsy10 + 0.46
2624 frogger + 0.16
3348 gypsy12 - 0.58
4065 Doc + 0.69
4892 looper1 - 0.34
5643 Penelope + 0.53
6396 rooA - 0.34
7308 R1A1-element + 0.42
8213 mdg3 - 0.85
The four fields from left to right are: position, family, strand and population frequency of the TE insertion
Download
compare the expected and the observed TE landscape
python compare-expected-observed.py --expected expected.txt --observed tes.filtered.txt --range 200 > results-exp-vs-obs.txt
This yields a comparision of the observed and expected results as in: https://sourceforge.net/projects/simulates/files/validation_pop2/results-exp-vs-obs.txt/download
Either use your results-exp-vs-obs.txt or download our results https://sourceforge.net/projects/simulates/files/validation_pop2/results-exp-vs-obs.txt/download
head results-exp-vs-obs.txt gives:
FOUND HMS-Beagle 856 844 + + 0.46 0.53 12.0 0.07 FR
FOUND gypsy10 1640 1614 + + 0.46 0.444 26.0 0.016 FR
FOUND frogger 2624 2628 + + 0.16 0.206 4.0 0.046 FR
FOUND gypsy12 3348 3349 - - 0.58 0.579 1.0 0.001 FR
FOUND Doc 4065 4064 + + 0.69 0.681 1.0 0.009 FR
FOUND looper1 4892 4898 - - 0.34 0.39 6.0 0.05 FR
FOUND Penelope 5643 5649 + + 0.53 0.533 6.0 0.003 FR
FOUND rooA 6396 6404 - - 0.34 0.348 8.0 0.008 FR
FOUND R1A1-element 7308 7293 + + 0.42 0.37 15.0 0.05 FR
FOUND mdg3 8213 8207 - - 0.85 0.874 6.0 0.024 FR
The fields from left to right are: found/missed/wrong, family, expected position, observed position, expected strand, observed strand, expected population frequency, observed population frequency, delta position (position difference), delta pop. freq., FR/R/F
Here we compute a summary of the results.
Download https://sourceforge.net/projects/simulates/files/validation_pop2/summary-expobs.py/download
and compute a summary with python summary-expobs.py results-exp-vs-obs.txt:
FOUND 1000
WRONG 1
FOUND STRAND 999
FOUND POS_MEAN 8.544
FOUND POS_STD 7.28093840106
FOUND AFD_MEAN 0.030669
FOUND AFD_STD 0.0239536518928
This tells us that:
Note that the slight deviations from the expected value are most likely the outcome of i) paired-end sampling variation and ii) inaccuracies of the approache for TE identification (identifiying TE insertions from Pool-Seq data is a challenge).
In the following graph we show the correlation between the expected (simulated) and the observed population frequency (estimated with PoPooaltionTE2) of the TE insertions.
We demonstrated that SimulaTE accurately simulates complex TE landscapes having for example 1000 random TE insertions.
Specificially we showed that the following key properties are accurately simulated
Wiki: Home
Wiki: Walkthrough
Wiki: Walkthrough_species_tool_compatibility