SimulaTE Wiki

Brought to you by: rokofler

Validation_Pop2

Attachments

cor.png (58894 bytes)

Introduction
- Prerequisites
Validation of SimulaTE
Results
- Summary
- Frequency distribution
Conclusion:

Introduction

Here, we validate SimulaTE. We generate a complex TE landscape with SimulaTE and test whether the observed TE landscape (using PoPoolationTE2) agrees with the simulated one.

We simulate a TE landscape for a population of N=100 (haploid individuals) and a genome of size 1MB. The simulated TE insertions have random i) position ii) strand iii) family and iv) population frequency (within a range of 0.1 and 0.9).

Subsequently, we simulate Illumina paired-end reads with a read length of 100 and an inner distance of about 100. Next we identify TE insertions using PoPoolation TE2 and the simulated reads. Finally we ask whether the simulated TE insertions (expected) agree with the TEs identified by PoPoolationTE2 (observed).

Prerequisites

Download the following two files:

chassis: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download
TE hierarchy; required for PoPoolationTE2 (assigns every fasta entry of a TE to a family) https://sourceforge.net/projects/simulates/files/validation_pop2/tehier-ml100noS4.txt/download

Validation of SimulaTE

Generate a pgd-file

First we generate a pgd-file (population genome description; see manual). The resulting pgd-file specifies a TE landscape with 1000 TE insertions having random family, population frequency (between 0.1 and 0.9), position and strand.

python define-landscape_random-insertions-freq-range.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --N 100 --insert-count 1000 --min-distance 700 --min-freq 0.1 --max-freq 0.9 --output landscape01-09.pgd

An example output file can be found here: https://sourceforge.net/projects/simulates/files/validation_pop2/landscape01-09.pgd/download

Create the population genome

Based on the pgd file we create the population genome:

python build-population-genome.py --pgd landscape01-09.pgd --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --output landscape01-09.pg

Simulate the reads

First we need to decide how many paired end reads should be simulated.

If we target a coverage of 200 than we need 3328390 paired end reads (read length 100)

# first, estimate the length of the haploid genomes with samtools faidx
samtools faidx landscape01-09.pg
# second compute the length of the average haploid genome 
cat landscape01-09.pg.fai |awk '{l+=$2}END{print l/NR}'
# which gives: 3.32839e+06
# thus the required number of reads is 3328390*200/200 (coverage 200; and paired end length 200)
# => 3328390

Next we simulate the Illumina paired-end reads

python read_pool-seq_illumina-PE.py --pg landscape01-09.pg --read-length 100 --inner-distance 100 --std-dev 20 --reads 3328390 --fastq1 reads_1.fastq --fastq2 reads_2.fastq

Observed TE landscape (with PoPoolationTE2)

PoPoolationTE2 can be obtained here https://sourceforge.net/projects/popoolation-te2/

For details on TE identification with PoPoolationTE2 see: https://sourceforge.net/p/popoolation-te2/wiki/Home/

Mapping

cat teseq-clean-ml100noS4.fasta chasis1M.fasta > chasis_te.fasta
bwa index chasis_te.fasta
bwa bwasw -t 6 chasis_te.fasta reads_1.fastq|samtools view -Sb - > reads_1.bam
bwa bwasw -t 6 chasis_te.fasta reads_2.fastq|samtools view -Sb - > reads_2.bam
java -jar popte2.jar se2pe --fastq1 reads_1.fastq --fastq2 reads_2.fastq --bam1 reads_1.bam --bam2 reads_2.bam --sort --output reads.sort.bam

Identifying TEs with PoPoolationTE2

java -jar popte2.jar ppileup --bam reads.sort.bam --map-qual 15 --hier tehier-ml100noS4.txt --output pp.gz
java -jar popte2.jar identifySignatures --ppileup pp.gz --mode separate --min-count 2 --output te.signatures
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar updatestrand --bam reads.sort.bam --signature te.signatures --output testrand.signatures --hier tehier-ml100noS4.txt --map-qual 15 --max-disagreement 0.4
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar frequency --ppileup pp.gz --signature testrand.signatures --output te.freqsignatures
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar filterSignatures --input te.freqsignatures --output tefiltered.freqsignatures --max-otherte-count 2 --max-structvar-count 2
java -jar /Volumes/Temp/Robert/programs/popte2/popte2.jar pairupsignatures --signature tefiltered.freqsignatures --ref-genome chasis_te.fasta --hier tehier-ml100noS4.txt --min-distance -200 --max-distance 300 --output tes.filtered.txt

Expected TE landscape

In order to compare the expected and the observed TE landscape we require a summary of the pgd-file.

Download

compute the expected TE landscape

python statistics_TEfrequencies-pooled.py --pgd landscape01-09.pgd --hier tehier-ml100noS4.txt > expected.txt

the first ten lines of expected.txt

856 HMS-Beagle  +   0.46
1640    gypsy10 +   0.46
2624    frogger +   0.16
3348    gypsy12 -   0.58
4065    Doc +   0.69
4892    looper1 -   0.34
5643    Penelope    +   0.53
6396    rooA    -   0.34
7308    R1A1-element    +   0.42
8213    mdg3    -   0.85

The four fields from left to right are: position, family, strand and population frequency of the TE insertion

compare the expected and the observed TE landscape

Download

compare the expected and the observed TE landscape

python compare-expected-observed.py --expected expected.txt --observed tes.filtered.txt --range 200 > results-exp-vs-obs.txt

This yields a comparision of the observed and expected results as in: https://sourceforge.net/projects/simulates/files/validation_pop2/results-exp-vs-obs.txt/download

Results

Either use your results-exp-vs-obs.txt or download our results https://sourceforge.net/projects/simulates/files/validation_pop2/results-exp-vs-obs.txt/download

head results-exp-vs-obs.txt gives:

FOUND   HMS-Beagle  856 844 +   +   0.46    0.53    12.0    0.07    FR
FOUND   gypsy10 1640    1614    +   +   0.46    0.444   26.0    0.016   FR
FOUND   frogger 2624    2628    +   +   0.16    0.206   4.0 0.046   FR
FOUND   gypsy12 3348    3349    -   -   0.58    0.579   1.0 0.001   FR
FOUND   Doc 4065    4064    +   +   0.69    0.681   1.0 0.009   FR
FOUND   looper1 4892    4898    -   -   0.34    0.39    6.0 0.05    FR
FOUND   Penelope    5643    5649    +   +   0.53    0.533   6.0 0.003   FR
FOUND   rooA    6396    6404    -   -   0.34    0.348   8.0 0.008   FR
FOUND   R1A1-element    7308    7293    +   +   0.42    0.37    15.0    0.05    FR
FOUND   mdg3    8213    8207    -   -   0.85    0.874   6.0 0.024   FR

The fields from left to right are: found/missed/wrong, family, expected position, observed position, expected strand, observed strand, expected population frequency, observed population frequency, delta position (position difference), delta pop. freq., FR/R/F

Summary

Here we compute a summary of the results.

Download https://sourceforge.net/projects/simulates/files/validation_pop2/summary-expobs.py/download

and compute a summary with python summary-expobs.py results-exp-vs-obs.txt:

FOUND   1000
WRONG   1
FOUND   STRAND  999
FOUND   POS_MEAN    8.544
FOUND   POS_STD 7.28093840106
FOUND   AFD_MEAN    0.030669
FOUND   AFD_STD 0.0239536518928

This tells us that:

All 1000 simulated TE insertions where identified (correct position +-200bp and family)
one additional TE insertion was found (NOF; at low frequency)
for 999 TE insertions the strand was correctly estimated
the average accuracy of the estimated TE positionis 8bp and has a standard deviation of 7bp
the average accuracy of the estimated population frequency is 3% and has a standard deviation of 2.3%

Note that the slight deviations from the expected value are most likely the outcome of i) paired-end sampling variation and ii) inaccuracies of the approache for TE identification (identifiying TE insertions from Pool-Seq data is a challenge).

Frequency distribution

In the following graph we show the correlation between the expected (simulated) and the observed population frequency (estimated with PoPooaltionTE2) of the TE insertions.

Conclusion:

We demonstrated that SimulaTE accurately simulates complex TE landscapes having for example 1000 random TE insertions.

Specificially we showed that the following key properties are accurately simulated

positions of TE insertions
strand of TE insertions
family of TE insertions
population frequencies of TE insertions
Pool-Seq reads for the TE landscape (Illumina paired end reads)

Wiki: Home
Wiki: Walkthrough
Wiki: Walkthrough_species_tool_compatibility

SimulaTE Wiki

Validation_Pop2

Introduction

Prerequisites

Validation of SimulaTE

Generate a pgd-file

Create the population genome

Simulate the reads

Observed TE landscape (with PoPoolationTE2)

Expected TE landscape

compare the expected and the observed TE landscape

Results

Summary

Frequency distribution

Conclusion:

Related