Note for this Walkthrough we already provide the requirements for running PoPoolationTE2, i.e. a TE-merged-reference and a TE-hierachy [Manual]. The following demonstrates how these files can be created: [WalkthroughPreparatoryWork]
In this walkthrough we aim to test two hypothesis.
Please proceed with the walkthrough in the given order, since files generated in earlier steps may be required later on.
The ppileup file is the basis for all subsequent analysis and thus central for identifiying TEs with PopoolationTE2.
The reads will be mapped with the local alignment algorithm bwa bwasw. Both reads will be mapped separately to the TE-merged-reference and the paired end information will be restored subsequently with PoPoolationTE2 (se2pe)
** Map reads to the TE-combined-reference**
:::bash
# Note the sample data set consists of Pool-Seq reads for three D. melanogaster populations
# France, Georgia, Ghana
bwa index walkthrough-refgenome/2R-603-consensusTE.fasta
mkdir map
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/France_1.fq.gz >map/France_1.sam
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/France_2.fq.gz >map/France_2.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Georgia_1.fq.gz >map/Georgia_1.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Georgia_2.fq.gz >map/Georgia_2.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Ghana_1.fq.gz >map/Ghana_1.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Ghana_2.fq.gz >map/Ghana_2.sam &
Restore paired-end information with PoPoolationTE2 se2pe
:::bash
# Note by using the --sort flag a sorted bam file may be obtained
# Note ~pt2 is the path to the PoPoolationTE2 folder
java -jar ~pt2/popte2.jar se2pe --fastq1 walkthrough-reads/France_1.fq.gz --fastq2 walkthrough-reads/France_2.fq.gz --bam1 map/France_1.sam --bam2 map/France_2.sam --sort --output France.sort.bam
java -jar ~pt2/popte2.jar se2pe --fastq1 walkthrough-reads/Georgia_1.fq.gz --fastq2 walkthrough-reads/Georgia_2.fq.gz --bam1 map/Georgia_1.sam --bam2 map/Georgia_2.sam --sort --output Georgia.sort.bam
java -jar ~pt2/popte2.jar se2pe --fastq1 walkthrough-reads/Ghana_1.fq.gz --fastq2 walkthrough-reads/Ghana_2.fq.gz --bam1 map/Ghana_1.sam --bam2 map/Ghana_2.sam --sort --output Ghana.sort.bam
** Generate the ppileup**
:::bash
java -jar ~pt2/popte2.jar ppileup --bam France.sort.bam --bam Georgia.sort.bam --bam Ghana.sort.bam --map-qual 15 --hier walkthrough-refgenome/dmelconsensustes.tehier --output Fra_Geo_Gha.ppileup.gz
# use the following command to browse through the ppileup file
gzip -cd Fra_Geo_Gha.ppileup.gz|less
The following example demonstrates the minimum pipeline for identifying TE insertions with PoPoolationTE2. More complexity will gradually be added in subsequent steps later on. We need to 1.) identify signatures of TE insertions 2.) estimate population frequencies and 3.) pair up signatures of TE insertions, yielding a final list of TE insertions.
:::bash
mkdir quick
# here we analyse TE abundance in the three populations separately
java -jar ~pt2/popte2.jar identifySignatures --ppileup Fra_Geo_Gha.ppileup.gz --mode separate --output quick/Fra_Geo_Gha.signatures --min-count 3
java -jar ~pt2/popte2.jar frequency --ppileup Fra_Geo_Gha.ppileup.gz --signature quick/Fra_Geo_Gha.signatures --output quick/Fra_Geo_Gha.freqsig
java -jar ~pt2/popte2.jar pairupSignatures --signature quick/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output quick/Fra_Geo_Gha.teinsertions
Voila, we have identified 311 TE insertions: 108 in France, 109 in Georgia and 92 in Ghana. The coverage varies among the populations and therefore these differences in TE abundance may be an artefact of the coverage. In the next section we demonstrate how to homogenize the physcial coverage within and between samples, and thus to homogenize the power to identify TEs.
** First results **
A small sample from the resulting Fra_Geo_Gha.teinsertions file:
:::bash
2 2R 12188285 . accord LTR F 0.099
3 2R 12188262 . accord LTR F 0.093
1 2R 12318659 . Tc1-2 TIR FR 0.943
2 2R 12318674 . Tc1-2 TIR FR 0.968
For an explanation of the output file see [TE insertion file]
In this section we aim to answer the hypothesis that the population from Ghana has more TE insertions than the other two populations (for an explanation see above; section hypothesis). The physical coverage has a substantial influence on the power to identify TE insertions, where the power to identify TE insertions increases with the coverage. This may not be a dramatic problem for sequencing individuals where TE abundance quickly saturates, i.e it asymptotically approaches an upper value with increasing read numbers. However, with Pool-Seq data such an upper-limit is not readily reached (Pool-Seq data are frequently unsaturated for TEs) and therefore coverage differences may be the major reason for differences in TE abundance among samples.
To address this problem we will subsample the ppileup tracks to equal physical coverage between and within samples.
** Estimate the optimal target coverage**
However, it will first be necessary to estimate the optimal target coverage. This optimum will be a compromise between two opposing considerations. First, the target coverage should be as high as possible, as this increases the power to identify TE insertion and the accuracy of the allele frequency estimates. Second if it is too high we will loose a substantial fractions of sites. PoPoolationTE2 only proceeds with sites having sufficient coverage in all samples (>=target coverage).
To provide some assistance in identifying a suitable target coverage we will first estimate the coverage distribution in all samples
:::bash
mkdir hypo1
java -jar ~pt2/popte2.jar stat-coverage --ppileup Fra_Geo_Gha.ppileup.gz --output hypo1/coverage.statistics
less hypo1/coverage.statistics
# joint 53 15398 199798 8.61
# joint 54 17140 215196 9.27
# joint 55 17786 232336 10.01
# joint 56 18835 250122 10.77
# joint 57 19992 268957 11.59
# joint 58 21596 288949 12.45
By using a target-coveage of 55 we will loose about 10% of the sites. In this walkthrough we consider this as an acceptable compromise. For explanation of output file see [diverse output files]
** Subsample the ppileup to an uniform coverage**
:::bash
java -jar ~pt2/popte2.jar subsamplePpileup --ppileup Fra_Geo_Gha.ppileup.gz --target-coverage 55 --output hypo1/Fra_Geo_Gha.ss55.ppileup.gz
** Estimate TE abundance**
TE abundance will be estimated as shown before. The only difference, we will use a --min-count of 2, since subsampling substantially reduced the physical coverage in the file.
:::bash
java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode separate --output hypo1/Fra_Geo_Gha.signatures --min-count 2
java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature hypo1/Fra_Geo_Gha.signatures --output hypo1/Fra_Geo_Gha.freqsig
java -jar ~pt2/popte2.jar pairupSignatures --signature hypo1/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo1/Fra_Geo_Gha.teinsertions
** Does hypothesis-1 hold?
cat hypo1/Fra_Geo_Gha.teinsertions|awk '{print $1}'|sort |uniq -c
# 1 => 77 (France)
# 2 => 101 (Georgia)
# 3 => 74 (Ghana)
NO It seems the population from Ghana actually has the fewest TE insertions.
To estimate the population frequency of a TE insertion of interest (the accord insertion next to Cyp6g1 gene) we need to perform a joint analysis.
When using --mode separate a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inprecision in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples.
For example, when using --mode separate the following three accord entries are reported. These three entries probably refer to the same TE insertion (they have similar genomic positions and family).
:::bash
less hypo1/Fra_Geo_Gha.teinsertions
# 3 2R 12185399 . accord LTR FR 0.532
# 2 2R 12185369 . accord LTR FR 0.769
# 1 2R 12185360 . accord LTR FR 0.768
It may be quite messy and inconvenient to group insertions with similar positions. A more straight forward approach is to perform a joint analysis.
Joint analysis
Note that a window size identical in all samples/populations has to be provided for the joint analysis (e.g.: minimumSampleMedian, maximumSampleMedian, fix100).
:::bash
mkdir hypo2
java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode joint --output hypo2/Fra_Geo_Gha.signatures --min-count 2 --signature-window minimumSampleMedian
java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature hypo2/Fra_Geo_Gha.signatures --output hypo2/Fra_Geo_Gha.freqsig
java -jar ~pt2/popte2.jar pairupSignatures --signature hypo2/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo2/Fra_Geo_Gha.teinsertions
Does hypothesis-2 hold?
:::bash
less hypo2/Fra_Geo_Gha.teinsertions
#1-3 2R 12185376 . accord LTR FR 0.767 0.768 0.529
YES Indeed the populaton frequency of the accord insertion seems to be lower in Ghana (0.529) than in France (0.767) and Georgia (0.768).
This is also in agreement with a previous study which found that the accord insertion has a high frequency (0.85-1.0) in non-African populations and a low frequency, ranging from 0.3 to 0.55, in East African populations.
http://www.ncbi.nlm.nih.gov/pubmed/15245421
However the population frequencies of the accord insertion reported here are lower than in the previous study. This is probably because the region arround the accord insertion has a complex history of nested TE insertions and structural rearangments.
In fact upon inspection of the ppileup file we found that the accord insertion is overlapping with P-element and HMS-Beagle insertions, again in agreement with previous works http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000998. This overlap with other TE insertions may explain the low population frequency estimates of the accord insertion.
As a new feature (compared to PoPoolationTE), PoPoolationTE2 allows to estimate the strand of TE insertions .
This will be done by updating the strand information in the signature file [signature file format].
:::bash
mkdir strand
java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode joint --output strand/Fra_Geo_Gha.signatures --min-count 2 --signature-window fix100
# estimate the strand
java -jar ~pt2/popte2.jar updatestrand --bam France.sort.bam --bam Georgia.sort.bam --bam Ghana.sort.bam --signature strand/Fra_Geo_Gha.signatures --hier walkthrough-refgenome/dmelconsensustes.tehier --map-qual 15 --max-disagreement 0.5 --output strand/Fra_Geo_Gha.strand.signatures
# the remaining steps are just as shown before
java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature strand/Fra_Geo_Gha.strand.signatures --output strand/Fra_Geo_Gha.freqsig
java -jar ~pt2/popte2.jar pairupSignatures --signature strand/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output strand/Fra_Geo_Gha.teinsertions
Voila TEs with estimates of the strand:
:::bash
1-3 2R 12185374 - accord LTR FR 0.765 0.764 0.525
This shows that the accord insertion next to the Cyp6g1 gene is on the minus strand.
TE insertions that are overlapping with other TE insertions or with structural variants may not be reliable. Also the estimates of the population frequency may not be reliable. To remove such low-quality TE insertions PoPoolationTE2 allows to filter TE insertions.
:::bash
mkdir filter
java -jar ~pt2/popte2.jar filterSignatures --input hypo1/Fra_Geo_Gha.freqsig --output filter/Fra_Geo_Gha.filter.freqsig --max-otherte-count 2 --max-structvar-count 2
# this step removes 19 insertions (from 319 to 300)
java -jar ~pt2/popte2.jar pairupSignatures --signature filter/Fra_Geo_Gha.filter.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output filter/Fra_Geo_Gha.teinsertions
It will always be important to cross-check your results with some basic statistics.
In our opinion the most important statistics are a.) the number of reads mapping to different TE families and b.) the number of paired ends supporting TE insertions, i.e. discordantly mapped paired ends where one read maps to a reference genome and the other to a TE (these paired ends are used for generating the ppileup).
** Number of reads mapping to a TE**
:::bash
mkdir stats
java -jar ~pt2/popte2.jar stat-reads --bam France.sort.bam --hier walkthrough-refgenome/dmelconsensustes.tehier --output stats/France.readstats
# te pogo 265
# te rooA 1
# te springer 11
# te Tirant 36
# te hobo 147
# te P-element 164
Thus the P-element, popo and hobo are quit abundant in the population sample from France; For an explanation of the output file see [diverse output files]
** Number of PE fragments supporting a TE insertion **
:::bash
java -jar ~pt2/popte2.jar stat-pairs --bam France.sort.bam --hier walkthrough-refgenome/dmelconsensustes.tehier --output stats/France.pairstats
# te pogo 265
# te Rt1b 73
# te Doc 58
# te Tirant 34
# te hobo 146
# te P-element 163
Again P-element, pogo and hobo are quite abundant. Note that our example is exceptional since we prefiltered the data for reads mapping in chromosome 2R:11Mb-13Mbp, which explains why the read-statistics and the pair-statistics are quite similar in our example. Such a high similarity would not be expected for unfiltered data! For an explanation of the output file see [diverse output files]
Wiki: DriverScripts
Wiki: FAQ
Wiki: Home
Wiki: Manual
Wiki: TE insertion file
Wiki: WalkthroughPreparatoryWork
Wiki: diverse output files
Wiki: signature file format
Wiki: signatures of TE insertions