Menu

Walkthrough

Robert Kofler

Prerequisites

Note for this Walkthrough we already provide the requirements for running PoPoolationTE2, i.e. a TE-merged-reference and a TE-hierachy [Manual]. The following demonstrates how these files can be created: [WalkthroughPreparatoryWork]

Hypothesis that will be tested

In this walkthrough we aim to test two hypothesis.

  • Hypothesis1: The population from Ghana contains more TE insertions than the populations from Georgia as well as France. Previous works showed that TE activity and the extend of hybrid dysgenesis is temperature sensitive. We thus want to test the hypothesis that higher temperatures leads to higher TE activity.
  • Hypothesis2: An accord insertion next to the Cyp6g1 in D. melanogaster, confers resistance to insecticides and therefore it quickly rose in frequency in different D. melanogaster populations. It is likely that pesticides are less extensively used in Ghana than in France or Georgia. We thus want to test the hypothesis that the accord insertion, confering resistance to pesticides, has a lower population frequency in Ghana than in France and Georgia.

Identifying TE insertions with PoPoolationTE2

Please proceed with the walkthrough in the given order, since files generated in earlier steps may be required later on.

1.) Building the ppileup file

The ppileup file is the basis for all subsequent analysis and thus central for identifiying TEs with PopoolationTE2.

The reads will be mapped with the local alignment algorithm bwa bwasw. Both reads will be mapped separately to the TE-merged-reference and the paired end information will be restored subsequently with PoPoolationTE2 (se2pe)

** Map reads to the TE-combined-reference**

:::bash
# Note the sample data set consists of Pool-Seq reads for three D. melanogaster populations
# France, Georgia, Ghana
bwa index walkthrough-refgenome/2R-603-consensusTE.fasta
mkdir map
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/France_1.fq.gz >map/France_1.sam 
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/France_2.fq.gz >map/France_2.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Georgia_1.fq.gz >map/Georgia_1.sam & 
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Georgia_2.fq.gz >map/Georgia_2.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Ghana_1.fq.gz >map/Ghana_1.sam &
bwa bwasw -t 3 walkthrough-refgenome/2R-603-consensusTE.fasta walkthrough-reads/Ghana_2.fq.gz >map/Ghana_2.sam &

Restore paired-end information with PoPoolationTE2 se2pe

:::bash
# Note by using the --sort flag a sorted bam file may be obtained
# Note ~pt2 is the path to the PoPoolationTE2 folder
java -jar ~pt2/popte2.jar se2pe --fastq1 walkthrough-reads/France_1.fq.gz --fastq2 walkthrough-reads/France_2.fq.gz --bam1 map/France_1.sam --bam2 map/France_2.sam --sort --output France.sort.bam
java -jar ~pt2/popte2.jar se2pe --fastq1 walkthrough-reads/Georgia_1.fq.gz --fastq2 walkthrough-reads/Georgia_2.fq.gz --bam1 map/Georgia_1.sam --bam2 map/Georgia_2.sam --sort --output Georgia.sort.bam
java -jar ~pt2/popte2.jar se2pe --fastq1 walkthrough-reads/Ghana_1.fq.gz --fastq2 walkthrough-reads/Ghana_2.fq.gz --bam1 map/Ghana_1.sam --bam2 map/Ghana_2.sam --sort --output Ghana.sort.bam

** Generate the ppileup**

:::bash
java -jar ~pt2/popte2.jar ppileup --bam France.sort.bam --bam Georgia.sort.bam --bam Ghana.sort.bam --map-qual 15 --hier walkthrough-refgenome/dmelconsensustes.tehier --output Fra_Geo_Gha.ppileup.gz
# use the following command to browse through the ppileup file 
gzip -cd Fra_Geo_Gha.ppileup.gz|less

2.) Minimum walkthrough

The following example demonstrates the minimum pipeline for identifying TE insertions with PoPoolationTE2. More complexity will gradually be added in subsequent steps later on. We need to 1.) identify signatures of TE insertions 2.) estimate population frequencies and 3.) pair up signatures of TE insertions, yielding a final list of TE insertions.

:::bash
mkdir quick
# here we analyse TE abundance in the three populations separately
java -jar ~pt2/popte2.jar identifySignatures --ppileup Fra_Geo_Gha.ppileup.gz --mode separate --output quick/Fra_Geo_Gha.signatures --min-count 3
java -jar ~pt2/popte2.jar frequency --ppileup Fra_Geo_Gha.ppileup.gz --signature quick/Fra_Geo_Gha.signatures --output quick/Fra_Geo_Gha.freqsig 
java -jar ~pt2/popte2.jar pairupSignatures --signature quick/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output quick/Fra_Geo_Gha.teinsertions

Voila, we have identified 311 TE insertions: 108 in France, 109 in Georgia and 92 in Ghana. The coverage varies among the populations and therefore these differences in TE abundance may be an artefact of the coverage. In the next section we demonstrate how to homogenize the physcial coverage within and between samples, and thus to homogenize the power to identify TEs.

** First results **

A small sample from the resulting Fra_Geo_Gha.teinsertions file:

:::bash
2       2R      12188285        .       accord  LTR     F               0.099
3       2R      12188262        .       accord  LTR     F               0.093
1       2R      12318659        .       Tc1-2   TIR     FR              0.943
2       2R      12318674        .       Tc1-2   TIR     FR              0.968

For an explanation of the output file see [TE insertion file]

3.) Hypothesis-1: requires subsampling of the ppileup file

In this section we aim to answer the hypothesis that the population from Ghana has more TE insertions than the other two populations (for an explanation see above; section hypothesis). The physical coverage has a substantial influence on the power to identify TE insertions, where the power to identify TE insertions increases with the coverage. This may not be a dramatic problem for sequencing individuals where TE abundance quickly saturates, i.e it asymptotically approaches an upper value with increasing read numbers. However, with Pool-Seq data such an upper-limit is not readily reached (Pool-Seq data are frequently unsaturated for TEs) and therefore coverage differences may be the major reason for differences in TE abundance among samples.

To address this problem we will subsample the ppileup tracks to equal physical coverage between and within samples.

** Estimate the optimal target coverage**
However, it will first be necessary to estimate the optimal target coverage. This optimum will be a compromise between two opposing considerations. First, the target coverage should be as high as possible, as this increases the power to identify TE insertion and the accuracy of the allele frequency estimates. Second if it is too high we will loose a substantial fractions of sites. PoPoolationTE2 only proceeds with sites having sufficient coverage in all samples (>=target coverage).
To provide some assistance in identifying a suitable target coverage we will first estimate the coverage distribution in all samples

:::bash
mkdir hypo1
java -jar ~pt2/popte2.jar stat-coverage --ppileup Fra_Geo_Gha.ppileup.gz --output hypo1/coverage.statistics
less hypo1/coverage.statistics
# joint   53      15398   199798  8.61
# joint   54      17140   215196  9.27
# joint   55      17786   232336  10.01
# joint   56      18835   250122  10.77
# joint   57      19992   268957  11.59
# joint   58      21596   288949  12.45

By using a target-coveage of 55 we will loose about 10% of the sites. In this walkthrough we consider this as an acceptable compromise. For explanation of output file see [diverse output files]

** Subsample the ppileup to an uniform coverage**

:::bash
java -jar ~pt2/popte2.jar subsamplePpileup --ppileup Fra_Geo_Gha.ppileup.gz --target-coverage 55 --output hypo1/Fra_Geo_Gha.ss55.ppileup.gz

** Estimate TE abundance**

TE abundance will be estimated as shown before. The only difference, we will use a --min-count of 2, since subsampling substantially reduced the physical coverage in the file.

:::bash
java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode separate --output hypo1/Fra_Geo_Gha.signatures --min-count 2
java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature hypo1/Fra_Geo_Gha.signatures --output hypo1/Fra_Geo_Gha.freqsig 
java -jar ~pt2/popte2.jar pairupSignatures --signature hypo1/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo1/Fra_Geo_Gha.teinsertions

** Does hypothesis-1 hold?

cat hypo1/Fra_Geo_Gha.teinsertions|awk '{print $1}'|sort |uniq -c
 # 1 => 77 (France)
 # 2 => 101 (Georgia)
 # 3 => 74 (Ghana)

NO It seems the population from Ghana actually has the fewest TE insertions.

4.) Hypothesis-2: requires a joint analysis

To estimate the population frequency of a TE insertion of interest (the accord insertion next to Cyp6g1 gene) we need to perform a joint analysis.
When using --mode separate a distinct insertion (with a distinct position) is reported for every sample/population, despite a TE insertion may actually occur in multiple samples. This inprecision in the position estimates makes it difficult to identify orthologous insertions, i.e. to identify the same insertion in all samples.

For example, when using --mode separate the following three accord entries are reported. These three entries probably refer to the same TE insertion (they have similar genomic positions and family).

:::bash
less hypo1/Fra_Geo_Gha.teinsertions
# 3       2R      12185399        .       accord  LTR     FR              0.532
# 2       2R      12185369        .       accord  LTR     FR              0.769
# 1       2R      12185360        .       accord  LTR     FR              0.768

It may be quite messy and inconvenient to group insertions with similar positions. A more straight forward approach is to perform a joint analysis.

Joint analysis
Note that a window size identical in all samples/populations has to be provided for the joint analysis (e.g.: minimumSampleMedian, maximumSampleMedian, fix100).

:::bash
mkdir hypo2
java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode joint --output hypo2/Fra_Geo_Gha.signatures --min-count 2 --signature-window minimumSampleMedian
java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature hypo2/Fra_Geo_Gha.signatures --output hypo2/Fra_Geo_Gha.freqsig
java -jar ~pt2/popte2.jar pairupSignatures --signature hypo2/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output hypo2/Fra_Geo_Gha.teinsertions

Does hypothesis-2 hold?

:::bash
less hypo2/Fra_Geo_Gha.teinsertions
#1-3     2R      12185376        .       accord  LTR     FR              0.767   0.768   0.529

YES Indeed the populaton frequency of the accord insertion seems to be lower in Ghana (0.529) than in France (0.767) and Georgia (0.768).
This is also in agreement with a previous study which found that the accord insertion has a high frequency (0.85-1.0) in non-African populations and a low frequency, ranging from 0.3 to 0.55, in East African populations.
http://www.ncbi.nlm.nih.gov/pubmed/15245421

However the population frequencies of the accord insertion reported here are lower than in the previous study. This is probably because the region arround the accord insertion has a complex history of nested TE insertions and structural rearangments.
In fact upon inspection of the ppileup file we found that the accord insertion is overlapping with P-element and HMS-Beagle insertions, again in agreement with previous works http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000998. This overlap with other TE insertions may explain the low population frequency estimates of the accord insertion.

5.) Strand of TE insertions ##

As a new feature (compared to PoPoolationTE), PoPoolationTE2 allows to estimate the strand of TE insertions .

This will be done by updating the strand information in the signature file [signature file format].

:::bash
mkdir strand
java -jar ~pt2/popte2.jar identifySignatures --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --mode joint --output strand/Fra_Geo_Gha.signatures --min-count 2 --signature-window fix100 
# estimate the strand
java -jar ~pt2/popte2.jar updatestrand --bam France.sort.bam --bam Georgia.sort.bam --bam Ghana.sort.bam --signature strand/Fra_Geo_Gha.signatures --hier walkthrough-refgenome/dmelconsensustes.tehier --map-qual 15 --max-disagreement 0.5 --output strand/Fra_Geo_Gha.strand.signatures
# the remaining steps are just as shown before
java -jar ~pt2/popte2.jar frequency --ppileup hypo1/Fra_Geo_Gha.ss55.ppileup.gz --signature strand/Fra_Geo_Gha.strand.signatures --output strand/Fra_Geo_Gha.freqsig
java -jar ~pt2/popte2.jar pairupSignatures --signature strand/Fra_Geo_Gha.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output strand/Fra_Geo_Gha.teinsertions

Voila TEs with estimates of the strand:

:::bash
1-3     2R      12185374        -       accord  LTR     FR              0.765   0.764   0.525

This shows that the accord insertion next to the Cyp6g1 gene is on the minus strand.

6.) Filtering TE insertions

TE insertions that are overlapping with other TE insertions or with structural variants may not be reliable. Also the estimates of the population frequency may not be reliable. To remove such low-quality TE insertions PoPoolationTE2 allows to filter TE insertions.

:::bash
mkdir filter
java -jar ~pt2/popte2.jar filterSignatures --input hypo1/Fra_Geo_Gha.freqsig --output filter/Fra_Geo_Gha.filter.freqsig --max-otherte-count 2 --max-structvar-count 2 
# this step removes 19 insertions (from 319 to 300)
java -jar ~pt2/popte2.jar pairupSignatures --signature filter/Fra_Geo_Gha.filter.freqsig --ref-genome walkthrough-refgenome/2R-603-consensusTE.fasta --hier walkthrough-refgenome/dmelconsensustes.tehier --min-distance -200 --max-distance 300 --output filter/Fra_Geo_Gha.teinsertions

7.) Creating basic statistics

It will always be important to cross-check your results with some basic statistics.
In our opinion the most important statistics are a.) the number of reads mapping to different TE families and b.) the number of paired ends supporting TE insertions, i.e. discordantly mapped paired ends where one read maps to a reference genome and the other to a TE (these paired ends are used for generating the ppileup).

** Number of reads mapping to a TE**

:::bash
mkdir stats
java -jar ~pt2/popte2.jar stat-reads --bam France.sort.bam --hier walkthrough-refgenome/dmelconsensustes.tehier --output stats/France.readstats 
# te      pogo    265
# te      rooA    1
# te      springer        11
# te      Tirant  36
# te      hobo    147
# te      P-element       164

Thus the P-element, popo and hobo are quit abundant in the population sample from France; For an explanation of the output file see [diverse output files]

** Number of PE fragments supporting a TE insertion **

:::bash
java -jar ~pt2/popte2.jar stat-pairs --bam France.sort.bam --hier walkthrough-refgenome/dmelconsensustes.tehier --output stats/France.pairstats
# te      pogo    265
# te      Rt1b    73
# te      Doc     58
# te      Tirant  34
# te      hobo    146
# te      P-element       163

Again P-element, pogo and hobo are quite abundant. Note that our example is exceptional since we prefiltered the data for reads mapping in chromosome 2R:11Mb-13Mbp, which explains why the read-statistics and the pair-statistics are quite similar in our example. Such a high similarity would not be expected for unfiltered data! For an explanation of the output file see [diverse output files]


Related

Wiki: DriverScripts
Wiki: FAQ
Wiki: Home
Wiki: Manual
Wiki: TE insertion file
Wiki: WalkthroughPreparatoryWork
Wiki: diverse output files
Wiki: signature file format
Wiki: signatures of TE insertions