PoPoolation TE2 Wiki

Brought to you by: rokofler

Manual

Attachments

popte2_flow.png (17303 bytes)

Prerequisites

- Java
- a short-read mapper such as BWA-SW (we recommend a local alignment algorithm)
- a modified reference genome, the TE-merged-reference (see below)
- a TE hierarchy (see below)
- paired-end reads for at least one sample; samples may be pooled populations, tissues or sequenced individuals

Preparatory work

First, a TE-merged-reference and a TE-hierarchy need to be created. Next, the paired-end reads need to be mapped to the TE-merged-reference.

TE-merged-reference

The TE-merged-reference consists of i) the repeat-masked reference genome and ii) TE sequences. The TE sequences may be consensus sequences of TE families (e.g. from RepBase), the sequences that have been masked in the reference genome, or both.
The TE-merged-reference is in fasta format.
An example:

>2L
ATCATCNNNNNNATCATCGGCC....
>2R
GCAGAGNNNNNNNTCATGCGAC....
>roo_from_2L
AGCAGC
>roo_from_2R 
AGGAGC
>roo_consensus 
TGCAGC

Here, 2L and 2R are reference chromosomes. roo_from_2L is the sequence of the roo transposable element that was masked in 2L (NNNNNN in 2L), roo_from_2R is the sequence of roo in 2R and roo_consensus is the consensus sequence of roo.
PopTE2 needs to distinguish TE sequences from reference chromosomes. This is accomplished by using the TE-hierarchy (see below). Every sequence in the fasta file with a corresponding entry in the TE-hierarchy is considered a TE, while every sequence without an entry is considered a reference chromosome.

Note the base substitutions in the roo sequences. PoPoolation TE2 allows multiple sequences to be provided for every TE family, so that diverged TE copies may also be identified. The hierarchy (see below) assigns these different sequences to one family (roo).
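
The classification rule described above can be checked before mapping. The following is a minimal sketch and not part of PoPoolationTE2: it assumes a whitespace-separated hierarchy file with one header line and the sequence id in the first column (as in the example in the next section). The function name `classify_sequences` is made up for illustration.

```bash
# classify_sequences FASTA HIERARCHY
# Prints "<sequence id><TAB><TE|reference>" for every fasta header.
# A sequence counts as TE if its id appears in column 1 of the hierarchy.
classify_sequences() {
  awk '
    # first file: the hierarchy; skip the header line and remember the ids
    NR == FNR { if (FNR > 1) te[$1] = 1; next }
    # second file: the fasta; strip the leading ">" from each header
    /^>/ {
      id = substr($1, 2)
      type = (id in te) ? "TE" : "reference"
      print id "\t" type
    }
  ' "$2" "$1"
}
```

Calling `classify_sequences te-merged-reference.fasta te-hierarchy.txt` (both file names are placeholders) lists every sequence with its type, which makes it easy to spot a TE sequence that was forgotten in the hierarchy and would therefore be treated as a reference chromosome.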

TE hierarchy

The TE hierarchy serves two purposes. First, it allows TE sequences to be distinguished from reference chromosomes (see above), and second, it allows multiple slightly diverged sequences to be assigned to one family.
Some TE families have highly diverged copies (e.g. INE-1 in Drosophila, with up to 10% sequence divergence) and this feature ensures that even highly diverged copies can be identified.
Based on the hierarchy, all reads mapping to any of these diverged sequences are recognized as mapping to the same family. Using the above example, any read mapping to roo_from_2L, roo_from_2R or roo_consensus is treated as mapping to the roo family, provided the following hierarchy is used.

id            family    order
roo_from_2L   roo       LTR
roo_from_2R   roo       LTR
roo_consensus roo       LTR
gypsy_a       gypsy     LTR
gypsy_b       gypsy     LTR
copia         copia     LTR
P-element     P-element TIR
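
To illustrate how the hierarchy groups sequences into families, here is a small sketch that sums mapped reads per family. It is only an illustration of the id-to-family lookup, not how PoPoolationTE2 works internally. It assumes the hierarchy has a header line with id and family in columns 1 and 2, and it reads the tab-separated output of `samtools idxstats` (sequence name, length, mapped reads, unmapped reads) on standard input. The function name `family_read_counts` is hypothetical.

```bash
# family_read_counts HIERARCHY < idxstats.tsv
# Sums the mapped reads (column 3 of `samtools idxstats`) per TE family.
family_read_counts() {
  awk '
    # first file: the hierarchy, mapping sequence id -> family
    NR == FNR { if (FNR > 1) family[$1] = $2; next }
    # standard input: one idxstats line per sequence; keep TE sequences only
    ($1 in family) { count[family[$1]] += $3 }
    END { for (f in count) print f "\t" count[f] }
  ' "$1" - | sort
}
```

A possible call is `samtools idxstats sample1.sort.bam | family_read_counts te-hierarchy.txt` (file names are placeholders). Reads on roo_from_2L, roo_from_2R and roo_consensus all end up in the single row for roo.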

The following walkthrough demonstrates how the TE-merged-reference and the TE hierarchy can be generated:

[WalkthroughPreparatoryWork]

Mapping PE reads to TE-merged-reference

We recommend using a local alignment algorithm (bwa sw, bwa mem, bowtie --local) for mapping reads to the TE-merged-reference. PoPoolationTE2 requires a sorted bam file as input. Note that a separate bam file is required for every sample (read groups are not supported). If you use bwa mem it is important that you provide the -M option, which ensures that secondary alignments are marked as such.

The following walkthrough shows how these bam files may be generated [Walkthrough]
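
As a rough sketch of the "one sorted bam file per sample" requirement, the helper below prints one mapping command per sample instead of running it, so the commands can be reviewed first. It assumes bwa and samtools (1.3 or later, for `sort -o`) are installed, that the reference was indexed with `bwa index`, and that the reads are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. This naming scheme and the function name `mapping_commands` are assumptions made for the example.

```bash
# mapping_commands REFERENCE SAMPLE...
# Prints one mapping command per sample; review the output, then pipe it to `bash`.
mapping_commands() {
  local ref="$1" sample
  shift
  for sample in "$@"; do
    # bwa mem -M marks shorter split hits as secondary;
    # samtools sort writes the coordinate-sorted bam file
    printf 'bwa mem -M %s %s_1.fastq.gz %s_2.fastq.gz | samtools sort -o %s.sort.bam -\n' \
      "$ref" "$sample" "$sample" "$sample"
  done
}
```

For example, `mapping_commands te-merged-reference.fasta sample1 sample2 | bash` would produce sample1.sort.bam and sample2.sort.bam.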

First steps with PoPoolationTE2

Download

PoPoolationTE2 is available as Java jar file for download here: https://sourceforge.net/projects/popoolation-te2/
Since it is implemented in Java it can be run on most operating systems including Windows, Mac OS X, and Linux.

Run PoPoolationTE2

PoPoolationTE2 supports various tasks. Display all available tasks by starting PopTE2 without providing any parameters

:::bash
java  -jar popte2.jar

Then run any subtask by providing the name of the task as first argument. For example, to display the version number

:::bash
java  -jar popte2.jar version

List of supported tasks

PoPoolationTE2 supports several subtasks. The name of the subtask needs to be provided as first parameter. We distinguish Main tasks (necessary for an unbiased comparison of TE abundance) and Secondary tasks (helpful, but not essential).

Main tasks
- ppileup Generate a ppileup file
- subsamplePpileup subsample ppileup files to a uniform coverage
- identifySignatures identify signatures of TE insertions
- frequency estimate population frequencies for signatures
- filterSignatures filter signatures of TE insertions
- pairupSignatures pair up signatures of TE insertions to obtain TE insertions
Secondary tasks
- se2pe restore paired-end information for reads that were mapped individually (e.g. bwasw output files)
- updateStrand estimate strand of signatures of TE insertions
- stat-coverage calculate physical coverage statistics; helps to decide optimal target coverage for subsampling
- stat-reads compute the mapping statistics; statistics about reads mapping to different reference chromosomes and TEs
- stat-pairs compute the paired end statistics; statistics about reads supporting a TE insertion
- version print the version number

Workflow

Here is an overview of the workflow for using PoPoolationTE2. Mandatory steps are shown with a solid line and optional steps with a dashed line. For example, one input file (.bam) is required but additional ones may be provided. Files are shown in elliptic frames and steps performed with PoPoolationTE2 in rectangular frames.
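
The main tasks can be chained in a small script. The sketch below follows the order of the task list above and uses only the minimum parameter calls documented in the following sections. The wrapper functions, the `DRY_RUN` switch, the `POPTE2_JAR` variable and all file names are assumptions made for this example. subsamplePpileup is left out for brevity; it would sit between ppileup and identifySignatures and needs a `--target-coverage` chosen for the data.

```bash
POPTE2_JAR="${POPTE2_JAR:-popte2.jar}"   # path to the jar file; adjust as needed

# Run one PoPoolationTE2 task, or only print the command if DRY_RUN is set.
popte2() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "java -jar $POPTE2_JAR $*"
  else
    java -jar "$POPTE2_JAR" "$@"
  fi
}

# run_main_tasks HIERARCHY TE_MERGED_REFERENCE PREFIX BAM...
run_main_tasks() {
  local hier="$1" ref="$2" prefix="$3" bam
  shift 3
  local bam_args=()
  for bam in "$@"; do bam_args+=(--bam "$bam"); done   # one --bam per sample

  # each step only runs if the previous one succeeded
  popte2 ppileup "${bam_args[@]}" --map-qual 15 --hier "$hier" --output "$prefix.ppileup.gz" &&
  popte2 identifySignatures --ppileup "$prefix.ppileup.gz" --mode separate --output "$prefix.signatures" --min-count 2 &&
  popte2 frequency --ppileup "$prefix.ppileup.gz" --signature "$prefix.signatures" --output "$prefix.freqsignatures" &&
  popte2 filterSignatures --input "$prefix.freqsignatures" --output "$prefix.filtered.freqsignatures" &&
  popte2 pairupSignatures --signature "$prefix.filtered.freqsignatures" --ref-genome "$ref" --hier "$hier" --output "$prefix.teinsertions.txt"
}
```

Running `DRY_RUN=1 run_main_tasks te-hierarchy.txt te-merged-reference.fasta result sample1.sort.bam sample2.sort.bam` prints the five commands without executing them, which is a convenient way to check the file names before starting a long run.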

Main tasks

ppileup

This step generates a ppileup (physical pileup) file for one or multiple samples [ppileup file]

:::bash
# minimum parameter call
java -jar popte2.jar ppileup --bam input1.bam --map-qual 15 --hier te-hierarchy.txt --output output.ppileup

Parameters

--bam: a bam file; Illumina paired-end reads mapped to a TE-merged-reference (see above). At least one bam file must be provided
--map-qual: the minimum mapping quality for reads mapping to a reference chromosome. Reference chromosomes are recognized as sequences without a corresponding entry in the TE-hierarchy (see above). Note that this restriction does not apply to reads mapping to TE sequences. For such reads a low mapping quality is in fact expected, especially if several slightly diverged sequences are provided for a TE family.
--hier: the TE hierarchy (see above)
--output: the output, which will be a ppileup file; this innovation introduced with PoPoolationTE2 facilitates an unbiased comparison of TE abundance between samples/populations [ppileup file]
--te-shortcuts: a list of shortcuts for TEs. Per default PoPoolationTE2 computes a shortcut for every TE family present in the hierarchy. This shortcut is then used in the ppileup file. However, a list of shortcuts may also be provided by the user. Such a custom list needs to meet the following criteria: a.) a shortcut has to be provided for every family in the TE hierarchy. b.) shortcuts must be unique, i.e. no shortcut may be used for two families. Shortcuts are case insensitive. c.) shortcuts must have distinct uppercase and lowercase values. For example '4a' is a valid shortcut (4a != 4A) but '4' is not (4 = 4).
--dissable-zipped: per default the output is a gzipped ppileup file; by providing this option zipped output may be disabled
--sr-min-dist: minimum distance between paired-end reads to count as a structural rearrangement. The inner distance between paired-end reads is subject to stochastic variation. However, distances exceeding --sr-min-dist will not be treated as stochastic variation, but rather as structural variation (e.g. inversions, rearrangements). Note: reads mapping to distinct reference chromosomes are always treated as structural variations (e.g. translocations).
--id-up-quant: upper quantile of the inner distance; if for example set to 0.01, the 1% of paired-end reads with the most extreme inner distance will be ignored. This step is performed after applying --sr-min-dist
--homogenize-pairs: use an identical number of mapped paired ends for all samples, i.e. this option homogenizes the number of mapped paired ends. The algorithm first counts the number of informative pairs in all bam files (i.e. pairs supporting a TE, proper pairs, pairs supporting structural variants), then identifies the smallest number of informative pairs among the samples and finally samples the informative pairs in all bam files (on the fly) down to this smallest number. The same number of paired ends will thus be used in each sample for generating the ppileup track (introduced with v1.08.02)
--detailed-log: provide more detailed log messages
--help: show help
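
The criteria b.) and c.) for custom --te-shortcuts can be checked with a few lines of shell. This is a sketch under the assumption that the shortcuts are passed as plain arguments; it does not read the shortcut list file itself, and criterion a.) is not checked because it requires the hierarchy. The function name `check_shortcuts` is hypothetical.

```bash
# check_shortcuts SHORTCUT...
# Prints "ok" and returns 0 if the shortcuts satisfy criteria b.) and c.).
check_shortcuts() {
  local seen=" " s upper lower
  for s in "$@"; do
    upper=$(printf '%s' "$s" | tr '[:lower:]' '[:upper:]')
    lower=$(printf '%s' "$s" | tr '[:upper:]' '[:lower:]')
    # c.) upper- and lowercase forms must differ, so "4" is rejected
    if [ "$upper" = "$lower" ]; then
      echo "no distinct upper/lowercase form: $s"
      return 1
    fi
    # b.) unique when compared case-insensitively, so "ro" and "RO" clash
    case "$seen" in
      *" $lower "*) echo "duplicate shortcut: $s"; return 1 ;;
    esac
    seen="$seen$lower "
  done
  echo "ok"
}
```

`check_shortcuts 4a ro gy` prints ok, whereas `check_shortcuts 4` and `check_shortcuts ro RO` report the offending shortcut and return a non-zero status.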

subsamplePpileup

This step subsamples a ppileup file to a uniform coverage, thus homogenizing the power to identify TE insertions within as well as between samples/populations, which in turn enables an unbiased comparison of TE abundance.

:::bash
# minimum parameter call
java -jar popte2.jar subsamplePpileup --ppileup input.ppileup.gz --target-coverage 100 --output output.ss100.ppileup.gz

Parameters

--ppileup: a physical pileup file; mandatory
--output: a physical pileup file; will be zipped per default; mandatory
--target-coverage: subsample the coverage in all populations and at all sites to the given value; note that sites with insufficient coverage in ANY sample/population will be ignored; mandatory
--dissable-zipped: per default the output file is zipped; unzipped output may be obtained by providing this option
--with-replace: sample with replacement instead of the default, sampling without replacement; we recommend the default (introduced with v1.08.02)
--detailed-log: mostly for troubleshooting; more detailed output can be obtained.
--help: show a help message

Detail: during subsampling, the physical coverage from the forward and the reverse direction is treated separately. Thus, for every genomic site two subsampling steps are actually performed, one for the forward coverage and one for the reverse coverage. This may result in a slightly different [ppileup file].

identifySignatures

This step identifies signatures of TE insertions from the ppileup file, as explained here [signatures of TE insertions]. Signatures will be reported in the [signature file format]

:::bash 
# minimum parameter call
java -jar popte2.jar identifySignatures --ppileup input.ppileup.gz --mode separate --output output.signatures --min-count 2

Parameters

--ppileup: a physical pileup file [ppileup file]
--output: a signature file [signature file format]
--mode: (separate | joint), PopTE2 can identify signatures of TE insertions with two different algorithms. With the separate algorithm, TEs are identified in each sample separately, independent of the other samples. With the joint algorithm the ppileup tracks of all samples are merged (internally only) and signatures are identified from this merged ppileup track. For illustrated explanations see [signatures of TE insertions]; for more explanation of the two different modes see [signature modes]
--min-count: the minimum average physical coverage in the window for identifying a signature of TE insertions; for details see [signatures of TE insertions]
--signature-window (fixNNNN | minimumSampleMedian | maximumSampleMedian | median): signatures of TE insertions are identified using a window-based approach (see [signatures of TE insertions]). The window size may be specified with this parameter. With the default, 'median', the median of the inner distance is used for each sample, so every sample may have a different window size. With the other three options (fixNNNN, minimumSampleMedian, maximumSampleMedian) an identical window size will be used for all samples/populations. fixNNNN allows the user to provide a fixed custom window size (e.g. fix120 for a window size of 120); with maximumSampleMedian the maximum median is used for all samples/populations and with minimumSampleMedian the minimum
--min-valley (fixNNNN | minimumSampleMedian | maximumSampleMedian | median): the minimum size of the valley between two consecutive TE insertions; the average coverage of the valley needs to be lower than --min-count; for an illustrated explanation see [signatures of TE insertions]; default=the same as --signature-window
--chunk-distance: to avoid excessive memory consumption by loading ppileup tracks for entire chromosomes, PoPoolationTE2 processes the ppileup track in chunks. If TE support lower than --min-count is found for --chunk-distance multiplied by the median insert size, PoPoolationTE2 proceeds with a new chunk. default=5
--detailed-log: show a detailed log message
--help: show help
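
The four --signature-window modes can be summarized with a small sketch that derives the window size of each sample from the median inner distance of each sample. It is only an illustration of the rule described above, not the actual implementation; it assumes integer medians, and the function name `window_sizes` is made up.

```bash
# window_sizes MODE MEDIAN...
# Prints the window size used for each sample, given the median inner
# distance of each sample (integers) and a --signature-window mode.
window_sizes() {
  local mode="$1" size m
  shift
  case "$mode" in
    median)                 # every sample keeps its own median
      printf '%s\n' "$@"
      return ;;
    fix[0-9]*)              # fixNNNN: user-defined size for all samples
      size="${mode#fix}" ;;
    minimumSampleMedian|maximumSampleMedian)
      size="$1"
      for m in "$@"; do
        if [ "$mode" = minimumSampleMedian ] && [ "$m" -lt "$size" ]; then size="$m"; fi
        if [ "$mode" = maximumSampleMedian ] && [ "$m" -gt "$size" ]; then size="$m"; fi
      done ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1 ;;
  esac
  for m in "$@"; do echo "$size"; done   # identical window size for all samples
}
```

With medians of 100, 120 and 90, `window_sizes median 100 120 90` keeps the three values, while `window_sizes minimumSampleMedian 100 120 90` prints 90 for every sample and `window_sizes fix120 100 120 90` prints 120 for every sample.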

frequency

This step estimates the population frequency of TE insertions and rearrangements [estimate frequency]

:::bash
# minimum parameter call
java -jar popte2.jar frequency --ppileup input.ppileup.gz --signature input.signatures --output output.freqsignatures

Parameters

--ppileup: a physical pileup file [ppileup file]
--signature: a signature file [signature file format]
--output: a signature file with frequency estimates [signature file format]
--detailed-log: show more detailed log messages
--help: display help

filterSignatures

This step filters signatures. For example, signatures overlapping with other TE insertions or with structural variants may be removed.

:::bash 
# minimum parameter call
java -jar popte2.jar filterSignatures --input tofilter.signatures --output filtered.signatures

Parameters

--input: the signatures to filter; mandatory [signature file format]
--output: the filtered signatures; mandatory [signature file format]
--min-coverage: the minimum average coverage; all samples need to meet this requirement; default=0
--max-coverage: the maximum average coverage; all samples need to meet this requirement; default=infinite
--min-count: the minimum average count of the given TE; at least one sample needs to meet this requirement; default=0
--max-otherte-count: the maximum allowed average count of other TEs; all samples need to meet this requirement; default=infinite
--max-structvar-count: the maximum allowed average count of structural variants (rearrangements); all samples need to meet this requirement; default=infinite
--min-fraction: the minimum required frequency of the TE; only entries of the same family and signature direction (forward or reverse) are considered; at least one sample needs to meet this requirement; default=0.0
--max-otherte-fraction: the maximum allowed frequency of other TEs; all samples need to meet this requirement; default=1.0
--max-structvar-fraction: the maximum allowed frequency of structural variants; all samples need to meet this requirement; default=1.0
--help: display a help message

Note: We strongly recommend filtering overlapping TE insertions, as their frequency estimates may not be reliable (--max-otherte-count)
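
Note the two kinds of requirements in the parameter list: minimum thresholds must be met by at least one sample, maximum thresholds by all samples. The sketch below illustrates this for --min-count and --max-otherte-count. The input format (one `<TE count>:<other-TE count>` argument per sample), the inclusive comparisons and the function name `keep_signature` are assumptions for this example and do not reflect the signature file format.

```bash
# keep_signature MIN_COUNT MAX_OTHERTE_COUNT SAMPLE...
# Each SAMPLE is "<TE count>:<other-TE count>" (hypothetical input format).
# Succeeds if at least one sample reaches MIN_COUNT and
# every sample stays at or below MAX_OTHERTE_COUNT.
keep_signature() {
  local min_count="$1" max_other="$2"
  shift 2
  printf '%s\n' "$@" | awk -F: -v min="$min_count" -v max="$max_other" '
    # --min-count: a single sample meeting the minimum is enough
    $1 + 0 >= min + 0 { any_ok = 1 }
    # --max-otherte-count: one sample above the maximum rejects the signature
    $2 + 0 > max + 0  { other_exceeded = 1 }
    END {
      keep = (any_ok && !other_exceeded)
      exit keep ? 0 : 1
    }
  '
}
```

For example, `keep_signature 2 1 3.5:0 0:1` succeeds, whereas `keep_signature 2 1 3.5:0 0:2` fails because the second sample exceeds the allowed count of other TEs.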

pairupSignatures

This step pairs matching signatures of TE insertions, generating the final result, a list of TE insertions.

:::bash
# minimum parameter call
java -jar popte2.jar pairupSignatures --signature topair.signatures --ref-genome temerged-reference.fasta --hier tehier.txt --output teinsertions.txt

Parameters

--signature: signatures of TE insertions that ought to be paired; mandatory
--ref-genome: the TE-merged-reference used for mapping the reads; this is necessary as PoPoolationTE2 computes the distance between signatures of TE insertions, but poly-N tracts should not be considered (otherwise we would bias against reference insertions); mandatory
--hier: the TE hierarchy; mandatory
--output: the final result, a list of TE insertions [TE insertion file]; mandatory
--min-distance: the minimum distance between valid pairs of signatures; distance is always computed as position-forward-signature minus position-reverse-signature, hence negative values are possible; default=-100
--max-distance: the maximum distance between valid pairs of signatures; default=500
--max-freq-diff: the maximum frequency difference between valid pairs of signatures; Applies to all pairs of samples (e.g forward-sample1 vs reverse-sample1 and forward-sample2 vs reverse-sample2; but NOT forward-sample1 vs reverse-sample2); default=1.0
--detailed-log: show more detailed logging messages
--help: display help
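
The distance criterion can be written down in a few lines. This sketch takes the definition given above literally (position of the forward signature minus position of the reverse signature) and assumes inclusive bounds; it does not subtract poly-N tracts as PoPoolationTE2 does with the help of --ref-genome. The function name `valid_distance` is hypothetical.

```bash
# valid_distance FORWARD_POS REVERSE_POS [MIN_DISTANCE] [MAX_DISTANCE]
# Succeeds if the distance lies within [MIN_DISTANCE, MAX_DISTANCE];
# the defaults -100 and 500 are those of --min-distance and --max-distance.
valid_distance() {
  local fwd="$1" rev="$2" min="${3:--100}" max="${4:-500}"
  local distance=$(( fwd - rev ))   # forward minus reverse, may be negative
  [ "$distance" -ge "$min" ] && [ "$distance" -le "$max" ]
}
```

With the defaults, `valid_distance 950 1000` succeeds (distance -50) while `valid_distance 800 1000` fails (distance -200).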

Secondary tasks

se2pe

This step restores paired-end information for separately mapped reads. For example, if read_1.fastq and read_2.fastq were mapped separately with bwa bwasw, this subtask generates a merged bam file with paired-end information (e.g. the flags will be set properly, and the position of the mates will be updated).

:::bash
# minimum parameter call
java -jar popte2.jar se2pe --fastq1 read_1.fastq --fastq2 read_2.fastq --bam1 read_1.bam --bam2 read_2.bam --output paired-end.bam

Parameters

--fastq1: the first fastq read; may be zipped; mandatory
--fastq2: the second fastq read; may be zipped; mandatory
--bam1: the mapping result for the first read; may be sam or bam (not sorted!); mandatory
--bam2: the mapping result for the second read; may be sam or bam (not sorted!); mandatory
--output: the mapping result for both reads, with paired end information restored (e.g. the flags properly set, and the position of the mates updated); may be sam or bam; mandatory
--sort: set this flag for obtaining a sorted output file; PoPoolationTE2 requires sorted files for generating the ppileup file
--index: Create an index for the output file
--help: Show a help message

updateStrand

Per default, the PoPoolationTE2 pipeline does not estimate the strand of a TE insertion (i.e. sense or antisense). If the strand information is desired, this step may be used.

:::bash
# minimum parameter call
java -jar popte2.jar updateStrand --signature toupdate.signature --output strandupdated.signature --bam sample1.bam --bam sample2.bam --bam sample3.bam --hier tehierarchy.txt --max-disagreement 0.1

Parameters

--bam: a bam file of paired-end reads mapped to the TE-merged-reference; may be provided multiple times; must be in the same order as was used for generating the ppileup file; mandatory
--signature: the signatures for which the strand of the TE insertions should be estimated
--output: signatures with strand information
--hier: the TE hierarchy
--map-qual: the minimum mapping quality of reads mapping to a reference chromosome (not to a TE)
--max-disagreement: different paired-end fragments may disagree on the strand of the TE insertion. If the provided maximum disagreement of paired-end fragments is exceeded, the strand will be reported as unknown (the character '.'). For example, 0.1 means that at most 10% of the reads may provide conflicting strand information; mandatory
--sr-mindist: minimum inner distance for structural rearrangements; if possible provide the same value as used for generating the ppileup; default=10000
--id-up-quant: ignore paired-end fragments with an insert size exceeding this fraction; if possible provide the same value as used for generating the ppileup; default=0.01
--detailed-log: show a more detailed logging message
--help: show a help message
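
The effect of --max-disagreement can be illustrated as follows. The sketch assumes that the disagreement is the fraction of fragments supporting the minority strand and that known strands are written as '+' and '-'; both are assumptions for this example, as is the function name `call_strand`.

```bash
# call_strand PLUS_FRAGMENTS MINUS_FRAGMENTS MAX_DISAGREEMENT
# Prints "+", "-" or "." (unknown strand).
call_strand() {
  awk -v plus="$1" -v minus="$2" -v max="$3" 'BEGIN {
    total = plus + minus
    if (total == 0) { print "."; exit }          # no informative fragments
    minority = (plus < minus) ? plus : minus     # fragments that disagree
    strand = (plus >= minus) ? "+" : "-"
    if (minority / total > max) {
      print "."                                  # too much conflict: strand unknown
    } else {
      print strand
    }
  }'
}
```

For example, `call_strand 19 1 0.1` prints + (5% of the fragments disagree), while `call_strand 5 15 0.1` prints . because 25% of the fragments disagree.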

stat-coverage

This step generates coverage statistics for the ppileup file

:::bash
# minimum parameter call
java -jar popte2.jar stat-coverage --ppileup input.ppileup --output coverage-statistics.txt

Parameters

--ppileup: a ppileup file [ppileup file]; mandatory
--output: statistics about the physical coverage for all samples in the ppileup file; for the format of the output file see [diverse output files]; mandatory
--detailed-log: show a more detailed logger message
--help: show the help

stat-reads

This step allows to generate statistics about the reads mapped to TE sequences. For example the fraction of reads mapping to each TE family may be computed.

:::bash
# minimum parameter call
java -jar popte2.jar stat-reads --bam input1.bam --hier tehierarchy.txt --output read-stat.txt

Parameters

--bam: a bam file of paired end reads mapped to the TE-merged-reference; only a single file can be provided; mandatory
--map-qual: the minimum mapping quality of a read mapping to a TE (!); default=0
--hier: the TE hierarchy; mandatory
--output: the statistics of reads mapping to TEs; for the format of the output file see [diverse output files]; mandatory
--detailed-log: show a more detailed logging message
--help: show the help

stat-pairs

This step generates statistics about mapped paired-end fragments, allowing the user to estimate the fraction of fragments mapped as proper pairs, as discordant pairs that support a TE insertion, and as discordant pairs that support a structural rearrangement.

:::bash
# minimum parameter call
java -jar popte2.jar stat-pairs --bam input1.bam --hier tehierarchy.txt --output read-stat.txt

Parameters

--bam: a bam file of paired end reads mapped to the TE-merged-reference; only a single file can be provided; mandatory
--map-qual: the minimum mapping quality of a read mapping to a TE (!); default=0
--hier: the TE hierarchy; mandatory
--output: the statistics of paired end fragments; for the format of the output file see [diverse output files]; mandatory
--detailed-log: show a more detailed logger message
--help: show the help

Wiki: FAQ
Wiki: Home
Wiki: TE insertion file
Wiki: Walkthrough
Wiki: WalkthroughPreparatoryWork
Wiki: diverse output files
Wiki: estimate frequency
Wiki: ppileup file
Wiki: signature file format
Wiki: signature modes
Wiki: signatures of TE insertions
