PoPoolation TE2 Wiki

Brought to you by: rokofler

Introduction and basic concepts

Introduction
- Two vexing problems - motivation for developing PoPoolationTE2
Basic concepts
- Coverage vs physical coverage
- Signatures of TE insertions

Introduction

PoPoolationTE2 is designed to enable an unbiased comparison of TE abundance between samples.
It aims to avoid biases that are frequently encountered when comparing TE abundance using paired-end (PE) data. A major problem is that the power to identify TE insertions is quite heterogeneous within and between samples.
This power depends on the number of mapped PE reads and on the insert size of the paired ends; in other words, it depends on the physical coverage. In PoPoolationTE2 the physical coverage is the product of read numbers and inner distance. Comparing TE abundance between sites or samples that differ in power (physical coverage) will thus lead to biases, because more TE insertions will be found in regions with higher physical coverage.
Not even subsampling the reads of all samples to equal numbers solves the problem, mostly because samples may have different inner distances and different coverage heterogeneities.
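To illustrate why equal read numbers do not guarantee equal power, here is a minimal arithmetic sketch in Python. It is not part of PoPoolationTE2; the function name `mean_physical_coverage`, the genome size and the read numbers are hypothetical, and dividing by the genome size to obtain an average per-site value is an assumption made for this illustration.

```python
def mean_physical_coverage(read_pairs: int, inner_distance: int, genome_size: int) -> float:
    """Average per-site physical coverage (illustrative assumption):
    total bases spanned by inner distances divided by the genome size."""
    return read_pairs * inner_distance / genome_size


# Two hypothetical samples with identical read numbers but different inner distances.
genome_size = 1_000_000
sample_a = mean_physical_coverage(read_pairs=100_000, inner_distance=100, genome_size=genome_size)
sample_b = mean_physical_coverage(read_pairs=100_000, inner_distance=300, genome_size=genome_size)

print(sample_a, sample_b)
```

Although both samples contain the same number of read pairs, the sample with the larger inner distance ends up with three times the physical coverage, and would therefore have more power to identify TE insertions.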

To homogenize the power to identify TEs between and within samples we introduced the physical pileup (ppileup) file, which summarizes for every site in the genome the structural status derived from mapped paired-end reads. A ppileup file may be generated for a single population/sample or for multiple ones. The ppileup file allows subsampling the physical coverage to equal levels between and within samples and thus homogenizing the power to identify TEs, which in turn enables an unbiased comparison of TE abundance. Subsampling is only done for sites having sufficient coverage in all samples; as a side effect, subsampling thus restricts the analysis to sites having sufficient power to identify TEs in all samples. Finally, TE insertions are identified directly from the physical pileup file. It is our hope that the ppileup file could also assist third parties in developing tools for TE identification.
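The subsampling idea described above can be sketched roughly as follows. This is only a conceptual illustration, not the ppileup file format and not the algorithm implemented in PoPoolationTE2: the nested-dictionary layout, the function name `subsample_to_target` and the pair identifiers are all hypothetical.

```python
import random


def subsample_to_target(sites_by_sample, target, seed=0):
    """Equalize per-site physical coverage across samples (conceptual sketch).

    sites_by_sample: {sample name: {site: [ids of pairs spanning the site]}}
    Sites that do not reach `target` in every sample are dropped; at all
    remaining sites each sample is randomly reduced to exactly `target` pairs.
    """
    rng = random.Random(seed)
    names = list(sites_by_sample)
    # Only sites present in every sample can be compared at all.
    shared_sites = set.intersection(*(set(sites_by_sample[n]) for n in names))
    result = {n: {} for n in names}
    for site in sorted(shared_sites):
        if all(len(sites_by_sample[n][site]) >= target for n in names):
            for n in names:
                result[n][site] = rng.sample(sites_by_sample[n][site], target)
    return result


# Hypothetical input: site 3 has insufficient coverage in sample1.
coverage = {
    "sample1": {1: ["p1", "p2", "p3", "p4"], 2: ["p1", "p2"], 3: ["p5"]},
    "sample2": {1: ["q1", "q2"], 2: ["q1", "q2", "q3"], 3: ["q4", "q5", "q6"]},
}
equalized = subsample_to_target(coverage, target=2)
print(equalized)
```

In this toy input, sites 1 and 2 are retained with a physical coverage of 2 in both samples, whereas site 3 is excluded from both because one sample does not reach the target.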

Two vexing problems - motivation for developing PoPoolationTE2

Two problems frequently arise when comparing TE abundance between samples.

- A TE insertion of interest is identified in one sample (e.g. a mutation accumulation line) but not in another sample (e.g. the base population of the mutation accumulation line). Is this insertion a truly novel TE insertion, or rather an artefact of the analysis, where the coverage in the base population was simply insufficient to identify the insertion? PoPoolationTE2 addresses this problem by allowing the analysis to be restricted to regions having sufficient power to identify TEs in all samples.
- When comparing TE abundance between samples, like pooled populations or tissues, it may be interesting to figure out which sample contains the most TE insertions, for example whether a pooled population from, say, Florida has more TEs than a population from Africa. Answering this question for pooled populations is usually challenging, as pooled data are mostly unsaturated for TEs, which means that the number of identified TEs increases with read numbers. Therefore, most TEs are usually found in the sample with the most reads. Subsampling the reads in all samples to equal numbers does not entirely solve the problem. We found that homogenizing the power to identify TEs between and within samples allows for the least biased comparison of TE abundance between and within samples, even if samples have differing inner distances.

Basic concepts

Coverage vs physical coverage

The power to identify TE insertions with paired-end data increases with the physical coverage. In contrast to the base coverage, which only scales with read numbers, the physical coverage scales with both read numbers and insert size.
Base coverage is defined as the number of reads spanning a genomic site. Physical coverage, as used in PoPoolationTE2, is defined as the number of paired ends spanning a genomic site, where only the inner distance is considered (the read sequences are ignored here).

In the following example (attachment coverage_physicalcoverage.png), site a is covered by one read and thus has a base coverage of 1. However, the site is spanned by two paired ends and thus has a physical coverage of 2.
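The two definitions can be made concrete with a small Python sketch. It is an illustration only and not code from PoPoolationTE2; the `ReadPair` type, the inclusive coordinate convention and the example positions are assumptions chosen to reproduce a site with base coverage 1 and physical coverage 2.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """A mapped paired end; all coordinates are inclusive (assumed convention)."""
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def base_coverage(pairs, site):
    """Number of reads (either read of a pair) whose sequence covers the site."""
    return sum(
        (p.left_start <= site <= p.left_end) + (p.right_start <= site <= p.right_end)
        for p in pairs
    )


def physical_coverage(pairs, site):
    """Number of pairs whose inner distance (the gap between the reads) spans the site."""
    return sum(p.left_end < site < p.right_start for p in pairs)


# Hypothetical pairs: two inner distances span site 30, one read covers it.
pairs = [
    ReadPair(1, 10, 41, 50),   # inner distance 11..40 spans the site
    ReadPair(15, 24, 55, 64),  # inner distance 25..54 spans the site
    ReadPair(28, 37, 70, 79),  # left read 28..37 covers the site
]
site = 30
print(base_coverage(pairs, site), physical_coverage(pairs, site))
```

Note that a pair whose read covers the site does not contribute to the physical coverage at that site, because only the inner distance is counted.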

For more details see [ppileup file]

Signatures of TE insertions

When paired ends are mapped to a modified genome consisting of a repeat-masked reference genome and a set of TE sequences, then any TE insertion will lead to groups of discordantly mapped pairs, where one read maps to a TE and the other read to the reference genome (see graphic signature.png). One group of discordantly mapped reads will be found to the left of the TE insertion (forward signature) and one to the right of the insertion (reverse signature).
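As a rough illustration of how such discordant pairs could be sorted into the two groups, consider the following Python sketch. It is not the PoPoolationTE2 implementation. It assumes a standard inward-facing paired-end library, so that the reference-anchored read left of an insertion maps to the forward strand and the one right of it maps to the reverse strand; the `MappedRead` type, the contig name `2L` and the TE name `roo` are example values chosen for this sketch.

```python
from typing import NamedTuple


class MappedRead(NamedTuple):
    contig: str    # reference contig or TE sequence name
    position: int  # mapping position on that contig
    strand: str    # "+" or "-"


def classify_pair(read1, read2, te_names):
    """Return (signature, TE name, reference position), or None if the pair
    does not support an insertion (both reads in the reference or both in TEs)."""
    in_te1 = read1.contig in te_names
    in_te2 = read2.contig in te_names
    if in_te1 == in_te2:
        return None
    anchor, te_read = (read2, read1) if in_te1 else (read1, read2)
    # Assumption: a forward-strand anchor lies left of the insertion (forward
    # signature), a reverse-strand anchor lies right of it (reverse signature).
    signature = "forward" if anchor.strand == "+" else "reverse"
    return signature, te_read.contig, anchor.position


te_names = {"roo"}
example_pairs = [
    (MappedRead("2L", 1000, "+"), MappedRead("roo", 50, "-")),
    (MappedRead("roo", 8000, "+"), MappedRead("2L", 1300, "-")),
    (MappedRead("2L", 500, "+"), MappedRead("2L", 800, "-")),
]
results = [classify_pair(r1, r2, te_names) for r1, r2 in example_pairs]
print(results)
```

In this toy data set the first pair supports a forward signature, the second a reverse signature of the same TE family a few hundred bases downstream, and the third pair is concordant and carries no information about an insertion.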

For more details see [signatures of TE insertions]

Wiki: Home
Wiki: ppileup file
Wiki: signature file format
Wiki: signatures of TE insertions