Menu

TELandscapeMetrics

Robert Kofler Florian Schwarz

Introduction

The three TE landscape metrics estimate how well an assembly captures the TE landscape of an organisms. In particular we ask i) is the TE abundance accurately reproduced in an assembly ii) are all SNPs found in TEs present in an assembly and iii) are all internal deletions of TEs present in an assembly.
For computing the TE landscape metrics we require the expected and the observed TE abundance and diversity.
The expected TE landscape is derived from Illumina raw reads for an organism of interest and the observed landscape is estimated from the assemblies.

Requirements:

You need to install:

:::bash
source activate deviate_env

Walkthrough

Obtain the necessary files

Here we estimate the TE landscape metrics for a short- and a long-read based assembly of the D. melanogaster strain Canton-S.
For this walkthroug we need the following files

Preparatory steps

First unzip the reads

:::bash
gzip -d Canton-S_Illumina_2x.fastq.gz

Identify and remember the length of your reads, we need it at a later step for generating the artificial reads. For example.:

:::bash
cat Canton-S_Illumina_2x.fastq | paste - - - - | head -1 | awk '{print length($2)}'
125 
# our read length is 125bp

Index the consensus sequences of TEs (reads will be mapped against these sequences using bwa bwasw)

:::bash
bwa index dmel_consensus.fasta

Expected TE abundance and diversity

Move the reads into a separate folder (this is necessary as DeviaTE generates many files that summarize the TE landscape)

:::bash
mkdir expected
mv Canton-S_Illumina_2x.fastq expected

Estimate the expected TE landscape with DeviaTE

:::bash
cd expected
deviate --library ../dmel_consensus.fasta --threads 10 --rpm --families ALL --input_fq Canton-S_Illumina_2x.fastq

# we concatenate all raw files of DeviaTE (one for each TE family), these files contain the TE landscape
cat *.raw >../expected.txt

Generate artificial reads for the assembly

:::bash
# go back into main folder
cd .. 

# generate a directory for the assemly
mkdir observed

# generate the reads, use the  read length determined above from the Illumina raw reads (i.e expected values)
python cuscoquality/artificial-reads-for-assembly.py --assembly subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta  --read-length 125 > observed/assembly-reads.fastq

Observed TE abundance and diversity

:::bash
cd observed 
deviate --library dmel_consensus.fasta --threads 10 --rpm --families ALL --input_fq assembly-reads.fastq
cat *.raw > ../observed.txt

TE landscape metrics

All left to do is to compute the TE landscape metrics

:::bash
python cuscoquality/te-landscape-metrics.py --observed observed.txt --expected expected.txt

As estimating the landscape metrics with DeviaTE is a bit time consuming we provide precomputed estimates of the TE abundance and diversity.

Next unzip all files, e.g.:

:::bash
gzip -d Canton-S_Illumina_2x.fastq.gz

** Long read based assembly**

First we computed the landscape metrics for the long-read based assembly

:::bash
python cuscoquality/te-landscape-metrics.py --observed observed_100x_canu.raw  --expected CantonS_30x.raw --min-count 2 --min-freq 0.02 --min-cov 0.05

# which gives the following result:
TE landscape metrics

TE abundance - slope 1.0063
SNP count  - slope 1.0297
ID count   - slope 1.0672

Note that the TE abundance, the abundance of SNPs in the TEs and the abundance of IDs in the TEs are close to 1, hence a long-read based assembly captures the TE landscape quite accurately

** Short read based assembly**
As a contrast we now show the results for a short read based assembly

:::bash
python cuscoquality/te-landscape-metrics.py --observed abyss_CS.raw --expected CantonS_30x.raw --min-count 2 --min-freq 0.02 --min-cov 0.05

# which gives the results:
TE landscape metrics

TE abundance - slope 0.5529
SNP count  - slope 1.1415
ID count   - slope 1.5300

Note that the short read based assembly has a low quality as it badly captures the TE landscape of Canton-S. The TE abundancy is grossly underestimated and the number of SNPs and IDs is dramatically overestimated.

Getting the raw data

For some users it may be interesting to get the raw-data which have been used to compute the slope of the linear regression.
This may be easily achieved using the parameter --print-raw-data

:::bash
python cuscoquality/te-landscape-metrics.py --observed observed_100x_canu.raw --expected CantonS_30x.raw --min-count 2 --min-freq 0.02 --min-cov 0.05 --print-raw-data

which gives the following results:

:::bash
DMRER2DM    8349.978057024851   340 8   8771.11808385376    372 5
RT1C    2716.1252999699536  2079    30  2788.888785939409   2130    41
DIVER2  5113.954950839485   1125    33  5641.890622162796   1174    33
DMPOGOR11   3071.2622164669892  23  13  2533.667445009262   37  14
STALKER3    442.37612212273575  56  1   587.144605053864    65  0
...
...
  • col1: the TE family
  • col2: expected abundance of the TE family
  • col3: expected abundance of SNPs for the TE family
  • col4: expected abundance of internal deletions (ID) for the TE family
  • col5: observed abundance of the TE family
  • col6: observed abundance of SNPs for the TE family
  • col7: observed abundance of internal deletions (ID) for the TE family

Remember expected values are based on the Illumina short reads and observed values are inferred from the assembly


Related

Wiki: Home