The three TE landscape metrics estimate how well an assembly captures the TE landscape of an organisms. In particular we ask i) is the TE abundance accurately reproduced in an assembly ii) are all SNPs found in TEs present in an assembly and iii) are all internal deletions of TEs present in an assembly.
For computing the TE landscape metrics we require the expected and the observed TE abundance and diversity.
The expected TE landscape is derived from Illumina raw reads for an organism of interest and the observed landscape is estimated from the assemblies.
You need to install:
:::bash
source activate deviate_env
Here we estimate the TE landscape metrics for a short- and a long-read based assembly of the D. melanogaster strain Canton-S.
For this walkthroug we need the following files
First unzip the reads
:::bash
gzip -d Canton-S_Illumina_2x.fastq.gz
Identify and remember the length of your reads, we need it at a later step for generating the artificial reads. For example.:
:::bash
cat Canton-S_Illumina_2x.fastq | paste - - - - | head -1 | awk '{print length($2)}'
125
# our read length is 125bp
Index the consensus sequences of TEs (reads will be mapped against these sequences using bwa bwasw)
:::bash
bwa index dmel_consensus.fasta
Move the reads into a separate folder (this is necessary as DeviaTE generates many files that summarize the TE landscape)
:::bash
mkdir expected
mv Canton-S_Illumina_2x.fastq expected
Estimate the expected TE landscape with DeviaTE
:::bash
cd expected
deviate --library ../dmel_consensus.fasta --threads 10 --rpm --families ALL --input_fq Canton-S_Illumina_2x.fastq
# we concatenate all raw files of DeviaTE (one for each TE family), these files contain the TE landscape
cat *.raw >../expected.txt
:::bash
# go back into main folder
cd ..
# generate a directory for the assemly
mkdir observed
# generate the reads, use the read length determined above from the Illumina raw reads (i.e expected values)
python cuscoquality/artificial-reads-for-assembly.py --assembly subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta --read-length 125 > observed/assembly-reads.fastq
:::bash
cd observed
deviate --library dmel_consensus.fasta --threads 10 --rpm --families ALL --input_fq assembly-reads.fastq
cat *.raw > ../observed.txt
All left to do is to compute the TE landscape metrics
:::bash
python cuscoquality/te-landscape-metrics.py --observed observed.txt --expected expected.txt
As estimating the landscape metrics with DeviaTE is a bit time consuming we provide precomputed estimates of the TE abundance and diversity.
Next unzip all files, e.g.:
:::bash
gzip -d Canton-S_Illumina_2x.fastq.gz
** Long read based assembly**
First we computed the landscape metrics for the long-read based assembly
:::bash
python cuscoquality/te-landscape-metrics.py --observed observed_100x_canu.raw --expected CantonS_30x.raw --min-count 2 --min-freq 0.02 --min-cov 0.05
# which gives the following result:
TE landscape metrics
TE abundance - slope 1.0063
SNP count - slope 1.0297
ID count - slope 1.0672
Note that the TE abundance, the abundance of SNPs in the TEs and the abundance of IDs in the TEs are close to 1, hence a long-read based assembly captures the TE landscape quite accurately
** Short read based assembly**
As a contrast we now show the results for a short read based assembly
:::bash
python cuscoquality/te-landscape-metrics.py --observed abyss_CS.raw --expected CantonS_30x.raw --min-count 2 --min-freq 0.02 --min-cov 0.05
# which gives the results:
TE landscape metrics
TE abundance - slope 0.5529
SNP count - slope 1.1415
ID count - slope 1.5300
Note that the short read based assembly has a low quality as it badly captures the TE landscape of Canton-S. The TE abundancy is grossly underestimated and the number of SNPs and IDs is dramatically overestimated.
For some users it may be interesting to get the raw-data which have been used to compute the slope of the linear regression.
This may be easily achieved using the parameter --print-raw-data
:::bash
python cuscoquality/te-landscape-metrics.py --observed observed_100x_canu.raw --expected CantonS_30x.raw --min-count 2 --min-freq 0.02 --min-cov 0.05 --print-raw-data
which gives the following results:
:::bash
DMRER2DM 8349.978057024851 340 8 8771.11808385376 372 5
RT1C 2716.1252999699536 2079 30 2788.888785939409 2130 41
DIVER2 5113.954950839485 1125 33 5641.890622162796 1174 33
DMPOGOR11 3071.2622164669892 23 13 2533.667445009262 37 14
STALKER3 442.37612212273575 56 1 587.144605053864 65 0
...
...
Remember expected values are based on the Illumina short reads and observed values are inferred from the assembly