SimulaTE Wiki

Brought to you by: rokofler

Walkthrough_species_tool_compatibility

Introduction
Walkthrough:
Note

Introduction

Suppose you want to identify TE insertions in a species of interest, say Dunkleosteus terrelli, but you do not know whether a given tool provides reliable results for this species or whether the available genomic resources are suitable.

The performance of an approach for TE identification depends on three factors:

the reference genome (e.g. TE identification will be difficult for highly repetitive genomes)
the TE sequences (e.g. TE identification may be difficult when TE sequences have a high sequence similarity)
the tool (some algorithms are simply better than others; there may also be interactions between tools and genomic resources, for example if a tool is very sensitive to repetitive regions in genomes)

Thus it is necessary to evaluate the performance of an approach for TE identification.
To do so, simulate a TE landscape with known insertions (using the genomic resources of the species of interest) and then test which tool best reproduces the simulated TE landscape.

In this walkthrough we demonstrate how to simulate a TE landscape and Illumina paired-end reads for a species of interest. These reads may then be used to evaluate the performance of an approach for TE identification.

Walkthrough:

Requirements:

the sequence of a chromosome of the species of interest in FASTA format; if the entire genome should be used, just concatenate the sequences of all chromosomes. In this walkthrough we use chromosome 2R of Drosophila: https://sourceforge.net/projects/simulates/files/walkthrough-species/2R.fasta/download
the sequences of the TEs that should be identified, for example consensus sequences of the TE families present in the species of interest. In this walkthrough we use the consensus sequences of Drosophila melanogaster TEs: https://sourceforge.net/projects/simulates/files/walkthrough-species/teseq-clean-ml100noS4.fasta/download

Mask the TEs

We need to be able to build arbitrarily complex TE landscapes, and TE insertions already present in the reference genome would interfere with this process. Thus we need to mask the TE sequences in the reference genome. We use RepeatMasker to mask all TEs with the character N and then a custom script to remove all Ns from the sequence: https://sourceforge.net/projects/simulates/files/walkthrough-species/remove-N.py/download

RepeatMasker -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4 -lib teseq-clean-ml100noS4.fasta 2R.fasta
python remove-N.py 2R.fasta.masked >2R.clean.fasta
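
For readers who cannot download the script, the N-removal step can be approximated as follows. This is a minimal sketch that assumes the script does nothing more than delete every N (and lowercase n) from each sequence while leaving the headers untouched; the actual remove-N.py may behave differently, and the function name strip_n is hypothetical.

```python
def strip_n(fasta_lines, width=60):
    """Return FASTA lines with every N/n removed from the sequences.

    Hypothetical helper: headers are kept as-is, sequences are
    re-wrapped to `width` characters per line.
    """
    out = []
    seq = []

    def flush():
        # Write the sequence collected so far, wrapped to `width`.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()           # finish the previous record
            out.append(line)  # keep the header unchanged
        else:
            seq.append(line.replace("N", "").replace("n", ""))
    flush()
    return out


# Toy record: the masked stretches (N) disappear, the rest is joined.
cleaned = strip_n([">chr_toy", "ACNNGT", "nnTT"])
```

To apply the function to a real file, pass an open file handle (for example the 2R.fasta.masked file written by RepeatMasker) and print the returned lines.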

Generate a TE landscape

In this walkthrough we generate a random TE landscape, with random position, family, strand and population frequency of TE insertions. For a walkthrough demonstrating how to generate custom TE landscapes see [Walkthrough]

python define-landscape_random-insertions-freq-range.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd
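
The options of the command above correspond to a simple sampling procedure. The following is a minimal Python sketch of that idea, not the code of define-landscape_random-insertions-freq-range.py. It assumes that --min-distance is the minimum spacing between any two insertions and that population frequencies are drawn uniformly between --min-freq and --max-freq. The function name, the dictionary keys and the family names are hypothetical.

```python
import random


def random_landscape(chassis_len, families, insert_count,
                     min_freq, max_freq, min_distance, seed=None):
    """Draw random TE insertions (illustrative sketch only)."""
    rng = random.Random(seed)
    positions = []
    attempts = 0
    while len(positions) < insert_count:
        attempts += 1
        if attempts > 100 * insert_count:
            raise ValueError("cannot place all insertions; relax min_distance")
        pos = rng.randint(1, chassis_len)
        # Accept the site only if it is far enough from all earlier sites.
        if all(abs(pos - p) >= min_distance for p in positions):
            positions.append(pos)

    return [
        {
            "position": pos,
            "family": rng.choice(families),          # random TE family
            "strand": rng.choice("+-"),              # random orientation
            "frequency": rng.uniform(min_freq, max_freq),
        }
        for pos in sorted(positions)
    ]


# Small example with made-up values (not the walkthrough's real sizes).
landscape = random_landscape(
    chassis_len=100_000, families=["famA", "famB"], insert_count=50,
    min_freq=0.1, max_freq=0.9, min_distance=500, seed=1,
)
```

Rejection sampling is used here only because it is easy to read; for 1000 insertions on a chromosome of several megabases almost every draw is accepted.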

Build the population genome

Next we build the population genome from the masked chromosome, the TE sequences and the TE landscape defined in the previous step.

python build-population-genome.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --pgd mylandscape.pgd --output mylandscape.pg

Simulate Illumina paired-end reads

Based on the population genome we simulate Illumina paired-end reads:

python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
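
Before simulating reads it is worth checking which sequencing depth a given number of reads corresponds to. The sketch below is plain arithmetic and not part of SimulaTE. It assumes that --reads counts read pairs (check the help of the script to confirm this), and the genome length of 20 Mb is an illustrative round number, not the length of 2R.clean.fasta. Both function names are hypothetical.

```python
import math


def expected_coverage(read_count, read_length, genome_length, paired=True):
    """Average depth = sequenced bases / genome length."""
    bases = read_count * read_length * (2 if paired else 1)
    return bases / genome_length


def reads_needed(target_coverage, read_length, genome_length, paired=True):
    """Number of reads (or pairs, if paired) required for a target depth."""
    bases_per_unit = read_length * (2 if paired else 1)
    return math.ceil(target_coverage * genome_length / bases_per_unit)


# Illustrative values: 100,000 pairs of 100 bp on a 20 Mb sequence.
depth = expected_coverage(100_000, 100, 20_000_000)
# Pairs required for a 50-fold average depth on the same sequence.
pairs_for_50x = reads_needed(50, 100, 20_000_000)
```

Under these assumptions the example values give an average depth of about 1, so the number of reads may need to be scaled up when population frequencies are to be estimated from the simulated data.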

Next steps

The obtained Illumina paired-end reads may be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP.
An example demonstrating TE identification with the simulated reads and a comparison between the expected and observed TE landscape can be found here: [Validation_Pop2]
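
The comparison between the expected and the observed TE landscape can be sketched in a few lines of Python. This is not the procedure used in [Validation_Pop2]. It is a minimal illustration that assumes both landscapes have already been converted to lists of (chromosome, position, family) tuples and that a predicted insertion counts as correct when it has the same family and lies within a hypothetical tolerance window of a simulated insertion. The function name and the example insertions are made up.

```python
def compare_landscapes(expected, observed, window=100):
    """Match observed insertions to expected ones (illustrative sketch)."""
    unmatched = list(observed)
    true_pos = 0
    for chrom, pos, family in expected:
        # First still-unmatched prediction on the same chromosome,
        # of the same family, within `window` bp of the simulated site.
        hit = next(
            (o for o in unmatched
             if o[0] == chrom and o[2] == family and abs(o[1] - pos) <= window),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each prediction is used only once
            true_pos += 1

    return {
        "true_positives": true_pos,
        "false_negatives": len(expected) - true_pos,
        "false_positives": len(unmatched),
        "recall": true_pos / len(expected) if expected else 0.0,
        "precision": true_pos / len(observed) if observed else 0.0,
    }


# Made-up example: three simulated insertions, three predictions.
expected = [("2R", 1000, "roo"), ("2R", 5000, "P-element"), ("2R", 9000, "412")]
observed = [("2R", 1040, "roo"), ("2R", 5000, "P-element"), ("2R", 20000, "roo")]
result = compare_landscapes(expected, observed)
```

Running the same comparison for several tools on the same simulated reads gives a simple way to rank them; comparing the estimated population frequencies would be a natural extension.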

The TE identification pipelines differ substantially among the tools for TE identification from NGS data. Moreover, the pipeline may also change substantially with the version of the tool. Hence, we refer to the manual of the respective tool for details. To give a few examples, the following tools may be used with SimulaTE data:

PoPoolationTE2 https://sourceforge.net/projects/popoolation-te2/
T-LeX2 https://academic.oup.com/nar/article/43/4/e22/2410985/T-lex2-genotyping-frequency-estimation-and-re
TEMP https://www.ncbi.nlm.nih.gov/pubmed/24753423
LoRTE https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5385071/
Retroseq https://www.ncbi.nlm.nih.gov/pubmed/23233656
TE-Tracker https://www.ncbi.nlm.nih.gov/pubmed/25408240
Jitterbug https://www.ncbi.nlm.nih.gov/pubmed/26459856

Note

In this walkthrough we simulated Illumina paired-end data for a population sequenced as a pool (Pool-Seq). SimulaTE, however, can also simulate:

Illumina paired-end data for sequencing individuals of a population separately
PacBio data when individuals are sequenced as pools
PacBio data when individuals are sequenced separately
Illumina single-end data when individuals are sequenced as pools
Illumina single-end data when individuals are sequenced separately

Wiki: Home
Wiki: Validation_Pop2
Wiki: Walkthrough