Suppose you want to identify TE insertions in a species of interest, say Dunkleosteus terrelli, but you do not know whether a given tool provides reliable results for this species or whether the available genomic resources are suitable.
Basically the performance of an approach for TE identification will depend on three factors:
Thus it is necessary to evaluate the performance of an approach for TE identification.
To address this question it is necessary to simulate a TE landscape with known insertions (using the genomic resources of the species of interest) and then test which tool best reproduces the simulated TE landscape.
In this walkthrough we demonstrate how to simulate a TE landscape and Illumina paired-end reads for a species of interest. These reads may then be used to evaluate the performance of an approach for TE identification.
We need to be able to build arbitrarily complex TE landscapes, but TE insertions already present in the reference genome would interfere with this process. Thus we need to mask the TE sequences in the reference genome. We use RepeatMasker to mask all TEs with the character N and then a custom script to remove all Ns from the sequence: https://sourceforge.net/projects/simulates/files/walkthrough-species/remove-N.py/download
RepeatMasker -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4 -lib teseq-clean-ml100noS4.fasta 2R.fasta
python remove-N.py 2R.fasta.masked >2R.clean.fasta
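To make the second step concrete, here is a minimal sketch of what removing the Ns amounts to. It is not the source of remove-N.py, which we do not reproduce here; the function name `strip_n` and the 60-character line width are assumptions made for illustration. The sketch deletes every N (upper or lower case) from the sequence lines of a FASTA stream and leaves the headers untouched.

```python
import io


def strip_n(fasta_in, fasta_out, width=60):
    """Copy a FASTA stream, deleting every N/n from the sequences.

    Hypothetical helper for illustration; not the code of remove-N.py.
    """
    def flush(header, chunks):
        # Write one record, re-wrapping the cleaned sequence.
        if header is None:
            return
        seq = "".join(chunks)
        fasta_out.write(header + "\n")
        for i in range(0, len(seq), width):
            fasta_out.write(seq[i:i + width] + "\n")

    header, chunks = None, []
    for line in fasta_in:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            flush(header, chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line.replace("N", "").replace("n", ""))
    flush(header, chunks)


# Tiny in-memory demonstration with a made-up record.
demo_in = io.StringIO(">chr_demo\nACGTNNNN\nnnACGT\n")
demo_out = io.StringIO()
strip_n(demo_in, demo_out)
```

Note that coordinates in the cleaned sequence no longer correspond to coordinates in the original reference, because the masked stretches have been deleted.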
In this walkthrough we generate a random TE landscape, in which the position, family, strand and population frequency of each TE insertion are chosen at random. For a walkthrough demonstrating how to generate custom TE landscapes see [Walkthrough].
python define-landscape_random-insertions-freq-range.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd
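The idea behind a random landscape can be sketched in a few lines. This is not the algorithm of define-landscape_random-insertions-freq-range.py and it does not produce the pgd format; the function `random_landscape`, its arguments and the family names are hypothetical. The sketch only illustrates drawing positions that respect a minimum distance, together with a random family, strand and a population frequency within a given range.

```python
import random


def random_landscape(chassis_len, families, count,
                     min_freq, max_freq, min_dist, seed=None):
    """Draw `count` random insertions as (pos, family, strand, freq) tuples.

    Hypothetical illustration, not SimulaTE's implementation.
    Positions are 1-based and at least `min_dist` bp apart.
    """
    rng = random.Random(seed)
    # Reserve (count - 1) * min_dist bases for the gaps, sample distinct
    # offsets in the remaining space, then spread them out again.
    slack = chassis_len - (count - 1) * min_dist
    if count < 1 or slack < count:
        raise ValueError("chassis too short for the requested insertions")
    offsets = sorted(rng.sample(range(slack), count))
    landscape = []
    for i, offset in enumerate(offsets):
        pos = offset + 1 + i * min_dist
        family = rng.choice(families)
        strand = rng.choice("+-")
        freq = rng.uniform(min_freq, max_freq)
        landscape.append((pos, family, strand, freq))
    return landscape


# Values mirror the command above; chassis length and families are made up.
demo_landscape = random_landscape(
    chassis_len=1_000_000, families=["famA", "famB", "famC"],
    count=1000, min_freq=0.1, max_freq=0.9, min_dist=500, seed=1)
```

Sampling in a shrunken coordinate space and then re-inserting the gaps guarantees the minimum distance without any retry loop, at the cost of a slightly different distribution than simple rejection sampling.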
Next we build the population genome:
python build-population-genome.py --chassis 2R.clean.fasta --te-seqs teseq-clean-ml100noS4.fasta --pgd mylandscape.pgd --output mylandscape.pg
Based on the population genome we simulate Illumina paired-end reads:
python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
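When choosing these parameters it helps to check the resulting fragment length and coverage. The following sketch does the arithmetic under two assumptions that should be verified against the SimulaTE manual: that the inner distance is the gap between the inner ends of the two reads, and that --reads counts read pairs. The genome length of 20 Mbp is a made-up value; use the length of your own cleaned chassis.

```python
import math


def mean_fragment_length(read_length, inner_distance):
    """Two reads plus the unsequenced gap between them (assumed definition)."""
    return 2 * read_length + inner_distance


def expected_coverage(read_pairs, read_length, genome_length):
    """Average depth, assuming `read_pairs` counts pairs, not single reads."""
    return 2 * read_pairs * read_length / genome_length


def pairs_for_coverage(target_coverage, read_length, genome_length):
    """Number of read pairs needed to reach a target average depth."""
    return math.ceil(target_coverage * genome_length / (2 * read_length))


# Values from the command above; the genome length is hypothetical.
demo_fragment = mean_fragment_length(read_length=100, inner_distance=100)
demo_coverage = expected_coverage(read_pairs=100_000, read_length=100,
                                  genome_length=20_000_000)
demo_pairs = pairs_for_coverage(target_coverage=50, read_length=100,
                                genome_length=20_000_000)
```

Under these assumptions 100,000 pairs of 100 bp reads give only about 1x coverage of a 20 Mbp chassis, so a realistic evaluation would likely need a considerably larger read count.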
The obtained Illumina paired-end reads may be used as input for tools identifying TE insertions using Pool-Seq data, such as PoPoolationTE2 or TEMP.
An example demonstrating TE identification with the simulated reads and a comparison between the expected and the observed TE landscape can be found here: [Validation_Pop2]
The TE identification pipelines differ substantially among the tools for TE identification from NGS data. Moreover, the pipeline may also change substantially between versions of a tool. Hence, we refer to the manual of the respective tool for details. Just to show a few examples, the following tools may be used with SimulaTE data:
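One simple way to compare an expected with an observed landscape is sketched below. It is not the procedure used in [Validation_Pop2]; the record type `Insertion`, the function `compare_landscapes` and the 100 bp tolerance window are assumptions chosen for illustration. An observed insertion counts as a true positive if an unmatched expected insertion of the same family lies on the same chromosome within the window.

```python
from collections import namedtuple

# Hypothetical record type; real tools report insertions in their own formats.
Insertion = namedtuple("Insertion", "chrom pos family freq")


def compare_landscapes(expected, observed, window=100):
    """Match observed to expected insertions and summarize the agreement."""
    unmatched = list(expected)
    matches = []
    for obs in observed:
        candidates = [e for e in unmatched
                      if e.chrom == obs.chrom and e.family == obs.family
                      and abs(e.pos - obs.pos) <= window]
        if candidates:
            # Pair with the closest expected insertion; each is used once.
            best = min(candidates, key=lambda e: abs(e.pos - obs.pos))
            unmatched.remove(best)
            matches.append((best, obs))
    tp = len(matches)
    fp = len(observed) - tp
    fn = len(unmatched)
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "recall": tp / len(expected) if expected else 0.0,
        "precision": tp / len(observed) if observed else 0.0,
        "mean_freq_error": (sum(abs(e.freq - o.freq) for e, o in matches) / tp
                            if tp else None),
    }


# Made-up example: one insertion recovered, one missed, one spurious call.
demo_expected = [Insertion("2R", 1000, "famA", 0.5),
                 Insertion("2R", 5000, "famB", 0.2)]
demo_observed = [Insertion("2R", 1020, "famA", 0.4),
                 Insertion("2R", 9000, "famB", 0.3)]
demo_summary = compare_landscapes(demo_expected, demo_observed)
```

The greedy matching is adequate when insertions are far apart relative to the window, as enforced by the --min-distance setting above; denser landscapes might call for a stricter matching scheme.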
In this walkthrough we simulated Illumina paired-end reads for a population sequenced as a pool (Pool-Seq). SimulaTE, however, can also simulate