This short walkthrough demonstrates how to generate a random TE landscape and how to simulate Illumina paired-end reads for this landscape.
We simulate 1000 TE insertions in a population of N=100 haploid genomes. The TE insertions have a random position, family, strand and population frequency (between a range of 0.1 and 0.9). Finally we simulate Illumina paired-end data for a pooled population (Pool-Seq).
first we generate a pgd-file, i.e. a description of the TE landscape using our DSL (for details see https://sourceforge.net/p/simulates/wiki/Home/#manual)
python define-landscape_random-insertions-freq-range.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd
Note: we use a minimum distance of 500 between adjacent TE insertions
based on the pgd-file we generate the population genome
python build-population-genome.py --pgd mylandscape.pgd --te-seqs teseq-clean-ml100noS4.fasta --chassis chasis1M.fasta --output mylandscape.pg
We simulate 100.000 Illumina paired-end reads, with a read length of 100bp, an inner distance of 100bp, a standard deviation of the inner distance of 20 and an error rate of 15
python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq
The obtained reads may be used as input for tools identifying TE insertions using Illumina data, such as PoPoolationTE2 or TEMP.
The reads may be used to i) validate tools for TE identification and ii) test whether a given tool is suitable for TE identification in a species of interest (a particular set of genomic resources).
An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]
Next we show how to generate a custom TE landscape and simulate PacBio reads for this landscape.
We simulate 5 TE insertions in a population of N=100 haploid genomes. The TE insertions have a random position. We manually specifiy the family the strand and the population frequency. Finally we simulate PacBio reads sequencing all individuals separately (assuming a diploid organism).
First we generate an empty pgd-file where we manually fill in the details for the TE insertions.
python define-landscape_template.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --N 100 --insert-count 5 --output custom-landscape.pgd
Following the empty template landscape custom-lanscape.pgd
1=$1 # M14653
2=$2 # DME9736
...
120=$120 # Beagle2
121=$121 # Q
122=$122 # OSV
123=$123 # DME542581
150985 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
158736 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
273676 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
298778 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
439415 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
The first part defines the TE sequences and the second part specifies TE insertions in the 100 individuals, where a star indicates no insertion in a given haploid genome. For more details on the population genome definition file see [describing_TE_landscapes] and [describing_TE_sequences]
Next we edit the file (using any text editor of choice) and specifiy TE insertions similarly as in the following example (don't forget to safe your changes):
# Chasis 2R; Length 1000000 nt
mariner=$1 # M14653
idefix=$2 # DME9736
...
120=$120 # Beagle2
q=$121 # Q
120=$122 # OSV
123=$123 # DME542581
nest=120+{50:mariner-}
multinest=idefix-{100:mariner+,300:q+{20:idefix}}
150985 * * * * * * * * * * * * nest nest nest * * nest- * * * * * * * nest * * * * * * * * * * * * nest * * * * * * * nest- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
158736 * * * * * * * * multinest- * * multinest- * * * multinest- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
273676 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * idefix * * * * * * * * * idefix * * * * * * * idefix * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
298778 * q+ q+ q+ q+ q+ * * * * q+ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
439415 * idefix * * idefix * * * mariner * idefix * * * * * * * * * * mariner * * * * * * * * * mariner * * * * * * * * * * * mariner mariner * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Note that we re-named some insertions, like 1 into mariner; this was just done for convenience and is not required; dealing with mariner is just more intuitive than dealing with 1
Note that we generated two new insertions: nest is a mariner nested within OSV ; multinest are several nested insertions within idefix; for details on our DSL for describing TE sequences see [describing_TE_sequences]
Based on our pgd-file we than build the population genome:
python ~/dev/simulate/build-population-genome.py --pgd custom-landscape.pgd --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --output custom-landscape.pg
PacBio reads frequently have a bimodal read length distribution; In this example we use the following read length distribution: https://sourceforge.net/projects/simulates/files/validation_reads/rld.txt/download
which is visualized here:
python read_individual_pacbio.py --pg custom-landscape.pg --rld-file rld.txt --error-rate 0.1 --deletion-fraction 0.5 --reads 10000 --fasta-prefix reads
Additionally the reads will have an error rate of 10%, where half the errors are deletions and the other half insertions (--deletion-fraction); We have 100 haploid genomes thus 50 diploid individuals will be simulated (always two consecutive entries from the pgd-file). We simulate 10.000 reads for each diploid individual.
This command will generate the following 50 read files:
-rw-r--r-- 1 robertkofler staff 113067241 Jul 7 14:05 reads1.fasta
-rw-r--r-- 1 robertkofler staff 113444056 Jul 7 14:06 reads2.fasta
-rw-r--r-- 1 robertkofler staff 112483761 Jul 7 14:07 reads3.fasta
-rw-r--r-- 1 robertkofler staff 113818134 Jul 7 14:08 reads4.fasta
...
-rw-r--r-- 1 robertkofler staff 113177013 Jul 7 14:46 reads49.fasta
-rw-r--r-- 1 robertkofler staff 113611996 Jul 7 14:47 reads50.fasta
The generated reads may be used as input for tools identifying TE insertions using PacBio data, such as LoRTE https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-017-0088-x
The reads may be used to i) validate tools for TE identification and ii) test whether a given tool is suitable for TE identification in a species of interest.
Wiki: Home
Wiki: Validation_Pop2
Wiki: Walkthrough_species_tool_compatibility
Wiki: describing_TE_landscapes
Wiki: describing_TE_sequences