Introduction
SimulaTE is a package for simulating arbitrarily complex TE landscapes, i.e. TE insertions in the genomes of all individuals in a population, and for simulating reads of different sequencing technologies (Illumina, PacBio) based on this population. Reads may be simulated either for individuals separately or for a pooled population (Pool-Seq). SimulaTE may be used to i) evaluate the performance of software for the identification of TEs and ii) evaluate the suitability of given genomic resources, such as a reference genome or a set of known TE insertions, for estimating TE abundance.
SimulaTE operates at three tiers (see figure).
First, a TE landscape needs to be specified using a simple domain-specific language that we developed for this purpose. We provide scripts that aid in this task. Second, based on the description of the TE landscape, the genomes of all individuals in a population are simulated (the population genome). Third, reads mimicking the properties of various sequencing technologies are simulated using the population genome as template. In the following overview, scripts are shown as grey rectangles and files as white rhombi.

We provide multiple scripts to simulate i) TE landscapes (script name starts with define-landscape_), ii) reads for Pool-Seq (read_pool-seq_) and iii) reads for sequencing individuals (read_individual_).
Installation
Download the latest release and unzip the file in any folder of your choice.
The scripts can be used immediately by providing the python command and the path to the script, for example:
# absolute path
python /Users/robert/programs/simulate/build-population-genome.py --pgd mylandscape.pgd --te-seqs teseq-clean-ml100noS4.fasta --chassis chasis1M.fasta --output mylandscape.pg
# relative path
python programs/simulate/build-population-genome.py --pgd mylandscape.pgd --te-seqs teseq-clean-ml100noS4.fasta --chassis chasis1M.fasta --output mylandscape.pg
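The same call can also be assembled from a small Python driver, which is convenient when several landscapes are built in a loop. The following is a minimal sketch: the helper name population_genome_command is an illustrative assumption, while the script name and the four options are taken from the example above.

```python
def population_genome_command(simulate_dir, pgd, te_seqs, chassis, output):
    """Return the argument list for build-population-genome.py.

    simulate_dir is the folder into which SimulaTE was unzipped.
    """
    script = simulate_dir.rstrip("/") + "/build-population-genome.py"
    return [
        "python", script,
        "--pgd", pgd,
        "--te-seqs", te_seqs,
        "--chassis", chassis,
        "--output", output,
    ]


cmd = population_genome_command(
    "programs/simulate",
    "mylandscape.pgd",
    "teseq-clean-ml100noS4.fasta",
    "chasis1M.fasta",
    "mylandscape.pg",
)
print(" ".join(cmd))
# To actually run the command:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.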
Use SimulaTE scripts directly
Some users prefer to use the script directly, without providing the python command and the path to the script, as shown in the following example:
build-population-genome.py --pgd mylandscape.pgd --te-seqs teseq-clean-ml100noS4.fasta --chassis chasis1M.fasta --output mylandscape.pg
To use the scripts directly you need to follow these two steps:
1.) Make the SimulaTE scripts executable
# go to the SimulaTE folder, e.g.
cd /Users/robert/programs/simulate
# and change the file mode to executable
chmod +x *.py
2.) Add the path of SimulaTE to the environment variable $PATH
Find the absolute path of your SimulaTE installation (e.g. "/Users/robert/programs/simulate") and add the following line to the file .bash_profile in your home directory (use a text editor of your choice).
export PATH=/Users/robert/programs/simulate:$PATH
Note: You need to open a new instance of the shell for this modification to take effect.
For more info on adding a path to the environment variable PATH see https://www.cyberciti.biz/faq/how-to-add-to-bash-path-permanently-on-linux/ and https://stackoverflow.com/questions/14637979/how-to-permanently-set-path-on-linux
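Whether the two steps above took effect can be checked from Python. The following is a minimal sketch using shutil.which from the standard library, which only reports files that are both on the search path and executable; the helper name find_script is an illustrative assumption.

```python
import shutil


def find_script(script_name, search_path=None):
    """Return the full path of an executable script, or None if not found.

    With search_path=None the current $PATH is searched.
    """
    return shutil.which(script_name, path=search_path)


location = find_script("build-population-genome.py")
if location is None:
    print("not found: check chmod +x and the export PATH line")
else:
    print("found:", location)
```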
Manual
The following is a description of all the scripts provided with SimulaTE. Parameters within square brackets are optional; all other parameters must be provided.
Tier 1: defining TE landscapes
define-landscape_template.py
The script creates an empty default TE landscape that may be filled in by the user.
- --chassis a fasta file containing a single sequence. TEs will be inserted into this sequence; we call this sequence the chassis. For example, a chromosome arm could be provided.
- --te-seqs a fasta file that may contain one or multiple entries; this can be used to define the TE sequences that will be inserted into the chassis; for example, the consensus sequences of TE families could be provided
- --N the population size (number of haploid genomes)
- [--insert-count] number of empty TE insertions to create (these empty insertions may be manually filled in by the user)
- [--min-distance] a minimum distance between two consecutive TE insertions
- --output a pgd-file (population genome definition) that may be edited by the user; for details see [describing_TE_landscapes]
define-landscape_random-insertions-freq-range.py
The script creates a TE landscape in which each TE insertion has a random position, strand and population frequency.
- --chassis a fasta file containing a single sequence, i.e. the chassis.
- --te-seqs a fasta file that may contain one or multiple entries; this can be used to define the TE sequences that will be inserted into the chassis; for example, the consensus sequences of TE families could be provided
- --N the population size (number of haploid genomes)
- [--insert-count] number of TE insertions to create
- [--min-distance] a minimum distance between two consecutive TE insertions
- --output a pgd-file (population genome definition); for details see [describing_TE_landscapes]
- --min-freq minimum population frequency of TE insertions
- --max-freq maximum population frequency of TE insertions
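How --N, --min-freq and --max-freq relate to each other can be illustrated with a short sketch. It assumes that the frequency of each insertion is drawn uniformly between the two bounds and then rounded to a number of haploid genomes; the sampling scheme actually used by the script is not described here, so this only illustrates the meaning of the parameters. The helper name carriers_for_insertion is hypothetical.

```python
import random


def carriers_for_insertion(n_haploid, min_freq, max_freq, rng):
    """Draw a population frequency and convert it to a number of carriers.

    Assumption: uniform sampling in [min_freq, max_freq].
    """
    freq = rng.uniform(min_freq, max_freq)
    count = round(freq * n_haploid)
    return freq, count


rng = random.Random(42)  # fixed seed for reproducibility
for _ in range(3):
    freq, count = carriers_for_insertion(100, 0.1, 0.9, rng)
    print(f"frequency {freq:.3f} -> present in {count} of 100 haploid genomes")
```

With a small --N only a few frequencies are attainable; for example, with N = 10 every insertion frequency is a multiple of 0.1.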
Tier 2: build the population genome
build-population-genome.py
This script generates the population genome based on the description of a TE landscape (obtained from Tier 1). The population genome contains all haploid genomes of a population.
More details can be found here [describing_TE_landscapes]
- [--chassis] a fasta file containing a single sequence, i.e. the chassis.
- [--te-seqs] a fasta file that may contain one or multiple TE sequences that will be inserted into the chassis
- --pgd a population-genome-definition (pgd) file; the definition of the TE landscape; for details see [describing_TE_landscapes]
- --output the output file, the population genome, i.e. a multiple fasta file containing all haploid genomes of a population; for details see [describing_TE_landscapes]
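Because the population genome is an ordinary multiple fasta file, a quick sanity check is to count its entries, which should match the number of haploid genomes in the population, and to list their lengths. The following is a minimal sketch with a hand-written reader; the headers in the toy input are invented, and the headers written by the script may look different.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a fasta file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy stand-in for a population genome file (made-up content).
toy_pg = io.StringIO(">hap1\nACGT\nAC\n>hap2\nACGTTTAC\n")
genomes = list(read_fasta(toy_pg))
print(len(genomes), "haploid genomes")
for name, seq in genomes:
    print(name, len(seq))
```

For a real file, replace the StringIO object with open("mylandscape.pg").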
Tier 3: simulate the reads
Based on the population genome, reads are simulated that have the properties of various sequencing technologies.
Two main applications can be distinguished: scripts simulating Pool-Seq data (read_pool-seq_) and scripts for simulating sequencing of individuals separately (read_individual_).
read_pool-seq_illumina-SE.py
The script generates Illumina single-end reads (SE) for a pooled population (Pool-Seq).
- --pg the population genome file
- --read-length the length of the reads
- [--error-rate] the fraction of sequencing errors that will be introduced into the reads; only base substitutions; default = 0.0
- --reads the total number of reads to generate
- --fastq the output file; reads will be in the fastq format
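A suitable value for --reads can be derived from the desired coverage. The following is a minimal sketch using the usual relation coverage = reads * read_length / genome_length; it assumes that the length of the chassis is a good approximation for the length of each haploid genome, and the helper name reads_for_coverage is illustrative.

```python
import math


def reads_for_coverage(target_coverage, genome_length, read_length):
    """Number of single-end reads needed to reach a mean coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# 1 Mb chassis, 100 bp reads, 50-fold coverage of the pool
print(reads_for_coverage(50, 1_000_000, 100))  # 500000
```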
read_individual_illumina-SE.py
The script generates Illumina single-end reads (SE) for sequencing individuals separately, either haploids or diploids.
- --pg the population genome file
- --read-length the length of the reads
- [--error-rate] the fraction of sequencing errors that will be introduced into the reads; only base substitutions; default = 0.0
- --reads number of reads to generate per individual
- [--haploid] flag; specify if individuals are haploid; if not provided, diploids will be used; two consecutive haploid genomes in the --pg file will constitute the genome of one diploid
- --fastq-prefix the prefix of the output files; a separate fastq-file will be generated for each individual
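The pairing rule for diploids described for --haploid can be sketched as follows. The haploid genome names are invented, the helper name pair_into_diploids is illustrative, and raising an error for an odd number of haploid genomes is an assumption of this sketch, not documented behaviour of the script.

```python
def pair_into_diploids(haploid_names):
    """Group consecutive haploid genomes into diploid individuals."""
    if len(haploid_names) % 2 != 0:
        # Assumption: an odd number of haploid genomes is treated as an error.
        raise ValueError("diploids need an even number of haploid genomes")
    return [
        (haploid_names[i], haploid_names[i + 1])
        for i in range(0, len(haploid_names), 2)
    ]


individuals = pair_into_diploids(["hap1", "hap2", "hap3", "hap4"])
for number, pair in enumerate(individuals, start=1):
    print("individual", number, pair)
```

A population genome with N haploid genomes therefore yields N/2 diploid individuals, or N individuals when --haploid is set.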
read_pool-seq_illumina-PE.py
The script generates Illumina paired-end reads (PE) for a pooled population (Pool-Seq).
- --pg the population genome file
- --read-length the length of the reads
- --inner-distance the mean of the inner distance between paired-end reads (fragment size = 2 * read_length + inner_distance)
- --std-dev the standard deviation of the inner distance
- [--error-rate] the fraction of sequencing errors that will be introduced into the reads; only base substitutions; default = 0.0
- [--fraction-chimera] the fraction of chimeric paired-end fragments to generate; chimeric reads are an artefact of Illumina library preparation and are derived from random genomic positions. Usually about 2% of the reads are chimeric. default = 0.0
- --reads the total number of reads to generate
- --fastq1 the output file for the first read; fastq format
- --fastq2 the output file for the second read; fastq format
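The relation between --read-length, --inner-distance and the fragment size can be made concrete with a short sketch. The formula is the one given for --inner-distance; drawing the inner distance from a normal distribution is an assumption of this sketch (the options only specify a mean and a standard deviation), and the helper names are illustrative.

```python
import random


def fragment_size(read_length, inner_distance):
    """Fragment size = 2 * read_length + inner_distance."""
    return 2 * read_length + inner_distance


def sample_fragment_sizes(read_length, mean_inner, std_dev, n, seed=0):
    """Sample n fragment sizes; inner distance ~ Normal(mean_inner, std_dev)."""
    rng = random.Random(seed)
    return [
        fragment_size(read_length, round(rng.gauss(mean_inner, std_dev)))
        for _ in range(n)
    ]


print(fragment_size(100, 50))  # 250
print(sample_fragment_sizes(100, 50, 20, 5))
```

Note that a negative inner distance corresponds to overlapping read pairs.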
read_individual_illumina-PE.py
The script generates Illumina paired-end reads (PE) for sequencing individuals separately, either haploids or diploids.
- --pg the population genome file
- --read-length the length of the reads
- --inner-distance the mean of the inner distance between paired-end reads (fragment size = 2 * read_length + inner_distance)
- --std-dev the standard deviation of the inner distance
- [--error-rate] the fraction of sequencing errors that will be introduced into the reads; only base substitutions; default = 0.0
- [--fraction-chimera] the fraction of chimeric paired-end fragments to generate; default = 0.0
- [--haploid] flag; specify if individuals are haploid; if not provided, diploids will be used; two consecutive haploid genomes in the --pg file will constitute the genome of one diploid
- --reads number of reads per individual
- --fastq-prefix the prefix of the output files; a separate fastq-file will be generated for each individual
read_pool-seq_pacbio.py
The script generates PacBio reads for a pooled population (Pool-Seq). The read length may either be drawn from a normal distribution (with mean and standard deviation) or from a user defined distribution (provided in a file).
- --pg the population genome file
- [--read-length] the mean read length, assuming a normal distribution of the read lengths
- [--std-dev] the standard deviation of the read lengths, assuming a normal distribution of the read lengths
- [--rld-file] the read length distribution file; any distribution of read lengths may be provided; if this option is provided --std-dev and --read-length will be ignored; see below for details on the rld-file
- [--error-rate] the fraction of sequencing errors that will be introduced into the reads; solely indels; default = 0.0
- [--deletion-fraction] PacBio sequencing errors are overwhelmingly indels, of which about half are deletions and the other half insertions; this parameter sets the fraction of deletions; 1 minus this fraction will be insertions; default = 0.5
- --reads the total number of reads to generate
- --fasta the output file; fasta format
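The interplay of --error-rate and --deletion-fraction can be illustrated with a small sketch. It is not the error model implemented in the script: it simply assumes that each base is hit by an error with probability error_rate and that an error is a deletion with probability deletion_fraction, otherwise a random base is inserted after the current base. The helper name add_indel_errors is hypothetical.

```python
import random


def add_indel_errors(read, error_rate, deletion_fraction, rng):
    """Introduce indel errors into a read (illustration only)."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            if rng.random() < deletion_fraction:
                continue  # deletion: drop this base
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after this base
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(7)  # fixed seed for reproducibility
read = "ACGTACGTACGTACGTACGT"
print(read)
print(add_indel_errors(read, 0.15, 0.5, rng))
```

With a deletion fraction of 0.5 the expected read length is unchanged; larger values shorten the reads on average.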
the rld-file
Example of rld-file (read length distribution):
The first column is the read length and the second the counts. Columns are separated by a tab.
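A file with this layout can be produced from any list of observed read lengths, for example those of an existing PacBio run. The following is a minimal sketch: the read lengths are made up, the helper names are illustrative, and it assumes that the file has no header line and that rows may be sorted by read length, neither of which is stated above.

```python
from collections import Counter


def rld_lines(read_lengths):
    """Return tab-separated 'length<TAB>count' lines for an rld-file."""
    counts = Counter(read_lengths)
    return [f"{length}\t{count}" for length, count in sorted(counts.items())]


def parse_rld(lines):
    """Parse rld lines back into a list of (length, count) tuples."""
    table = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        length, count = line.split("\t")
        table.append((int(length), int(count)))
    return table


observed = [1000, 1000, 2500, 8000, 8000, 8000]  # made-up read lengths
lines = rld_lines(observed)
print("\n".join(lines))
# To write the file: open("my.rld", "w").write("\n".join(lines) + "\n")
```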
Note: Oxford Nanopore (ONT) sequencing creates mostly deletions (75%); thus --deletion-fraction 0.75 could be used to emulate ONT reads.
read_individual_pacbio.py
The script generates PacBio reads for sequencing individuals separately. Either haploid or diploid individuals may be simulated.
- --pg the population genome file
- [--read-length] the mean read length, assuming a normal distribution of the read lengths
- [--std-dev] the standard deviation of the read lengths, assuming a normal distribution of the read length
- [--rld-file] the read length distribution file; any distribution of read lengths may be provided; if this option is provided --std-dev and --read-length will be ignored; see above for details on the rld-file
- [--error-rate] the fraction of sequencing errors that will be introduced into the reads; solely indels; default = 0.0
- [--deletion-fraction] PacBio sequencing errors are overwhelmingly indels, of which about half are deletions and the other half insertions; this parameter sets the fraction of deletions; 1 minus this fraction will be insertions; default = 0.5
- [--haploid] flag; specify if individuals are haploid; if not provided, diploids will be used; two consecutive haploid genomes in the --pg file will constitute the genome of one diploid
- --reads number of reads per individual
- --fasta-prefix the prefix of the output files; a separate fasta-file will be generated for each individual
Note: Oxford Nanopore (ONT) sequencing creates mostly deletions (75%); thus --deletion-fraction 0.75 could be used to emulate ONT reads.