SimulaTE Wiki

Brought to you by: rokofler

Walkthrough

Authors:

Attachments

rld.png (7650 bytes)

Walkthrough 1: Random insertions; Pool-Seq; Illumina paired-end sequencing
Walkthrough 2: Manual insertions; Individual sequencing; PacBio sequencing

Walkthrough 1: Random insertions; Pool-Seq; Illumina paired-end sequencing

Introduction

This short walkthrough demonstrates how to generate a random TE landscape and how to simulate Illumina paired-end reads for this landscape.
We simulate 1000 TE insertions in a population of N=100 haploid genomes. The TE insertions have a random position, family, strand and population frequency (between a range of 0.1 and 0.9). Finally we simulate Illumina paired-end data for a pooled population (Pool-Seq).

Download the data:

chassis, i.e. the sequence into which TEs will be inserted: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download

generating the population genome definition (pgd)

first we generate a pgd-file, i.e. a description of the TE landscape using our DSL (for details see https://sourceforge.net/p/simulates/wiki/Home/#manual)

python define-landscape_random-insertions-freq-range.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --insert-count 1000 --min-freq 0.1 --max-freq 0.9 --min-distance 500 --N 100 --output mylandscape.pgd

Note: we use a minimum distance of 500 between adjacent TE insertions

build the population genome

based on the pgd-file we generate the population genome

python build-population-genome.py --pgd mylandscape.pgd --te-seqs teseq-clean-ml100noS4.fasta --chassis chasis1M.fasta --output mylandscape.pg

simulate the reads

We simulate 100.000 Illumina paired-end reads, with a read length of 100bp, an inner distance of 100bp, a standard deviation of the inner distance of 20 and an error rate of 15

python read_pool-seq_illumina-PE.py --pg mylandscape.pg --read-length 100 --inner-distance 100 --std-dev 20 --error-rate 0.01 --reads 100000 --fastq1 reads_1.fastq --fastq2 reads_2.fastq

next steps

The obtained reads may be used as input for tools identifying TE insertions using Illumina data, such as PoPoolationTE2 or TEMP.
The reads may be used to i) validate tools for TE identification and ii) test whether a given tool is suitable for TE identification in a species of interest (a particular set of genomic resources).

An example, demonstrating TE identification with the simulated reads, can be found here: [Validation_Pop2]

Walkthrough 2: Manual insertions; Individual sequencing; PacBio sequencing

Introduction

Next we show how to generate a custom TE landscape and simulate PacBio reads for this landscape.
We simulate 5 TE insertions in a population of N=100 haploid genomes. The TE insertions have a random position. We manually specifiy the family the strand and the population frequency. Finally we simulate PacBio reads sequencing all individuals separately (assuming a diploid organism).

Download the data:

chassis, i.e. the sequence into which TEs will be inserted: https://sourceforge.net/projects/simulates/files/validation_pop2/chasis1M.fasta/download
TE sequences: https://sourceforge.net/projects/simulates/files/validation_pop2/teseq-clean-ml100noS4.fasta/download

Generate an empty template of a TE landscape

First we generate an empty pgd-file where we manually fill in the details for the TE insertions.

python define-landscape_template.py --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --N 100 --insert-count 5 --output custom-landscape.pgd

Define the landscape

Following the empty template landscape custom-lanscape.pgd

1=$1     # M14653
2=$2     # DME9736
...
120=$120     # Beagle2
121=$121     # Q
122=$122     # OSV
123=$123     # DME542581
150985 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
158736 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
273676 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
298778 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
439415 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

The first part defines the TE sequences and the second part specifies TE insertions in the 100 individuals, where a star indicates no insertion in a given haploid genome. For more details on the population genome definition file see [describing_TE_landscapes] and [describing_TE_sequences]

Next we edit the file (using any text editor of choice) and specifiy TE insertions similarly as in the following example (don't forget to safe your changes):

# Chasis 2R; Length 1000000 nt
mariner=$1     # M14653
idefix=$2     # DME9736
...
120=$120     # Beagle2
q=$121     # Q
120=$122     # OSV
123=$123     # DME542581
nest=120+{50:mariner-}
multinest=idefix-{100:mariner+,300:q+{20:idefix}}
150985 * * * * * * * * * * * * nest nest nest * * nest- * * * * * * * nest * * * * * * * * * * * * nest * * * * * * * nest- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
158736 * * * * * * * * multinest- * * multinest- * * * multinest- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
273676 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * idefix * * * * * * * * * idefix * * * * * * * idefix * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
298778 * q+ q+ q+ q+ q+ * * * * q+ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
439415 * idefix * * idefix * * * mariner * idefix * * * * * * * * * * mariner * * * * * * * * * mariner * * * * * * * * * * * mariner mariner * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Note that we re-named some insertions, like 1 into mariner; this was just done for convenience and is not required; dealing with mariner is just more intuitive than dealing with 1
Note that we generated two new insertions: nest is a mariner nested within OSV ; multinest are several nested insertions within idefix; for details on our DSL for describing TE sequences see [describing_TE_sequences]

insertion at position 150985: five individuals have a nest insertions; four on the plus strand one on the minus strand
insertion at position 158736: three individuals have a multinest insertion on the minus strand
insertion at position 273676: three individuals hava a idefix insertion on the plus strand
insertion at position 298778: six individuals have a q insertion on the plus strand
insertion at position 439415: five individuals have a mariner insertion on the plus strand and three a idefix insertion at the plus strand; the mariner and the idefix insertion are thus overlapping; In other words different individuals have different inserts at the same genomic site

build population genome

Based on our pgd-file we than build the population genome:

python ~/dev/simulate/build-population-genome.py --pgd custom-landscape.pgd --chassis chasis1M.fasta --te-seqs teseq-clean-ml100noS4.fasta --output custom-landscape.pg

simulate the reads

PacBio reads frequently have a bimodal read length distribution; In this example we use the following read length distribution: https://sourceforge.net/projects/simulates/files/validation_reads/rld.txt/download

which is visualized here:

python read_individual_pacbio.py --pg custom-landscape.pg --rld-file rld.txt --error-rate 0.1 --deletion-fraction 0.5 --reads 10000 --fasta-prefix reads

Additionally the reads will have an error rate of 10%, where half the errors are deletions and the other half insertions (--deletion-fraction); We have 100 haploid genomes thus 50 diploid individuals will be simulated (always two consecutive entries from the pgd-file). We simulate 10.000 reads for each diploid individual.

This command will generate the following 50 read files:

-rw-r--r--  1 robertkofler  staff  113067241 Jul  7 14:05 reads1.fasta
-rw-r--r--  1 robertkofler  staff  113444056 Jul  7 14:06 reads2.fasta
-rw-r--r--  1 robertkofler  staff  112483761 Jul  7 14:07 reads3.fasta
-rw-r--r--  1 robertkofler  staff  113818134 Jul  7 14:08 reads4.fasta
...
-rw-r--r--  1 robertkofler  staff  113177013 Jul  7 14:46 reads49.fasta
-rw-r--r--  1 robertkofler  staff  113611996 Jul  7 14:47 reads50.fasta

next steps

The generated reads may be used as input for tools identifying TE insertions using PacBio data, such as LoRTE https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-017-0088-x
The reads may be used to i) validate tools for TE identification and ii) test whether a given tool is suitable for TE identification in a species of interest.

Wiki: Home
Wiki: Validation_Pop2
Wiki: Walkthrough_species_tool_compatibility
Wiki: describing_TE_landscapes
Wiki: describing_TE_sequences

SimulaTE Wiki

Walkthrough

Walkthrough 1: Random insertions; Pool-Seq; Illumina paired-end sequencing

Introduction

Download the data:

generating the population genome definition (pgd)

build the population genome

simulate the reads

next steps

Walkthrough 2: Manual insertions; Individual sequencing; PacBio sequencing

Introduction

Download the data:

Generate an empty template of a TE landscape

Define the landscape

build population genome

simulate the reads

next steps

Related