ReadSim Wiki

Simple reads simulator for pacbio & nanopore

Brought to you by: hayanlee

manual

Authors:

Download
Installation
Start Using Simple Script
Settings for Long Reads Sequencing Technology
Inside the Script
- dnasim.py (for heterozygous polyploidy genome)
- readsim.py (for reads simulation)
  - More Options

Download

Please use this link to download the latest version.

Installation

$ tar xvfz readsim-1.x.tar.gz 
$ cd readsim-1.x
readsim-1.x$ ls
example   src

Download the latest version.
Extract the tar file.
If successful, you will see the readsim-1.x directory.

Start Using Simple Script

The example/script directory contains ready-to-run scripts, including one each for pacbio, pacbio_ec and nanopore.
The included scripts give you a simple starting point.
A script runs for around 5-10 minutes, depending on your hardware.
To get the result, follow the steps below.

readsim-1.6$ cd example/
readsim-1.6/example$ ls
ecoli  pacbio.p5-c3.len.dist  script
readsim-1.6/example$ cd script/
readsim-1.6/example/script$ ls
run_dnasim.ecoli.sh  
run_readsim.ecoli.nanopore.sh  
run_readsim.ecoli.pacbio_ec.sh
run_readsim.ecoli.pacbio.sh
run_readsim.ecoli.perfect.sh
run_readsim.ecoli.uniform.sh


readsim-1.6/example/script$ ./run_readsim.ecoli.pacbio.sh
================================================================================
                         Read Simulator for Long Reads

Sequencing Technology : pacbio
FASTA file path : NC_000913.fna
Reads : mean(2000bp) distribution(exp)
Coverage : mean(5.0x)
Mutation (Substitution) : mean(1.0%)
Mutation (Insertion)    : mean(12.0%)
Mutation (Deletion)     : mean(2.0%)
================================================================================
[INFO:002] 2014-11-21 01:26:32 reading NC_000913.fna
[INFO:002] 2014-11-21 01:26:32 done.
[INFO:002] 2014-11-21 01:26:32 generating length given distribution.
[INFO:003] 2014-11-21 01:26:32 NC_000913.pacbio.reads.fa is created
[INFO:012] 2014-11-21 01:26:32 position 4896 has been processed. (0.00x)
[INFO:012] 2014-11-21 01:26:34 position 1000417 has been processed. (0.22x)
[INFO:012] 2014-11-21 01:26:36 position 2000480 has been processed. (0.43x)
[INFO:012] 2014-11-21 01:26:37 position 3001390 has been processed. (0.65x)
[INFO:012] 2014-11-21 01:26:39 position 4001424 has been processed. (0.86x)
[INFO:012] 2014-11-21 01:26:41 position 5001934 has been processed. (1.08x)
[INFO:012] 2014-11-21 01:26:42 position 6000558 has been processed. (1.29x)

.
.
.
[INFO:012] 2014-11-21 01:27:06 position 20000187 has been processed. (4.31x)
[INFO:012] 2014-11-21 01:27:07 position 21000389 has been processed. (4.53x)
[INFO:012] 2014-11-21 01:27:09 position 22000240 has been processed. (4.74x)
[INFO:012] 2014-11-21 01:27:11 position 23003796 has been processed. (4.96x)
[INFO:012] 2014-11-21 01:27:11 position 23199876 has been processed. (5.00x)
[INFO:010] Total 11367 reads are generated; 5669 is forward, 5698 is reversed
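
The coverage figures in the log relate read count, read length and genome size by simple arithmetic. The sketch below is only an illustration under two assumptions that this manual does not state: that the `position` counter is the cumulative number of simulated bases, and that the genome length can be inferred from the final log line (23199876 bases at 5.00x). The function names are hypothetical and are not part of readsim.py.

```python
def expected_read_count(genome_len: int, coverage: float, mean_read_len: float) -> float:
    """Approximate number of reads needed to reach the requested coverage."""
    return coverage * genome_len / mean_read_len


def realized_coverage(total_bases: int, genome_len: int) -> float:
    """Coverage achieved by a given number of simulated bases."""
    return total_bases / genome_len


# Values copied from the log above.
TOTAL_BASES = 23_199_876
TOTAL_READS = 11_367
# Assumption: genome length inferred from the last log line (bases / coverage).
GENOME_LEN = round(TOTAL_BASES / 5.0)

if __name__ == "__main__":
    expected = expected_read_count(GENOME_LEN, coverage=5.0, mean_read_len=2000)
    print(f"inferred genome length   : {GENOME_LEN} bp")
    print(f"expected number of reads : {expected:.0f}")
    print(f"observed mean read length: {TOTAL_BASES / TOTAL_READS:.0f} bp")
```

Under these assumptions the expected read count lands close to the 11367 reads reported in the log, which is a quick sanity check when you change `read_mu` or `cov_mu`.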

Settings for Long Reads Sequencing Technology

Pacbio

Average read length : 5,000 bp
Read distribution : exponential distribution
Coverage : 10x
Average error rate for substitution : 1%
Average error rate for insertion : 12%
Average error rate for deletion : 2%

Pacbio Error Corrected

Average read length : 5,000 bp
Read distribution : exponential distribution
Coverage : 10x
Average error rate for substitution : 0.33%
Average error rate for insertion : 0.33%
Average error rate for deletion : 0.33%

Nanopore

Average read length : 100,000 bp
Read distribution : normal distribution
Coverage : 10x
Average error rate for substitution : 3%
Average error rate for insertion : 3%
Average error rate for deletion : 3%
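
The three presets above can be written down as a small lookup table, which makes it easy to compare the combined error rate of each technology. This is a sketch only: the `TechProfile` class and `PROFILES` dictionary are hypothetical, the field names are borrowed from the option names used later in this manual, and nothing here is taken from the readsim.py source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechProfile:
    """One preset, with rates expressed as fractions (0.01 means 1%)."""
    read_mu: int       # average read length in bp
    read_dist: str     # read length distribution
    cov_mu: float      # coverage
    err_sub_mu: float  # average substitution rate
    err_in_mu: float   # average insertion rate
    err_del_mu: float  # average deletion rate

    @property
    def total_error(self) -> float:
        """Sum of the substitution, insertion and deletion rates."""
        return self.err_sub_mu + self.err_in_mu + self.err_del_mu


# Values transcribed from the settings listed above.
PROFILES = {
    "pacbio": TechProfile(5_000, "exp", 10.0, 0.01, 0.12, 0.02),
    "pacbio_ec": TechProfile(5_000, "exp", 10.0, 0.0033, 0.0033, 0.0033),
    "nanopore": TechProfile(100_000, "normal", 10.0, 0.03, 0.03, 0.03),
}

if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(f"{name:10s} total error {profile.total_error:.2%}")
```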

Inside the Script

dnasim.py (for heterozygous polyploidy genome)

../../src/dnasim.py --ploidy 10 --het 0.05 --pre [prefix]  [ref]
../../src/dnasim.py --ploidy 10 --het 0.05 --pre NC_000913 NC_000913.fna

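ploidy - the ploidy of the simulated genome, i.e. the number of genome copies (10 in the example above)
het - heterozygosity level; 0.05 means a 5% difference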
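
To make the two parameters concrete, here is a minimal sketch of one way a heterozygous polyploid genome could be simulated. It assumes that each copy is derived independently from the reference and that differences are substitutions only; the manual does not say how dnasim.py actually does this, and the function names below are hypothetical.

```python
import random

BASES = "ACGT"


def mutate_haplotype(ref: str, het: float, rng: random.Random) -> str:
    """Return a copy of ref in which each base is substituted with probability het."""
    out = []
    for base in ref:
        if base in BASES and rng.random() < het:
            # Pick a different base so that every event is a real difference.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_polyploid(ref: str, ploidy: int, het: float, seed: int = 0) -> list:
    """Generate `ploidy` haplotypes, each differing from ref by about `het`."""
    rng = random.Random(seed)
    return [mutate_haplotype(ref, het, rng) for _ in range(ploidy)]


if __name__ == "__main__":
    reference = "ACGT" * 2500
    for i, hap in enumerate(simulate_polyploid(reference, ploidy=10, het=0.05)):
        diff = sum(a != b for a, b in zip(reference, hap)) / len(reference)
        print(f"haplotype {i}: {diff:.2%} different from the reference")
```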

readsim.py (for reads simulation)

The main script. When you run it with your own command or modify the script, please make sure the paths are correct. This is the script for Pacbio:

for l in 2000; do
  for c in 5; do 
    ../../src/readsim.py sim fa \
    --ref NC_000913.fna \
    --pre NC_000913.pacbio.reads \
    --rev_strd on \
    --tech pacbio --read_mu $l --cov_mu $c
  done;
done;
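
If you prefer to drive several runs from Python instead of a shell loop, the same command line can be assembled programmatically. The sketch below only builds argument lists and does not run anything; it uses just the flags that appear in the script above. The helper names and the idea of appending the length and coverage to the prefix (so that runs do not share an output name) are assumptions for illustration.

```python
from itertools import product


def build_readsim_cmd(ref, prefix, tech, read_mu, cov_mu,
                      out_format="fa", rev_strd="on",
                      script="../../src/readsim.py"):
    """Build the argument list for one readsim.py run (nothing is executed)."""
    return [
        script, "sim", out_format,
        "--ref", ref,
        "--pre", prefix,
        "--rev_strd", rev_strd,
        "--tech", tech,
        "--read_mu", str(read_mu),
        "--cov_mu", str(cov_mu),
    ]


def build_grid(ref, base_prefix, tech, lengths, coverages):
    """One command per (read length, coverage) pair, each with its own prefix."""
    commands = []
    for length, cov in product(lengths, coverages):
        # Hypothetical naming scheme that keeps the outputs of each run apart.
        prefix = f"{base_prefix}.l{length}.c{cov}"
        commands.append(build_readsim_cmd(ref, prefix, tech, length, cov))
    return commands


if __name__ == "__main__":
    for cmd in build_grid("NC_000913.fna", "NC_000913.pacbio.reads",
                          "pacbio", lengths=[2000], coverages=[5]):
        print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from the example/script directory, where the relative path to readsim.py is valid.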

sim - The main command, meaning that reads will be simulated
- fa - generate a file containing bases only. You will see a .fasta file as a result
- fq - generate a single file containing both bases and quality values. You will see a .fastq file as a result
- fafq - generate separate files for bases and quality values. You will see .fa and .fq files as a result
ref - Reference genome. When you run your own command or modify the script, please make sure the path is correct
pre - The prefix. The final file name will be prefix.{fasta|fastq|fa|fq}
rev_strd - on | off. 'on' creates reverse-strand reads as well as forward-strand reads, randomly about half and half. 'off' creates forward-strand reads only.
tech - The sequencing technology to simulate. Supported values are pacbio, pacbio_ec (pacbio error corrected) and nanopore
read_mu - The average read length
cov_mu - The overall coverage
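
The effect of rev_strd can be illustrated with a short sketch: a fragment is cut from the reference and, when the option is on, reverse-complemented with probability one half. This is an assumption about the behaviour based on the description above, not code from readsim.py, and `sample_read` is a hypothetical helper.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_read(ref: str, read_len: int, rev_strd: bool, rng: random.Random):
    """Cut one error-free read from ref and return (sequence, strand)."""
    read_len = min(read_len, len(ref))
    start = rng.randrange(len(ref) - read_len + 1)
    fragment = ref[start:start + read_len]
    # With rev_strd on, about half of the reads come from the reverse strand.
    if rev_strd and rng.random() < 0.5:
        return reverse_complement(fragment), "-"
    return fragment, "+"


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "".join(rng.choice("ACGT") for _ in range(200))
    for _ in range(5):
        seq, strand = sample_read(reference, 30, rev_strd=True, rng=rng)
        print(strand, seq)
```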

More Options

read_mu - The average read length
read_dist - The overall distribution of read lengths. Choose among {uniform, normal, exp}. It can also take a file in which each line contains the length of one read. Using a read length file can make the simulated reads more realistic (v1.5).
cov_mu - The overall coverage
err_sub_mu - The average substitution rate
err_in_mu - The average insertion rate
err_del_mu - The average deletion rate
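
The sketch below shows how options like these could interact in a simulator: a read length is drawn from the chosen distribution (or from a list of lengths loaded from a file), and errors are then applied base by base. It is an illustration only. The spread of the uniform and normal distributions, the per-base error model, and both function names are assumptions, since the manual specifies only the averages.

```python
import random

BASES = "ACGT"


def sample_read_length(read_mu, read_dist, rng, lengths_from_file=None):
    """Draw one read length with mean read_mu from the chosen distribution."""
    if lengths_from_file:
        # Empirical distribution: one observed length per line of the file.
        return rng.choice(lengths_from_file)
    if read_dist == "uniform":
        # Assumed range 1 .. 2*read_mu - 1, whose mean is read_mu.
        return rng.randint(1, 2 * read_mu - 1)
    if read_dist == "normal":
        # Assumed standard deviation of 10% of the mean.
        return max(1, round(rng.gauss(read_mu, 0.1 * read_mu)))
    if read_dist == "exp":
        return max(1, round(rng.expovariate(1.0 / read_mu)))
    raise ValueError(f"unknown read_dist: {read_dist}")


def apply_errors(seq, err_sub_mu, err_in_mu, err_del_mu, rng):
    """Apply insertions, deletions and substitutions independently per base."""
    out = []
    for base in seq:
        if rng.random() < err_in_mu:
            out.append(rng.choice(BASES))  # insertion before this base
        r = rng.random()
        if r < err_del_mu:
            continue  # deletion: the base is dropped
        if r < err_del_mu + err_sub_mu:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    length = sample_read_length(2000, "exp", rng)
    read = "".join(rng.choice(BASES) for _ in range(length))
    noisy = apply_errors(read, 0.01, 0.12, 0.02, rng)
    print(f"sampled length {length}, length after errors {len(noisy)}")
```

With the Pacbio rates the noisy read tends to be longer than the original fragment, because insertions (12%) outnumber deletions (2%).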