manual

Hayan Lee

Download

Please use this link to download the latest version.


Installation

$ tar xvfz readsim-1.x.tar.gz 
$ cd readsim-1.x
readsim-1.x$ ls
example   src
  • Download the latest version.
  • Extract the tar file.
  • If successful, you will see the readsim-1.x directory.

Start Using Simple Script

The example/script directory includes three technology-specific scripts, one each for pacbio, pacbio_ec and nanopore.
The included scripts give you a simple starting point.
Each script runs for around 5-10 min, depending on your hardware.
To get the result, follow the steps below.

readsim-1.6$ cd example/
readsim-1.6/example$ ls
ecoli  pacbio.p5-c3.len.dist  script
readsim-1.6/example$ cd script/
readsim-1.6/example/script$ ls
run_dnasim.ecoli.sh  
run_readsim.ecoli.nanopore.sh  
run_readsim.ecoli.pacbio_ec.sh
run_readsim.ecoli.pacbio.sh
run_readsim.ecoli.perfect.sh
run_readsim.ecoli.uniform.sh


readsim-1.6/example/script$ ./run_readsim.ecoli.pacbio.sh
================================================================================
                         Read Simulator for Long Reads

Sequencing Technology : pacbio
FASTA file path : NC_000913.fna
Reads : mean(2000bp) distribution(exp)
Coverage : mean(5.0x)
Mutation (Substitution) : mean(1.0%)
Mutation (Insertion)    : mean(12.0%)
Mutation (Deletion)     : mean(2.0%)
================================================================================
[INFO:002] 2014-11-21 01:26:32 reading NC_000913.fna
[INFO:002] 2014-11-21 01:26:32 done.
[INFO:002] 2014-11-21 01:26:32 generating length given distribution.
[INFO:003] 2014-11-21 01:26:32 NC_000913.pacbio.reads.fa is created
[INFO:012] 2014-11-21 01:26:32 position 4896 has been processed. (0.00x)
[INFO:012] 2014-11-21 01:26:34 position 1000417 has been processed. (0.22x)
[INFO:012] 2014-11-21 01:26:36 position 2000480 has been processed. (0.43x)
[INFO:012] 2014-11-21 01:26:37 position 3001390 has been processed. (0.65x)
[INFO:012] 2014-11-21 01:26:39 position 4001424 has been processed. (0.86x)
[INFO:012] 2014-11-21 01:26:41 position 5001934 has been processed. (1.08x)
[INFO:012] 2014-11-21 01:26:42 position 6000558 has been processed. (1.29x)

.
.
.
[INFO:012] 2014-11-21 01:27:06 position 20000187 has been processed. (4.31x)
[INFO:012] 2014-11-21 01:27:07 position 21000389 has been processed. (4.53x)
[INFO:012] 2014-11-21 01:27:09 position 22000240 has been processed. (4.74x)
[INFO:012] 2014-11-21 01:27:11 position 23003796 has been processed. (4.96x)
[INFO:012] 2014-11-21 01:27:11 position 23199876 has been processed. (5.00x)
[INFO:010] Total 11367 reads are generated; 5669 is forward, 5698 is reversed
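
The coverage figures in the log can be checked by hand: coverage is the number of simulated bases divided by the genome length. The Python sketch below redoes that arithmetic with the numbers printed above. It assumes that the "position" counter is the running total of simulated bases and that NC_000913 is 4,639,675 bp long (neither is stated on this page); the function names are hypothetical and not part of readsim.

```python
# Back-of-the-envelope check of the log above; not part of readsim.

GENOME_LEN = 4_639_675  # assumed length of NC_000913 in bp (not stated in this manual)


def coverage(total_bases, genome_len):
    """Coverage = simulated bases / genome length."""
    return total_bases / genome_len


def expected_reads(genome_len, cov_mu, read_mu):
    """Rough number of reads needed to reach cov_mu with mean length read_mu."""
    return cov_mu * genome_len / read_mu


# Values taken from the last lines of the log.
total_bases = 23_199_876
total_reads = 11_367

final_cov = coverage(total_bases, GENOME_LEN)
mean_len = total_bases / total_reads
rough_reads = expected_reads(GENOME_LEN, cov_mu=5.0, read_mu=2000)

print(f"final coverage   : {final_cov:.2f}x")
print(f"mean read length : {mean_len:.0f} bp")
print(f"expected reads   : {rough_reads:.0f}")
```

Under these assumptions the observed mean read length and read count should land close to, but not exactly on, the requested values, since the read lengths are drawn at random.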

Settings for Long Reads Sequencing Technology

Pacbio

  • Average read length : 5,000 bp
  • Read distribution : exponential distribution
  • Coverage : 10x
  • Average error rate for substitution : 1%
  • Average error rate for insertion : 12%
  • Average error rate for deletion : 2%

Pacbio Error Corrected

  • Average read length : 5,000 bp
  • Read distribution : exponential distribution
  • Coverage : 10x
  • Average error rate for substitution : 0.33%
  • Average error rate for insertion : 0.33%
  • Average error rate for deletion : 0.33%

Nanopore

  • Average read length : 100,000 bp
  • Read distribution : normal distribution
  • Coverage : 10x
  • Average error rate for substitution : 3%
  • Average error rate for insertion : 3%
  • Average error rate for deletion : 3%
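
For a quick side-by-side comparison, the three settings above can be written down as a small table in code. This Python sketch only restates the lists above; the dictionary layout is made up for illustration (the keys borrow the option names used later in this manual) and does not reflect how readsim stores its defaults.

```python
# Restatement of the settings listed above; the structure is illustrative only.
PRESETS = {
    "pacbio": {
        "read_mu": 5_000, "read_dist": "exp", "cov_mu": 10,
        "err_sub_mu": 0.01, "err_in_mu": 0.12, "err_del_mu": 0.02,
    },
    "pacbio_ec": {
        "read_mu": 5_000, "read_dist": "exp", "cov_mu": 10,
        "err_sub_mu": 0.0033, "err_in_mu": 0.0033, "err_del_mu": 0.0033,
    },
    "nanopore": {
        "read_mu": 100_000, "read_dist": "normal", "cov_mu": 10,
        "err_sub_mu": 0.03, "err_in_mu": 0.03, "err_del_mu": 0.03,
    },
}


def total_error_rate(preset):
    """Sum of the substitution, insertion and deletion rates."""
    return preset["err_sub_mu"] + preset["err_in_mu"] + preset["err_del_mu"]


for tech, preset in PRESETS.items():
    print(f"{tech:10s} total error rate: {total_error_rate(preset):.2%}")
```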

Inside the Script

dnasim.py (for heterozygous polyploid genomes)

../../src/dnasim.py --ploidy 10 --het 0.05 --pre [prefix]  [ref]
../../src/dnasim.py --ploidy 10 --het 0.05 --pre NC_000913 NC_000913.fna
  • ploidy - the ploidy level; 10 in the example above
  • het - the heterozygosity level; 0.05 means 5% difference
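
To make the meaning of --het 0.05 concrete, the sketch below derives a copy of a sequence in which roughly 5% of the positions differ from the original. It is a minimal Python illustration that assumes substitutions only and independent positions; how dnasim.py actually introduces differences is not described here, and the function names are hypothetical.

```python
import random


def mutate_copy(seq, het, rng):
    """Return a copy of seq where each base is substituted with probability het."""
    bases = "ACGT"
    out = []
    for base in seq:
        if rng.random() < het:
            # Pick a different base so the position really differs.
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


def fraction_different(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice("ACGT") for _ in range(100_000))
haplotype = mutate_copy(reference, het=0.05, rng=rng)

print(f"observed difference: {fraction_different(reference, haplotype):.3f}")
```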

readsim.py (for reads simulation)

The main script. When you run your own command or modify the script, please make sure the paths are correct. This is the script for Pacbio:

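# Loop over mean read lengths (bp) and mean coverages; one value each here.
for l in 2000; do
  for c in 5; do
    # Simulate bases-only (fa) Pacbio reads from the E. coli reference.
    ../../src/readsim.py sim fa \
    --ref NC_000913.fna \
    --pre NC_000913.pacbio.reads \
    --rev_strd on \
    --tech pacbio --read_mu $l --cov_mu $c
  done;
done;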
  • sim - The main command; it tells readsim.py to simulate reads. It is followed by the output type:
    • fa - generate a bases-only reads file. You will see a .fasta file as a result.
    • fq - generate bases and quality values in the same file. You will see a .fastq file as a result.
    • fafq - generate the bases and the quality values as two separate files. You will see a .fa and a .fq file as results.
  • ref - The reference genome. When you run your own command or modify the script, please make sure the path is correct.
  • pre - The output prefix. The final file name will be prefix.{fasta|fastq|fa|fq}.
  • rev_strd - on | off. 'on' creates reverse strands as well as forward strands, chosen at random, about half and half. 'off' creates no reverse strands; every read is a forward strand.
  • tech - The sequencing technology to simulate. Supported values are pacbio, pacbio_ec (pacbio error corrected) and nanopore.
  • read_mu - The mean read length
  • cov_mu - The overall coverage
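
If you prefer to drive readsim.py from Python rather than from a shell loop, the command line can be assembled as a list and handed to subprocess. The sketch below uses only the options described above; the helper name build_readsim_cmd and the relative path are assumptions to adapt to your own layout.

```python
import subprocess


def build_readsim_cmd(ref, pre, tech, read_mu, cov_mu,
                      rev_strd="on", out_type="fa",
                      readsim="../../src/readsim.py"):
    """Assemble a readsim.py command line from the options described above."""
    if out_type not in ("fa", "fq", "fafq"):
        raise ValueError(f"unknown output type: {out_type}")
    if rev_strd not in ("on", "off"):
        raise ValueError(f"rev_strd must be 'on' or 'off', got {rev_strd}")
    return [
        readsim, "sim", out_type,
        "--ref", ref,
        "--pre", pre,
        "--rev_strd", rev_strd,
        "--tech", tech,
        "--read_mu", str(read_mu),
        "--cov_mu", str(cov_mu),
    ]


cmd = build_readsim_cmd("NC_000913.fna", "NC_000913.pacbio.reads",
                        tech="pacbio", read_mu=2000, cov_mu=5)
print(" ".join(cmd))

# Uncomment to actually run the simulation (requires the readsim sources):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the reference path contains spaces.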

More Options

  • read_mu - The mean read length
  • read_dist - The distribution of read lengths. Choose among {uniform, normal, exp}. It can also take a file in which each line holds the length of one read. Using a read length file can make the simulated reads more realistic (v1.5).
  • cov_mu - The overall coverage
  • err_sub_mu - The mean substitution rate
  • err_in_mu - The mean insertion rate
  • err_del_mu - The mean deletion rate
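
As an illustration of what read_dist controls, the following Python sketch draws read lengths with a given mean from each of the three named distributions, and parses a length file with one length per line. The spread used for uniform and normal is an assumption (the manual does not state it), the function names are hypothetical, and none of this is readsim's own code.

```python
import random


def sample_lengths(read_dist, read_mu, n, rng, sd_frac=0.1):
    """Draw n read lengths with mean read_mu from uniform, normal or exp."""
    lengths = []
    for _ in range(n):
        if read_dist == "exp":
            x = rng.expovariate(1.0 / read_mu)
        elif read_dist == "normal":
            # Assumed standard deviation: sd_frac * read_mu.
            x = rng.gauss(read_mu, sd_frac * read_mu)
        elif read_dist == "uniform":
            # Assumed range: 1 .. 2*read_mu, so that the mean is about read_mu.
            x = rng.uniform(1, 2 * read_mu)
        else:
            raise ValueError(f"unknown distribution: {read_dist}")
        lengths.append(max(1, round(x)))  # a read has at least one base
    return lengths


def lengths_from_lines(lines):
    """Parse a read length file: one integer length per non-empty line."""
    return [int(line) for line in lines if line.strip()]


rng = random.Random(1)  # fixed seed so the run is repeatable
for dist in ("uniform", "normal", "exp"):
    sample = sample_lengths(dist, read_mu=2000, n=20_000, rng=rng)
    print(f"{dist:8s} mean length: {sum(sample) / len(sample):.0f}")

print(lengths_from_lines(["1500\n", "3200\n", "\n", "800\n"]))
```

All three distributions share the same mean here; they differ in how widely the individual read lengths scatter around it.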
