Home
Name Modified Size InfoDownloads / Week
Cerulean_v_0_1.tar.gz 2013-09-12 4.4 MB
README 2013-08-15 3.7 kB
Totals: 2 Items   4.4 MB 0
Cerulean Hybrid Genome Assembler v0.1.1

This software extends contigs assembled using short read datasets like Illumina
paired-end reads using long reads like PacBio RS long reads.

The method is fully described in:
Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013).
Cerulean: A hybrid assembly using high throughput short and long reads.
arXiv preprint arXiv:1307.7933.

A] Requirements:
Ubuntu 12.04 (may run on other operating systems, but not tested)
Python 2.7.1 (may run on older versions, but not tested)
numpy, matplotlib libraries for Python
ABySS assembler: http://www.bcgsc.ca/platform/bioinfo/software/abyss
SMRT Analysis tookit (for BLASR): http://pacbiodevnet.com/
PBJelly: https://sourceforge.net/projects/pb-jelly/ 

B] Inputs and Pre-processing:
 i) Assembled contigs from ABySS short read assembler
 ii)Mapping of Pacbio reads to ABySS contigs using BLASR

 i) Assembly of Illumina paired-end reads:
   If the paired-end reads are stored in fastq format in the files reads1.fastq
   and reads2.fastq, then contigs may be assembled by:
   $ abyss-pe k=64 n=10 in='reads1.fastq reads2.fastq' name=<dataname>
   This will generate 2 files used for inputs to Cerulean:
   * <dataname>-contigs.fa    #This contains the contig sequences
   * <dataname>-contigs.dot   #This contains the graph structure

 ii)Mapping PacBio reads to ABySS contigs using BLASR:
   Note: sawriter and blasr are part of SMRT Analysis toolkit
   Note: You need to set the environmental variables and path:
   $ export SEYMOUR_HOME=/opt/smrtanalysis/
   $ source $SEYMOUR_HOME/etc/setup.sh
   
   Suppose PacBio reads are stored in <dataname>_pacbio.fasta
   $ sawriter <dataname>-contigs.fa
   $ blasr <dataname>_pacbio.fa <dataname>-contigs.fa -minMatch 10 \
     -minPctIdentity 70 -bestn 30 -nCandidates 30 -maxScore -500 \
     -nproc <numthreads> -noSplitSubreads \
     -out <dataname>_pacbio_contigs_mapping.fasta.m4
   
   Make sure the fasta.m4 file generated has the following format:
   qname tname qstrand tstrand score pctsimilarity tstart tend tlength \
   qstart qend qlength ncells
   The file format may be verified by adding the option -header to blasr. 

C] Execute:
 Cerulean requires that all input files are in the same directory <basedir>:
 i)   <basedir>/<dataname>-contigs.fa
 ii)  <basedir>/<dataname>-contigs.dot
 iii) <basedir>/<dataname>_pacbio_contigs_mapping.fasta.m4

 To run:
 $ python src/Cerulean.py --dataname <dataname> --basedir <basedir> \
 --nproc <numthreads>
 
 This will generate:
 i)  <basedir>_cerulean.fasta
 ii) <basedir>_cerulean.dot
 Note: The dot does not have same contigs as fasta, but intermediate graph.
 
 
D] Post-processing:
 Currently Cerulean does not include consensus sequence of PacBio reads in gaps
 The gaps may be filled using PBJelly.
 $ python $JELLYPATH/fakeQuals.py <dataname>_cerulean.fasta <dataname>_cerulean.qual
 $ python $JELLYPATH/fakeQuals.py <dataname>_pacbio.fasta <dataname>_pacbio.qual
 $ cp $JELLYPATH/lambdaExample/Protocol.xml .
 $ mkdir PBJelly
 Modify Protocol.xml as follows:
 Set <reference> to $PATH_TO_<basedir>/<dataname>_cerulean.fasta 
 Set <outputDir> to $PATH_TO_<basedir>/PBJelly
 Set <baseDir> to $PATH_TO_<basedir>
 Set <job> to <dataname>_pacbio.fasta
 Set <blasr> option -nproc <numthreads> 
 Note: PBJelly requires that the suffix be .fasta and not .fa
 Next run PBJelly:
 ($ source $JELLYPATH/exportPaths.sh)
 $ python $JELLYPATH/Jelly.py <stage> Protocol.xml
 where <stage> has to be in the order:
 setup
 mapping
 support
 extraction
 assembly
 output
 
 The assembled contigs may be view in <basedir>/PBJelly/assembly/jellyOutput.fasta

In case of any questions or errors please contact vdeshpan eng DT ucsd DT edu
Source: README, updated 2013-08-15