The Hybrid Assembly Pipeline automatically assembles bacterial genomes using combinations of
short read sequence data. This approach was used in the article
"De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas
syringae pv. oryzae"
published in Genome Research in February 2009 (http://genome.cshlp.org/content/19/2/294.long).
This article showed that a bacterial genome could be assembled de novo to an N50 scaffold size of
over 100kb using a single lane of Illumina reads, 1/4 plate of 454 long reads, and 1/4 plate of 454
paired ends.
This readme outlines how to use the hybrid assembly pipeline to de novo assemble multiple
sequence types. This readme makes repeated reference to the parameters.txt file - control of
assemblies occurs entirely through this file.
The readme is in four sections:
I) Requirements
II) Installation
III) Basic assemblies using VCAKE
IV) Custom assemblies
I) Requirements:
1) Hardware requirements:
5-10 GB of RAM are required to run VCAKE on one typical "lane" of Illumina data.
2) Software requirements (all must be in your path except for the pipeline scripts)
NEWBLER assembler (available from Roche biotechnology: )
BLAT (http://genome.ucsc.edu/FAQ/FAQblat)
BLASTALL (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/)
SOAPv1 (http://soap.genomics.org.cn/soap1/)
perl (http://www.perl.org/)
VCAKE (www.sourceforge.net/projects/vcake/)
Pipeline scripts - included in this package
soapsortnum and soapsorttext - included in this package
3) Data requirements
Illumina (or any short, <100bp reads) data (for de novo assembly with VCAKE and error
correction with SOAP and pipeline scripts)
Any combination of long reads and paired-end sequences that are compatible with NEWBLER
(please see NEWBLER documentation for instructions on formatting data for NEWBLER).
II) Installation
Untar the assembly pipeline and move it to wherever is most convenient for you:
$ tar -xf HybridAssemblyPipeline.tar
The wrapper script AssemblyPipeline_5.0.pl should be in the main directory along with a sample
Parameters.txt file, this README, and a directory containing all the pipeline scripts. Also included
are two C programs, soapsorttext.c and soapsortnum.c. To compile these programs:
$ gcc -o soapsortnum soapsortnum.c
$ gcc -o soapsorttext soapsorttext.c
Then, you must put these programs in your PATH for assembly to proceed.
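Before starting an assembly, it can help to confirm that the compiled helpers and the other required programs are visible on your PATH. The following is a minimal Python sketch, not part of the pipeline. The executable names in DEFAULT_TOOLS are assumptions (only soapsortnum and soapsorttext are fixed by the gcc commands above), so adjust the list to match your installation.

```python
import shutil

# Assumed executable names. Only soapsortnum and soapsorttext are fixed
# by the gcc commands in this README; edit the rest for your system.
DEFAULT_TOOLS = ["soapsortnum", "soapsorttext", "perl", "blat", "blastall"]


def missing_tools(names=DEFAULT_TOOLS):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

An empty result from missing_tools() only means the names were found; it does not check program versions.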
III) Basic Assemblies using VCAKE
You should start by making a copy of the Parameters_default.txt file named appropriately for
your assembly. You will be editing this file for your own data.
1) Inputting sequence data:
Each sequence data file is input as a single line in the Parameters.txt file with the appropriate
label. You can label sequences as "illumina:", "longreads:", or "pairedends:".
"illumina" can be any fasta containing sequences of length <50 bp, all of the same length.
If you have more than one Illumina sequence file, concatenate them all into a single file
before inputting them, and trim all sequences so they are the same length. The format must be
single-line fasta, e.g.
>header1
ATCGACGACGAGCAGCACGT
>header2
GAGGAGAGAGAGAGAGAGAG
...
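The concatenate-and-trim step described above can be sketched in Python as follows. This is an illustration only, not one of the pipeline scripts: the function names are hypothetical, and dropping reads shorter than the target length is an assumption about how you may want to handle them.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs, joining any wrapped sequence lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_and_trim(in_paths, out_path, length):
    """Write all reads as single-line fasta, each trimmed to `length` bp.

    Reads shorter than `length` are dropped (an assumption, see above).
    Returns a (kept, dropped) tuple of read counts.
    """
    kept = dropped = 0
    with open(out_path, "w") as out:
        for path in in_paths:
            for header, seq in read_fasta(path):
                if len(seq) < length:
                    dropped += 1
                    continue
                out.write(f">{header}\n{seq[:length]}\n")
                kept += 1
    return kept, dropped
```

For example, merge_and_trim(["lane1.fa", "lane2.fa"], "illuminaseqs.fa", 36) would combine two hypothetical input files into one file of 36 bp reads; the length you pass here is the value to use for the VCAKE -k parameter.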
"longreads" refers to any set of sequences that are not paired ends. These could be 454 long
reads, Sanger sequences, etc. Accepted formats are .fna (60bp per line fasta used by
NEWBLER) with or without associated .qual data (simply make sure the .qual file is named
the same as the .fna file and is in the same directory), and .sff (a NEWBLER-specific format
often provided by Roche that includes both quality and sequence data). You can provide
multiple longreads datasets.
"pairedends" refers to paired end sequences to be used by NEWBLER with the -p parameter.
For 454 paired ends, use the .sff file provided by Roche. For Sanger paired ends, please see
NEWBLER documentation for proper formatting to be accepted by NEWBLER. You can provide
multiple paired ends datasets.
EXAMPLE:
illumina: /home/data/illuminaseqs.fa
longreads: /home/data/454/454seqs.fna
longreads: /home/data/sangerreads.fna
pairedends: /home/data/454/454pairedends.sff
2) Setting appropriate paths
You must also define an output directory, and the path to the folder containing the pipeline
scripts. e.g.:
scripts: /home/AssemblyPipeline/pipelinescripts/
output: /home/asssemblyout/
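As a quick sanity check before a run, the labelled lines shown in the examples above can be read back and counted. The sketch below is a hypothetical Python helper, not the parser used by AssemblyPipeline_5.0.pl. It assumes each labelled line has the form "label: path", matches labels case-insensitively, and ignores lines starting with #.

```python
LABELS = ("illumina", "longreads", "pairedends", "scripts", "output")


def read_labeled_paths(lines):
    """Collect the paths given for each known label in a parameters file."""
    found = {label: [] for label in LABELS}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment/section lines
        label, sep, value = stripped.partition(":")
        key = label.strip().lower()  # case-insensitive match is an assumption
        if sep and key in found and value.strip():
            found[key].append(value.strip())
    return found


def check_labeled_paths(found):
    """Return a list of problems; an empty list means the counts look right.

    Assumes exactly one illumina, scripts, and output line is expected,
    based on the instructions in this README.
    """
    problems = []
    for label in ("illumina", "scripts", "output"):
        count = len(found[label])
        if count != 1:
            problems.append(f"expected one '{label}:' line, found {count}")
    return problems
```

A typical use would be check_labeled_paths(read_labeled_paths(open("Parameters.txt"))), printing any problems before launching the pipeline.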
3) Setting VCAKE parameters
You can adjust VCAKE assembly parameters. The parameters you can alter are the following:
-s pisitest (the prefix for all output files - include even if you skip VCAKE assembly.)
-c 0.6 (ratio of the most represented base required to extend assembly)
-k 36 (Length of Illumina reads)
-n 19 (bp of overlap for read to extend assembly)
-t 3 (Under this level of coverage, use -m parameter rather than -n for required overlap)
-m 17 (bp of overlap needed to extend assembly if -t parameter is triggered)
-o 75 (output contigs this size or larger)
-v 3 (assembly will halt if there is more than this many of the 2nd most common base)
-x 500 (assembly will halt at this level of coverage or higher)
It is essential that you set -k to the actual read length of your Illumina reads or assembly will
fail. Set -x to 2-3 times the expected average coverage for the initial run (you can estimate
this by dividing the total number of base pairs in your Illumina data by your expected genome
size).
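The coverage estimate and the resulting range for -x can be worked through with a few lines of Python. This is only a sketch of the arithmetic described above; the read count, read length, and genome size are made-up illustrative numbers, and the function names are hypothetical.

```python
def estimate_coverage(total_bases, genome_size):
    """Average coverage = total sequenced bases / expected genome size."""
    return total_bases / genome_size


def suggested_x_range(coverage):
    """Return (low, high) values for -x at 2x and 3x the average coverage."""
    return round(2 * coverage), round(3 * coverage)


# Illustrative numbers only: 6 million reads of 36 bp, 6 Mb genome.
reads = 6_000_000
read_length = 36
genome_size = 6_000_000

coverage = estimate_coverage(reads * read_length, genome_size)
print(coverage)                     # 36.0
print(suggested_x_range(coverage))  # (72, 108)
```

With these example numbers, a first run would use a -x value somewhere between 72 and 108.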
If you wish to skip the VCAKE assembly process, you may uncomment this line as shown:
SkipVCAKE <- to skip VCAKE assembly, please uncomment this line
4) Setting NEWBLER parameters
You can adjust NEWBLER assembly parameters. The script NewblerParams.pl will reformat
these into the .xml format read by the NEWBLER assembler. Please see NEWBLER
documentation for details.
SeedStep 12
SeedLength 16
Hit 1000
Position 200
MatchLength 40
MatchID 90
IdentScore 2
DiffScore -3
Unique 12
aceMode Auto
AlignMode None
ContDepth 1
AllThresh 100
LargeThresh 100
ShowVariations false
PairMax 5000
5) Running the assembly
Simply run the script AssemblyPipeline_5.0.pl with perl, inputting your edited parameters file:
$ perl AssemblyPipeline_5.0.pl Parameters.txt
Provided all inputs are correct, the assembly should proceed. Be sure not to change any other
parts of the parameters file. In particular, do not alter these lines:
##VCake Parameters
##Newbler Parameters:
6) Outputs of the assembly:
Final assembly. Prefix will be whatever you set -s to above (see section 3 above).
e.g. pisitest_AllScaffolds.final.fa
Vertical alignments of Illumina reads to each scaffold (in directory VerticalAlignments)
List of putative repeat regions
VCAKE stats files - 1 for each run:
IV) Custom Assemblies
As mentioned briefly above, assembly can be performed with or without a VCAKE assembly of the
Illumina data. To skip VCAKE, uncomment this line in the parameters.txt file, changing:
#SkipVCAKE <- to skip VCAKE assembly, please uncomment this line
to:
SkipVCAKE <- to skip VCAKE assembly, please uncomment this line
For example, if you wish to preassemble your reads with another short-read assembler, you can do so by
inputting the contigs from that assembly into the pipeline as "longreads:".
Make sure to preformat the contigs to fit Newbler's specifications (60 bp per line, no contigs longer than
2000 bp). The script FormatFasta_fna_seqonly.pl included in the pipelinescripts folder can be used to format
a standard 1-line fasta into .fna format.
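To illustrate the two constraints just mentioned (60 bp per line, nothing over 2000 bp), here is a small Python sketch. It is not a replacement for FormatFasta_fna_seqonly.pl and does not claim to reproduce its behavior. The function names are hypothetical, and reporting oversized contigs instead of splitting them is an assumption, since this README does not say how they should be handled.

```python
def wrap_sequence(seq, width=60):
    """Split a sequence into lines of at most `width` bases."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def to_fna(records, width=60, max_len=2000):
    """Format (header, sequence) pairs as wrapped .fna text.

    Contigs longer than `max_len` are left out of the text and their
    headers are returned separately so you can decide what to do with them.
    Returns (fna_text, oversized_headers).
    """
    lines, oversized = [], []
    for header, seq in records:
        if len(seq) > max_len:
            oversized.append(header)
            continue
        lines.append(">" + header)
        lines.extend(wrap_sequence(seq, width))
    text = "\n".join(lines) + "\n" if lines else ""
    return text, oversized
```

Calling to_fna on your contigs gives the text to write to a .fna file plus a list of contig names that exceed the length limit.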
The pipeline will still use (and in fact requires) Illumina reads to correct errors in the NEWBLER assembly.
In particular, small 1-2 bp indels common to 454 sequence data are easily corrected using Illumina reads
with the SOAP alignment.