The Hybrid Assembly Pipeline automatically assembles bacterial genomes using combinations of 
short read sequence data.  This approach was used in the article
"De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas 
syringae pv. oryzae"
published in Genome Research in February 2009 (http://genome.cshlp.org/content/19/2/294.long).

This article showed that a bacterial genome could be assembled de novo to an N50 scaffold size of
over 100kb using a single lane of Illumina reads, 1/4 plate of 454 long reads, and 1/4 plate of 454 
paired ends.

This readme outlines how to use the hybrid assembly pipeline to de novo assemble multiple
sequence types.  This readme makes repeated reference to the parameters.txt file - control of 
assemblies occurs entirely through this file.  

The readme is in four sections:

I) Requirements
II) Installation
III) Basic assemblies using VCAKE
IV) Custom assemblies

I) Requirements:
        1) Hardware requirements:
        5-10 GB of RAM is required to run VCAKE on one typical "lane" of Illumina data.
        
        2) Software requirements (all must be in your path except for the pipeline scripts)
                NEWBLER assembler (available from Roche biotechnology: )
                BLAT (http://genome.ucsc.edu/FAQ/FAQblat)
                BLASTALL (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/)
                SOAPv1 (http://soap.genomics.org.cn/soap1/)
                perl (http://www.perl.org/)
                VCAKE (www.sourceforge.net/projects/vcake/)
                Pipeline scripts - included in this package
		soapsortnum and soapsorttext - included in this package

        3) Data requirements
                Illumina data (or any short reads, <100 bp), used for de novo assembly with VCAKE and
                        error correction with SOAP and the pipeline scripts
                Any combination of long reads and paired-end sequences that are compatible with NEWBLER
                (please see NEWBLER documentation for instructions on formatting data for NEWBLER).
  
II) Installation
        Untar the assembly pipeline and move it to wherever is most convenient for you:
        
        $ tar -xf HybridAssemblyPipeline.tar
        
        The wrapper script AssemblyPipeline_5.0.pl should be in the main directory along with a sample
        Parameters.txt file, this README, and a directory containing all the pipeline scripts.  Also included
        are two C programs, soapsorttext.c and soapsortnum.c.  To compile these programs:
        
        $ gcc -o soapsortnum soapsortnum.c
        $ gcc -o soapsorttext soapsorttext.c
        
        Then, you must put these programs in your PATH for assembly to proceed.
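
        Before starting an assembly, it can help to confirm that the required programs are
        actually visible in your PATH.  The sketch below is not part of the pipeline.  The
        executable names for NEWBLER, BLAT, BLASTALL and SOAP are assumptions and may differ
        on your system; VCAKE is left out because its executable name is not given in this
        readme.

```python
import shutil

# soapsortnum and soapsorttext are the names produced by the gcc commands above.
# The other names (runAssembly, blat, blastall, soap) are ASSUMED executable
# names and should be adjusted to match your installation.
ASSUMED_EXECUTABLES = [
    "runAssembly",   # assumed name of the NEWBLER assembler command
    "blat",
    "blastall",
    "soap",          # assumed name of the SOAPv1 binary
    "perl",
    "soapsortnum",
    "soapsorttext",
]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found in PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found in PATH.")
```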
              
III) Basic Assemblies using VCAKE

        You should start by making a copy of the Parameters_default.txt file named appropriately for 
        your assembly.  You will be editing this file for your own data.
        
        1) Inputting sequence data
                Each sequence data file is input as a single line in the Parameters.txt file with the appropriate 
                label.  You can label sequences as "illumina:", "longreads:", or "pairedends:".
                
                "illumina" can be any fasta file containing sequences shorter than 50 bp, all of the same length.
                If you have more than one Illumina sequence file, concatenate them all into a single file
                before inputting them, and trim all sequences so they are the same length.  The format must be
                single-line fasta, e.g.
                >header1
                ATCGACGACGAGCAGCACGT
                >header2
                GAGGAGAGAGAGAGAGAGAG
                ...
                "longreads" refers to any set of sequences that are not paired ends.  These could be 454 long
                reads, Sanger sequences, etc.  Accepted formats are .fna (60bp per line fasta used by 
                NEWBLER) with or without associated .qual data (simply make sure the .qual file is named
                the same as the .fna file and is in the same directory), and .sff (a NEWBLER-specific format 
                often provided by Roche that includes both quality and sequence data).  You can provide
                multiple longreads datasets.

                "pairedends" refers to paired end sequences to be used by NEWBLER with the -p parameter.  
                For 454 paired ends, use the .sff file provided by Roche.  For Sanger paired ends, please see
                NEWBLER documentation for proper formatting to be accepted by NEWBLER.  You can provide
                multiple paired ends datasets.

                EXAMPLE:
                illumina: /home/data/illuminaseqs.fa
                longreads: /home/data/454/454seqs.fna
                longreads: /home/data/sangerreads.fna
                pairedends: /home/data/454/454pairedends.sff
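
                The Illumina input described above (single-line fasta, all reads the same length)
                can be prepared in many ways.  The following is a minimal sketch, not one of the
                pipeline scripts.  It assumes that reads are trimmed by keeping their first bases
                and that reads shorter than the target length are dropped; both choices are
                assumptions, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines; sequences may be wrapped."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def uniform_single_line_fasta(records, length):
    """Return single-line FASTA lines with every read trimmed to `length` bases.

    Assumption: reads are trimmed by keeping the first `length` bases, and
    reads shorter than `length` are dropped.
    """
    out = []
    for header, seq in records:
        if len(seq) >= length:
            out.append(">" + header)
            out.append(seq[:length])
    return out


if __name__ == "__main__":
    sample = [">header1", "ATCGACGACG", "AGCAGCACGT",
              ">header2", "GAGGAGAGAGAGAGAG",
              ">header3", "ACGT"]
    for line in uniform_single_line_fasta(read_fasta(sample), 16):
        print(line)
```

                To combine several files, pass the concatenated lines of all of them to
                read_fasta and write the result to the single file named on the "illumina:" line.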
                        
        2) Setting appropriate paths                
                You must also define an output directory, and the path to the folder containing the pipeline 
                scripts.  e.g.:
                scripts: /home/AssemblyPipeline/pipelinescripts/
                output: /home/asssemblyout/
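
                A quick sanity check of these labelled lines can catch mistakes before a long
                run.  The sketch below is not part of the pipeline and does not reproduce how
                AssemblyPipeline_5.0.pl reads the file.  It assumes that each entry has the form
                "label: path" and that labels can be matched case-insensitively.

```python
# Labels described in this readme.
LABELS = ("illumina", "longreads", "pairedends", "scripts", "output")


def collect_labelled_paths(lines):
    """Map each known label to the list of paths given for it.

    Assumption: entries look like "label: path"; all other lines are ignored.
    """
    found = {label: [] for label in LABELS}
    for line in lines:
        label, sep, value = line.strip().partition(":")
        key = label.strip().lower()
        if sep and key in found and value.strip():
            found[key].append(value.strip())
    return found


def check_entries(found):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    # The readme asks for all Illumina reads in a single file, and for
    # one scripts directory and one output directory.
    for label in ("illumina", "scripts", "output"):
        if len(found[label]) != 1:
            problems.append(
                "expected exactly one '%s:' line, found %d" % (label, len(found[label]))
            )
    return problems


if __name__ == "__main__":
    example = [
        "illumina: /home/data/illuminaseqs.fa",
        "longreads: /home/data/454/454seqs.fna",
        "pairedends: /home/data/454/454pairedends.sff",
        "scripts: /home/AssemblyPipeline/pipelinescripts/",
        "output: /home/asssemblyout/",
    ]
    print(check_entries(collect_labelled_paths(example)))
```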

        3) Setting VCAKE parameters
                You can adjust VCAKE assembly parameters. The parameters you can alter are the following:   
                                
                -s pisitest	(the prefix for all output files - include even if you skip VCAKE assembly.)
                -c 0.6 		(ratio of the most represented base required to extend assembly)
                -k 36 		(Length of Illumina reads)
                -n 19  		(bp of overlap for read to extend assembly)
                -t 3 		(Under this level of coverage, use -m parameter rather than -n for required overlap)
                -m 17 		(bp of overlap needed to extend assembly if -t parameter is triggered)
                -o 75 		(output contigs this size or larger)
                -v 3 		(assembly will halt if there is more than this many of the 2nd most common base)
                -x 500 		(assembly will halt at this level of coverage or higher)
                
                It is essential that you set -k to the actual read length of your Illumina reads, or assembly will
                fail.  Set -x to 2-3 times the expected average coverage for the initial run (you can estimate
                this by dividing the total number of base pairs in your Illumina data by your expected genome
                size).
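
                The coverage estimate and the resulting range for -x can be worked out as in the
                sketch below.  The read count and genome size in the example are hypothetical
                numbers chosen only to illustrate the arithmetic; the function names are also
                hypothetical.

```python
def estimate_coverage(num_reads, read_length, genome_size):
    """Average coverage = total base pairs in the reads / expected genome size."""
    return num_reads * read_length / genome_size


def suggested_x_range(coverage):
    """Return (low, high) values for -x, i.e. 2 and 3 times the coverage."""
    return (round(2 * coverage), round(3 * coverage))


if __name__ == "__main__":
    # Hypothetical example: 5 million 36 bp reads, 6 Mb expected genome size.
    coverage = estimate_coverage(5_000_000, 36, 6_000_000)
    low, high = suggested_x_range(coverage)
    print("estimated coverage: %.1fx" % coverage)
    print("try -x between %d and %d" % (low, high))
```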
                
                If you wish to skip the VCAKE assembly process, you may uncomment this line as shown:
                SkipVCAKE <- to skip VCAKE assembly, please uncomment this line

        4) Setting NEWBLER parameters
                You can adjust NEWBLER assembly parameters.  The script NewblerParams.pl will reformat
                these into the .xml format read by the NEWBLER assembler.  Please see NEWBLER 
                documentation for details.
		
                SeedStep 12  
                SeedLength 16
                Hit 1000
                Position 200
                MatchLength 40
                MatchID 90
                IdentScore 2
                DiffScore -3
                Unique 12
                aceMode Auto
                AlignMode None
                ContDepth 1
                AllThresh 100
                LargeThresh 100
                ShowVariations false
                PairMax 5000

        5) Running the assembly
                Simply run the script AssemblyPipeline_5.0.pl with perl, inputting your edited parameters file:
                $ perl AssemblyPipeline_5.0.pl Parameters.txt
                Provided all inputs are correct, the assembly should proceed.  Be sure not to change any other 
                parts of the parameters file.  In particular, do not alter these lines:
                       ##VCake Parameters
                       ##Newbler Parameters:

        6) Outputs of the assembly
                Final assembly.  Prefix will be whatever you set -s to above (see section 3 above).
		e.g. pisitest_AllScaffolds.final.fa
		Vertical alignments of Illumina reads to each scaffold (in directory VerticalAlignments)
		List of putative repeat regions 
		VCAKE stats files - 1 for each run:		
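
        To summarise the final assembly with the N50 statistic quoted at the top of this
        readme, a sketch like the following can be run on the final fasta file.  It is not
        one of the pipeline outputs or scripts.  N50 is taken here to be the length of the
        shortest sequence in the smallest set of longest sequences that together contain at
        least half of all bases.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in a FASTA file given as lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)
        elif line and lengths:
            lengths[-1] += len(line)
    return lengths


def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    import sys
    # Usage: python n50.py pisitest_AllScaffolds.final.fa
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            print(n50(fasta_lengths(handle)))
```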

IV) Custom Assemblies
	As mentioned briefly above, assembly can be performed with or without performing a VCAKE assembly of the
	Illumina data.  To skip the VCAKE assembly, change this line in the parameters.txt file from:

#SkipVCAKE <- to skip VCAKE assembly, please uncomment this line
to:
SkipVCAKE <- to skip VCAKE assembly, please uncomment this line

	For example, if you wish to preassemble your reads with another short reads assembler, you can do so by 
	inputting the contigs from this assembly into the pipeline as "longreads:"
	
	Make sure to preformat the contigs to fit Newbler's specifications (60 bp per line, no contigs longer than
	2000bp).  The script FormatFasta_fna_seqonly.pl included in the pipelinescripts folder can be used to format
	a standard 1-line fasta into .fna format.
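
	For illustration, the formatting constraints above (60 bp per line, no contig longer than 2000 bp)
	can be sketched as follows.  This is not a reimplementation of FormatFasta_fna_seqonly.pl, whose
	exact behaviour is not described here.  In particular, this sketch only reports contigs that are
	too long rather than deciding how to split them; the function names are hypothetical.

```python
def wrap_sequence(seq, width=60):
    """Split a sequence into consecutive chunks of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def to_fna_lines(records, width=60, max_len=2000):
    """Convert (header, sequence) pairs into wrapped .fna lines.

    Returns (fna_lines, too_long), where too_long lists the headers of
    contigs longer than `max_len`; those contigs are left out of fna_lines.
    """
    fna_lines, too_long = [], []
    for header, seq in records:
        if len(seq) > max_len:
            too_long.append(header)
            continue
        fna_lines.append(">" + header)
        fna_lines.extend(wrap_sequence(seq, width))
    return fna_lines, too_long


if __name__ == "__main__":
    contigs = [("contig1", "A" * 130), ("contig2", "C" * 2001)]
    lines, skipped = to_fna_lines(contigs)
    print("\n".join(lines))
    print("longer than 2000 bp:", skipped)
```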

	The pipeline will still use (and in fact requires) illumina reads, to correct errors in the NEWBLER assembly.
	In particular, small 1-2 bp indels common to 454 sequence data are easily corrected using illumina reads
	with the SOAP alignment.
Source: README.txt, updated 2009-03-18