1. Summary
  2. Files
  3. Support
  4. Report Spam
  5. Create account
  6. Log in

PacBioToCA

From wgs-assembler

Jump to: navigation, search

Hybrid error correction and de novo assembly of single-molecule sequencing reads

Sergey Koren, Michael Schatz, Brian Walenz, Jeffrey Martin, Jason Howard, Ganeshkumar Ganapathy, Zhong Wang, David A. Rasko, W. Richard McCombie, Erich D. Jarvis, and Adam M. Phillippy,

About PacBio RS

Contents

The PacBio RS sequencing instrument from Pacific Biosciences is a single-molecule sequencer that is able to generate long-reads (CLR). Unfortunately, these sequences have a low accuracy, averaging only 83-86%. While It is possible to read each circularized molecule multiple times to get higher accuracy sequences (CCS), this produces significantly shorter sequences.

About pacBioToCA

The pacBioToCA script is a correction pipeline to enable the use of the long-read sequences produced by the PacBio RS instrument. To algorithmically deal with the error, we support either alternate high-identity sequences (454, Illumina, or PacBio circularized sequences) or solely C2 or newer sequencing to re-call accurate consensus. Celera Assembler includes a pipeline, pacBioToCA, to generate FRG File (as well as fasta and qual) from PacBio RS sequences given short-read high-identity data or C2 CLR data to correct with.

The PacBio corrected sequences (PBcR) are output in fasta/qual formats as well as a Celera Assembler FRG file. If you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results.

If you are having trouble, please check the known issues and known Celera Assembler issues to see if your problem is documented. If you are using sequences shorter than 100bp for correction, please check the required parameters section instruction on how to input them.

Announcements

  • April 4, 2013 - Updated documentation to describe recommended Celera Assembler parameters and Quiver polishing of consensus for assembly of corrected sequences. Please see assembly of corrected sequences for more information.
  • March 19, 2013 - Updated documentation to describe new option of self-correction. pacBioToCA can correct PacBio RS C2 data (or newer) using solely long-read low-accuracy data. Please see self-correction for more information.
  • March 18, 2013 - Updated documentation to describe new support for both BLASR or Bowtie 2 for alignment (in addition to CA's overlapper).
  • August 9, 2012 - An NGS Leaders webinar by Adam Phillippy and Sergey Koren is now available at YouTube.
  • July 1, 2012 - The paper on correction and assembly is now available as an advance online publication from Nature Biotechnology.
  • June 20, 2012 - New known issue: using Illumina sequences shorter than 100bp for correction requires adjusted parameters. Please see the known issues for the required parameters.
  • May 4, 2012 - New known issue: PacBio SMRT-portal 1.3.0 outputs fastq files in a non-standard format. If you encounter errors running the pipeline or have no corrected sequences output, please see the known issues.

Usage

The pacBioToCA pipeline is available in CA 7.0, which is available either as source or pre-compiled for 64-bit Linux.

If you are interested in using the new self-correction feature, you must follow the instructions to build Celera Assembler from source following the compilation instructions. Be sure to instal both the kmer package and samtools when building and adjust the AS_READ_MAX_NORMAL_LEN_BITS to 15.

usage:pacBioToCA [options] -s spec.file -fastq fastqfile <frg>
  -length        Minimum length of PacBio RS fragment to keep.
  -partitions    Number of partitions for consensus
  -l libraryname Name of the library; freeformat text.
  -t threads     Number of threads to use for correction.
  -noclean       Do not clean up after the correction finishes. By default, the temp<libraryname> directory will be removed on successful completion of the correction.
  -sge           Optional parameters to pass through to qsub for SGE. If your spec file specifies an sge line, it will be used by default.
  -sgeCorrection Optional parameters to pass through when running the correction step to qsub for SGE. If your spec file specifies an sgeScript line, it will be used by default.

Writes a CA FRG file consisting of a single LIB message to <libraryname>.frg. Also stores fasta and qual sequences (named <libraryname>.fasta and <libraryname>.qual).

The options (-length, -partitions, and -threads, -noclean, -sge, and -sgeCorrection) are optional.

The -libraryname option is mandatory. It provides a UID name for the library that the reads are placed into. This name is completely up to the end user, but must contain no white spaces or commas.

The -s pacbio.spec option is mandatory. It should point to a a CA spec file to specify how to overlap the illumina and PacBio data for correction (which comes included with the sample data.) Two templates are provided for SGE and high-memory multi-core environment.

The -t X option specifies that the correction program itself is run using X threads. It should be set to the number of processors available wherever this pipeline runs.

The -fastq specifies the pacbio fragments to correct. The correction will also work with unfiltered reads if you don't have a filtered fastq file.

You can specify an arbitrary number of frg files to be used as the trusted sequences for correction. They can be any format CA accepts. Please see the FastqToCA and SffToCA pages for information on inputting Illumina or 454 data. Generally, you don't need more than about 50X of high-identity sequences for correction.

Other Requirements

The pipeline also requires some files from the AMOS project (amos.sf.net). If you have SMRTportal in your path, you will have AMOS in your path and do not need to do anything. Otherwise, any release (such as 3.0.0 onward) should work. By default, pacBioToCA will first look for AMOS to be installed in the same directory as CA. That is if you have a directory called /work/CA/ with executables under Linux-amd64/bin, pacBioToCA will check for /work/AMOS/bin. Next, it will check the path for AMOS executables. If AMOS is not found, the pipeline will stop and report an error.

 If you are using self-correction, SMRTportal must also be in your path.

Inputting PacBio RS Sequences

The pacBioToCA pipeline expects PacBio RS sequences in fastq format (with sanger (PHRED32) quality values). The pbh5tools package from Pacific Biosciences converts h5 files to fasta or fastq files. The currently recommended parameters for correction are quality = 0.75 and minimum length = 100.

If you already have fasta files, you can download a Java utility to convert the fasta files into pacBioToCA compatible fastq files:

tar xvzf convertFastaAndQualToFastq.tar.gz
convertFastaAndQualToFastq.class
convertFastaAndQualToFastq.java
Utils$Pair.class
Utils$ToProtein.class
Utils$Translate.class
Utils.class
Utils.java


java convertFastaAndQualToFastq myfile.fasta > myfile.fastq

NOTE: If you are using the SMRT-portal version 1.3, please see the known issues if you encounter errors or no corrected sequences are output.

Self-Correction With C2 Sequences (or newer)

Starting with the C2 chemistry (and the more recent C2-XL and XL-XL) chemistries, you can self-correct PacBio RS sequences without any high-identity technology. Using this pipeline requires the latest CA built from source including SAMtools. You must also have SMRTportal installed and available in your path. SMRTportal is freely available from the PacBio RS DevNet website. Approximately 100X (or more) of C2 sequencing is recommended. If you have less PacBio RS sequencing than this, then follow the instructions for correction with complementary high-identity sequences.

To run the pipeline, follow the usage instructions below.

Self-correction Example with E. coli Reads

This example is applicable for sufficient coverage of C2 or newer sequencing chemistry sequences. In this case, self-correction can be used, eliminating the requirement for complementary high-accuracy short-read sequences. Download the sample data from http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz

tar xvzf selfSampleData.tar.gz
x selfSampleData/
x selfSampleData/asm.spec
x selfSampleData/ecoli.fasta
x selfSampleData/pacbio_filtered.fastq.bz2
x selfSampleData/pacbio.spec
x selfSampleData/README
x selfSampleData/asm.SGE.spec
x selfSampleData/pacbio.SGE.spec

ls selfSampleData/
README                         instructions on running the correction
asm.spec                       CA spec file for assembly. Assumes a high-memory multi-core machine. 
asm.SGE.spec                   CA spec file for assembly. Assumes an SGE environment.
ecoli.fasta                    The reference E. coli K-12 MG1655 sequence
pacbio_filtered.fastq.bz2      The PacBio RS sequences (with high-error) after filtering through the secondary pipeline (100X coverage).
pacbio.spec                    CA spec file to be used for correction. Assumes a high-memory multi-core machine.
pacbio.SGE.spec                CA spec file to be used for correction. Assumes an SGE environment.

Correct PacBio RS single-pass sequences

Run the correction to generate corrected pacbio sequences.

<wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio_filtered.fastq.bz2 longReads=1 genomeSize=4650000 > run.out 2>&1

Assemble corrected sequences

After the correction pipeline runs, you will have a pacbio.frg file in the selfSampleData directory. There is also be pacbio.fasta and a pacbio.qual files. You can use the pacbio.frg file for a CA assembly. Note, if you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results. To select the longest 25X, given a genome size of 4.65Mbp, we need 116Mbp in PBcR sequences.

<wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg
<wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > pacbio.25X.frg

Finally, assemble the longest 25X of corrected data

<wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.25X.frg > asm.out 2>&1

Substitute your CA location for <wgs> and your machine type for <Linux-amd64>. After this you should have an assembled ecoli in asm/9-terminator/asm.ctg.fasta.

head -n 50 asm/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=4641790
MeanBasesInScaffolds=4641790
MinBasesInScaffolds=4641790
MaxBasesInScaffolds=4641790
N25ScaffoldBases=4641790
N50ScaffoldBases=4641790
N75ScaffoldBases=4641790
ScaffoldAt1000000=4641790

TotalSpanOfScaffolds=4641790
MeanSpanOfScaffolds=4641790
MinScaffoldSpan=4641790
MaxScaffoldSpan=4641790
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=4641790
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,4641790,4641790,4641790,0,7180000000002
total=1,4641790,4641790,4641790,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=4641790
TotalVarRecords=16815
MeanContigLength=4641790
MinContigLength=4641790
MaxContigLength=4641790
N25ContigBases=4641790
N50ContigBases=4641790
N75ContigBases=4641790
ContigAt1000000=4641790

[BigContigs_greater_10000]
TotalBigContigs=1
BigContigLength=4641790
MeanBigContigLength=4641790
MinBigContig=4641790
MaxBigContig=4641790
BigContigsPercentBases=100.00

The assembly is comprised of one long contig (the full E. coli chromosome). We can check the number of sequences within each contig:

cat asm/9-terminator/asm.posmap.frgctg |awk '{print $2}' |sort |uniq -c
  22548 7180000000001

The first column gives the sequence count in each contig. Generally, contigs comprised of a few sequences (less than 50) can safely be ignored in downstream analysis.

Correction Using Complementary High-Identity Sequences

Inputting High-Identity Sequences

The below example is applicable if you have less than 50X coverage of PacBio RS data or you have C1 or pre-release C2 chemistry. In this case, high-identity sequences are required to correct PacBio RS sequence. You can correct the PacBio RS sequences using any high-identity next-generation technology. This include Illumina, 454, and PacBio CCS sequences. The pacBioToCA pipeline accepts an arbitrary number of frg files at the end of the command.

  • For Illumina data, use the fastqToCA command. For examples, see the phage example below and the usage page.
  • For 454 data, use sffToCA.
  • For PacBio RS CCS sequences, use the convert-fasta-to-v2 command. CCS sequences must be input as FastaA, not FastQ. For more information, see the detailed usage. To convert a CCS fasta file named ccs.fasta (with corresponding qualities in ccs.qual) to a frg file run:
convert-fasta-to-v2.pl -pacbio -s ccs.fna -q ccs.qual -l pacBioCCS > css.frg

If you CCS data is in fastq format, you can download a Java utility. To convert a fastq file run:

java convertFastqToFastaAndQual ccs.fastq ccs.fna ccs.qual

then convert the fasta/qual files as above.

Download the phage sample data available http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz

tar xvzf sampleData.tar.gz
x sampleData/
x sampleData/asm.spec
x sampleData/illumina.fastq
x sampleData/lambda.fasta
x sampleData/pacbio.filtered_subreads.fasta
x sampleData/pacbio.spec
x sampleData/README
x sampleData/asm.SGE.spec
x sampleData/pacbio.SGE.spec

ls sampleData/
README                         instructions on running the correction
asm.spec                       CA spec file for assembly. Assumes a high-memory multi-core machine. 
asm.SGE.spec                   CA spec file for assembly. Assumes an SGE environment.
illumina.fastq                 Illumina unpaired sequences to be used for correction
lambda.fasta                   The reference phage sequence
pacbio.filtered_subreads.fasta The PacBio RS sequences (with high-error) after filtering through the secondary pipeline.
pacbio.spec                    CA spec file to be used for correction. Assumes a high-memory multi-core machine.
pacbio.SGE.spec                CA spec file to be used for correction. Assumes an SGE environment.

Processing complementary data

First, convert illumina.fastq to CA frg format:

cd sampleData/
<wgs>/<Linux-amd64>/bin/fastqToCA -libraryname illumina -technology illumina -type sanger -innie -reads illumina.fastq > illumina.frg

Correct PacBio RS single-pass sequences

Next, run the correction to generate corrected pacbio sequences. This example uses the Java utility above to convert the PacBio RS fasta file to fastq.

java  convertFastaAndQualToFastq pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fastq
<wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1

Assemble corrected sequences

After the correction pipeline runs, you will have a pacbio.frg file in the sampleData directory. There is also be pacbio.fasta and a pacbio.qual files. You can use the pacbio.frg file for a CA assembly. Note, if you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results.

To subset the longest 25X of sequences (only supported when CA is built from source). Given a genome size of 50Kbp, we need 1Mbp in PBcR sequences.

<wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore  pacbio.frg 
<wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 1000000 asm.gkpStore > pacbio.25X.frg

Assemble the data (if you are using the longest 25X)

<wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.25X.frg > asm.out 2>&1

or

<wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.frg > asm.out 2>&1

otherwise.

Substitute your CA location for <wgs> and your machine type for <Linux-amd64>. After this you should have an assembled phage in asm/9-terminator/asm.ctg.fasta.

head -n 40 asm/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=48452
MeanBasesInScaffolds=48452
MinBasesInScaffolds=48452
MaxBasesInScaffolds=48452
N25ScaffoldBases=48452
N50ScaffoldBases=48452
N75ScaffoldBases=48452

TotalSpanOfScaffolds=48452
MeanSpanOfScaffolds=48452
MinScaffoldSpan=48452
MaxScaffoldSpan=48452
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=48452
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,48452,48452,48452,0,7180000000002
total=1,48452,48452,48452,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=48452
TotalVarRecords=11
MeanContigLength=48452
MinContigLength=48452
MaxContigLength=48452
N25ContigBases=48452
N50ContigBases=48452
N75ContigBases=48452

[BigContigs_greater_10000]

Illumina-only assembly comparison

For comparison, you can assemble the Illumina data alone:

<wgs>/<Linux-amd64>/bin/runCA -p asm -d illuminaASM -s asm.spec illumina.frg > asm.out 2>&1
head -n 40 illuminaASM/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=3
TotalContigsInScaffolds=3
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=48635
MeanBasesInScaffolds=16212
MinBasesInScaffolds=6606
MaxBasesInScaffolds=27998
N25ScaffoldBases=27998
N50ScaffoldBases=27998
N75ScaffoldBases=14031

TotalSpanOfScaffolds=48635
MeanSpanOfScaffolds=16212
MinScaffoldSpan=6606
MaxScaffoldSpan=27998
IntraScaffoldGaps=0
2KbScaffolds=3
2KbScaffoldSpan=48635
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,27998,27998,27998,0,7180000000010
1=1,14031,14031,14031,0,7180000000011
2=1,6606,6606,6606,0,7180000000012
total=3,48635,48635,16212,0

[Contigs]
TotalContigsInScaffolds=3
TotalBasesInScaffolds=48635
TotalVarRecords=0
MeanContigLength=16212
MinContigLength=6606
MaxContigLength=27998
N25ContigBases=27998
N50ContigBases=27998
N75ContigBases=14031

Assembly of Corrected Sequences

Quality Trimming

By default, pacBioToCA will trim the corrected sequences to a quality value of 54.5. In case there is low coverage of the genome or the sequences are short (especially when using self-correction), it may be beneficial to try assembling with alternate QV trim values. Assuming pacBioToCA was run with library name <libraryname>, the following command allows custom QV trimming

<wgs>/<Linux-amd64>/bin/trimFastqByQVWindow temp<libraryname>/corrected.fasta temp<libraryname>/corrected.qual trimmedQV.fasta trimmedQV.qual <QV value> > trimmedQV.fastq
<wgs>/<Linux-amd64>/bin/fastqToCA -libraryname trimmedQV -technology pacbio -type sanger -reads trimmedQV.fastq > trimmedQV.frg

The resulting trimmedQV.frg file can now be used for assembly. If the assembly using the default setting is poor quality, the recommended alternate values to to try are 52.5, 56.5, and 59.5.

Celera Assembler Parameters

Celera Assembler, like many OLC assemblers, is sensitive to high coverage and noise, especially near the ends of sequences. To remove some of this noise, we recommend using the longest 25X of sequence to assemble. Given an E. coli K12 genome (4.65Mbp genome size), 25X coverage is approximately 116MBp. The commands below will subset an frg file to only 25X of the longest sequences.

<wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg
<wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > pacbio.25X.frg

Finally, some Celera Assembler parameters can be adjusted to make overlap detection more stringent and eliminate false overlaps when assembly only PacBio corrected long sequences. The <ovl value> parameter should be set to approximately 40% of your average corrected sequence lengths. As a general rule, if the average corrected length is less than 2.5Kbp, set it to 1000, if it is less than 3Kbp, set it to 1500, if it is less than 5.5Kbp, set it to 2000, if it is greater than 5.5Kbp, set it to 2500, and if it is greater than 6.5Kbp, set it to 3000. The currently recommended parameters are

<wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0

When you are combining PacBio corrected sequences with other technologies (454, etc), the alternate parameters are

<wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14 

Polishing Consensus

The final assembly consensus is normally QV45 of higher. It can be improved by post-processing using the Quiver consenus algorithm (the default for the RS_Resequencing protocol in SMRT®Portal 1.4). Import the assembly into the SMRTPortal and run the RS_Resequencing protocol to generate a final polished assembly.

Additional Options

Replacing Celera Assembler Overlapper

Instead of using CA's built-in overlapper, it is possible to use BLASR, Bowtie 2, or an arbitrary aligner to find overlaps between PacBio RS sequences and high-identity sequences when CA is built from source. Note that BLASR is automatically enabled if you are using self-correction.

To enable BLASR add:

"blasr=-advanceHalf -noRefineAlign -ignoreQuality -minMatch 10 -maxLCPLength 15 -minPctIdentity 70.0"

to the spec file or as a parameter to pacBioToCA.

To enable Bowtie 2, specify

"bowtie = -N 0 -L 10 --rfg 2,2 --rdg 2,2 -R 3 -i S,1,1.25"

to the spec file or as a parameter to pacBioToCA.

To use an arbitrary aligner, generate alignments with PacBio RS sequences as the reference and high-identity sequences as the reads in SAM format. Then, given sam files named sam0001.out, sam002.out, sam0003.out,

find -L ./ -name \\*.sam -print | sort > fofn
echo "samFOFN=fofn" >> pacbio.spec

Filling Coverage Gaps With PacBio RS Sequences

In older versions of pacBioToCA, when there was a coverage gap in the short-read coverage, the PacBio RS sequence was split. This caused issues when the short-read data had coverage gaps due to sequencing bias. In the latest version available when CA is built from source, short-reads can be used to recruit PacBio RS sequences to coverage gaps. This option has only been tested on genomes less than 10Mbp in size. The corrected sequences still have a high-quality consensus above 99% accuracy. On a test run of E. coli K-12, adding the option increased the mean corrected sequence length to 4,187 from 2,493 and improved the assembly N50 to 4.65Mbp from 3.32Mbp. To enable this option specify

"maxUncorrectedGap=1500"

in the spec file or as a parameter to pacBioToCA.

Known Issues

Celera Assembler Known Issues

For the latest issues found in the Celera Assembler release, please see version 7.0 known issues page. You can also check the Celera Assembler bugs page for any reported issues.

Correcting Large (> 100Mbp) Genomes

pacBioToCA and Celera Assembler require adjusted parameters for optimal performance on large genomes. The default parameters in the pacbio.spec file are configured to fully utilize a machine with 16 cores and 48GB of RAM. For large genomes (i.e. 1GB), an SGE grid or a high-memory machine (256GB) of RAM are recommended for optimal performance. The recommended parameters for large genomes are:

ovlHashBits = 25
ovlHashBlockLength = 1000000000
ovlRefBlockSize =  100000000
doOverlapBasedTrimming = 0

This will make each overlapper job use approximately 18GB of RAM (up from the default of 6GB of ram). You should set the ovlConcurrency to be:

<total memory> / 18

For example, for a 256GB machine, ovlConcurrency would be 14. The ovlThreads parameter should be set to:

<total cores> / ovlConcurrency

So for a 256GB machine with 32 cores, ovlThreads would be 2 (assuming ovlConcurrency was set to 14 above). When using an SGE grid, the sgeOverlap parameter should also be set to request the appropriate threads and memory:

sgeOverlap = -pe threads 2 -l mem=10GB

For full details on tuning these parameters, please see the Celera Assembler runCA wiki page.

If you already have a correction running, you can tell the number of overlap jobs created by running

wc -l temp<your library name>/0-overlaptrim-overlapper/ovlopt

If the number of overlap jobs is large (over 400), then the parameters above are recommended over the default settings.

Low Number of Output Corrected Sequences

Correcting a combination of multiple PacBio libraries has sometimes resulted in poor throughput, with only a few long corrected PacBio RS reads remaining after correction. The current recommendation is to correct multiple PacBio RS libraries separately and combine the corrected reads for subsequence analysis. Note that is it OK to combine multiple PacBio RS cells run from the same library for correction.

Short-read sequences below 64bp

If you have short sequences for correction (< 64bp), the pipeline will by default not load them and will give a no-overlaps error instead of outputting corrected sequences. In order to use short Illumina sequences, you have to specify the
-shortReads
option to the pipeline. Also, add the following parameters at the end of the pacbio.spec file
frgMinLen = 40 # smaller than your read length
ovlMinLen = 30 # about 75% of your frgMinLen
merSize = 10

This is currently only recommended for small (<10Mbp) genomes as it will increase runtime.

If you have Illumina sequences longer than 64bp but shorter than 100bp, have to specify the
-shortReads
option to the pipeline along with the following parameter to pacbio.spec:
merSize = 10

No Overlaps Found

The PacBio SMRT-portal version 1.3.0 outputs a non-standard fastq file from their filtering pipeline. This causes the pacBioToCA pipeline to either stop with an error or output no corrected sequences. You can check if your fastq file is the cause of the problem if it splits the sequence and quality on multiple lines:

@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG
TCANCCCTGCATTATAAGAAGTACCGCAGGCGACG
ATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC
+
EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFID
HHIGHIHHGHIHIGHGIGHGGIC
@HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG
TAANGGGCTGACAACGCATTACACAAAACTCAAAC
ACAACAACCGTAACCACCGCGGTTCAATGGGACTG
G
+
EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGG
GIEIIHIDHIIHIFHEGGHIF=

versus the expected format of:

@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG
TCANCCCTGCATTATAAGAAGTACCGCAGGCGACGATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC
+
EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFIDHHIGHIHHGHIHIGHGIGHGGIC
@HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG
TAANGGGCTGACAACGCATTACACAAAACTCAAACACAACAACCGTAACCACCGCGGTTCAATGGGACTGG
+
EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGGGIEIIHIDHIIHIFHEGGHIF=

As a workaround, you can use the fasta file and convert it to fastq following the instructions above. Alternatively, you can use a perl script to convert non-standard fastq files to ones pacBioToCA expects. Thank you to Lee Katz for providing the script. To convert a non-standard fastq file named nonstandard.fastq to the correct format run the command:

perl run_assembly_convertMultiFastqToStandard.pl -i nonstandard.fastq -o standard.fastq

Run the script with no parameters to get help.

Error in runCorrection.sh Step

The CA 7.0 release has a bug in the correction pipeline that may sometimes lead to a crash. If your pipeline fails or hangs on the runCorrection.sh step:

----------------------------------------END Tue Jan 17 11:04:28 2012 (1 seconds)
Failed to execute temppacbio/runCorrection.sh

The following guide will let you continue correction without losing any computation already completed:

cd temppacbio
sh runCorrection.sh

The pipeline will recompute the failed step. If it fails again, edit the runCorrection.sh script and replace the line

      -t <16> \

where <16> corresponds to the number of threads you used for correction (-t parameter) with

      -t 1 \

and run the correction again. Now, you can resume the pipeline using the original command

<wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio.filtered_subreads.fastq illumina.frg

This bug is fixed in the CVS source code and will be fixed in the next release of CA. You can download the patched executable compiled for Linux-amd64. Replace the correctPacBio executable with the downloaded one. For other platforms, you can patch the bug in CA 7.0 by downloading the CA 7.0 source. Modify the file AS_global.h in wgs-assembler/src (line #206) as above and then modify wgs-assembler/src/AS_PBR/CorrectPacBio.cc line #155 from:

}

to

} else {
   frgToLib[fr.gkFragment_getReadIID()] = FALSE;
}

Contact

Koren S, Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D., and Phillippy, A. M. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012)

If you encounter issues or have questions, please contact the authors of the pipeline, Sergey Koren (sergek AT umd.edu) or Adam M. Phillippy (aphillippy AT gmail.com)

Personal tools
Navigation
download
documentation