PacBioToCA
From wgs-assembler
Hybrid error correction and de novo assembly of single-molecule sequencing reads
Sergey Koren, Michael Schatz, Brian Walenz, Jeffrey Martin, Jason Howard, Ganeshkumar Ganapathy, Zhong Wang, David A. Rasko, W. Richard McCombie, Erich D. Jarvis, and Adam M. Phillippy,
About PacBio RS
|
The PacBio RS sequencing instrument from Pacific Biosciences is a single-molecule sequencer that is able to generate long-reads (CLR). Unfortunately, these sequences have a low accuracy, averaging only 83-86%. While It is possible to read each circularized molecule multiple times to get higher accuracy sequences (CCS), this produces significantly shorter sequences.
About pacBioToCA
The pacBioToCA script is a correction pipeline to enable the use of the long-read sequences produced by the PacBio RS instrument. To algorithmically deal with the error, we support either alternate high-identity sequences (454, Illumina, or PacBio circularized sequences) or solely C2 or newer sequencing to re-call accurate consensus. Celera Assembler includes a pipeline, pacBioToCA, to generate FRG File (as well as fasta and qual) from PacBio RS sequences given short-read high-identity data or C2 CLR data to correct with.
The PacBio corrected sequences (PBcR) are output in fasta/qual formats as well as a Celera Assembler FRG file. If you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results.
If you are having trouble, please check the known issues and known Celera Assembler issues to see if your problem is documented. If you are using sequences shorter than 100bp for correction, please check the required parameters section instruction on how to input them.
Announcements
- April 15, 2013 - Pre-print of Reducing assembly complexity of microbial genomes with single-molecule sequencing posted on arXiv. Supporting materials and data are also posted at CBCB
- April 4, 2013 - Updated documentation to describe recommended Celera Assembler parameters and Quiver polishing of consensus for assembly of corrected sequences. Please see assembly of corrected sequences for more information.
- March 19, 2013 - Updated documentation to describe new option of self-correction. pacBioToCA can correct PacBio RS C2 data (or newer) using solely long-read low-accuracy data. Please see self-correction for more information.
- March 18, 2013 - Updated documentation to describe new support for both BLASR or Bowtie 2 for alignment (in addition to CA's overlapper).
- August 9, 2012 - An NGS Leaders webinar by Adam Phillippy and Sergey Koren is now available at YouTube.
- July 1, 2012 - The paper on correction and assembly is now available as an advance online publication from Nature Biotechnology.
- June 20, 2012 - New known issue: using Illumina sequences shorter than 100bp for correction requires adjusted parameters. Please see the known issues for the required parameters.
- May 4, 2012 - New known issue: PacBio SMRT-portal 1.3.0 outputs fastq files in a non-standard format. If you encounter errors running the pipeline or have no corrected sequences output, please see the known issues.
- Feb. 15, 2012 - Presentation on error correction and assembly at AGBT
- Jan 19, 2012 - Presentation on SMRT-Assembly at PAG (video)
- Jan 16, 2012 - CA 7.0 release as source or pre-compiled for 64-bit Linux.
Usage
The pacBioToCA pipeline is available in CA 7.0, which is available either as source or pre-compiled for 64-bit Linux.
If you are interested in using the new self-correction feature, you must follow the instructions to build Celera Assembler from source following the compilation instructions. Be sure to instal both the kmer package and samtools when building and adjust the AS_READ_MAX_NORMAL_LEN_BITS to 15.
usage:pacBioToCA [options] -s spec.file -fastq fastqfile <frg> -length Minimum length of PacBio RS fragment to keep. -partitions Number of partitions for consensus -l libraryname Name of the library; freeformat text. -t threads Number of threads to use for correction. -noclean Do not clean up after the correction finishes. By default, the temp<libraryname> directory will be removed on successful completion of the correction. -sge Optional parameters to pass through to qsub for SGE. If your spec file specifies an sge line, it will be used by default. -sgeCorrection Optional parameters to pass through when running the correction step to qsub for SGE. If your spec file specifies an sgeScript line, it will be used by default. Writes a CA FRG file consisting of a single LIB message to <libraryname>.frg. Also stores fasta and qual sequences (named <libraryname>.fasta and <libraryname>.qual).
The options (-length, -partitions, and -threads, -noclean, -sge, and -sgeCorrection) are optional.
The -libraryname option is mandatory. It provides a UID name for the library that the reads are placed into. This name is completely up to the end user, but must contain no white spaces or commas.
The -s pacbio.spec option is mandatory. It should point to a a CA spec file to specify how to overlap the illumina and PacBio data for correction (which comes included with the sample data.) Two templates are provided for SGE and high-memory multi-core environment.
The -t X option specifies that the correction program itself is run using X threads. It should be set to the number of processors available wherever this pipeline runs.
The -fastq specifies the pacbio fragments to correct. The correction will also work with unfiltered reads if you don't have a filtered fastq file.
You can specify an arbitrary number of frg files to be used as the trusted sequences for correction. They can be any format CA accepts. Please see the FastqToCA and SffToCA pages for information on inputting Illumina or 454 data. Generally, you don't need more than about 50X of high-identity sequences for correction.
Other Requirements
The pipeline also requires some files from the AMOS project (amos.sf.net). If you have SMRTportal in your path, you will have AMOS in your path and do not need to do anything. Otherwise, any release (such as 3.0.0 onward) should work. By default, pacBioToCA will first look for AMOS to be installed in the same directory as CA. That is if you have a directory called /work/CA/ with executables under Linux-amd64/bin, pacBioToCA will check for /work/AMOS/bin. Next, it will check the path for AMOS executables. If AMOS is not found, the pipeline will stop and report an error. If you are using self-correction, SMRTportal must also be in your path.
Inputting PacBio RS Sequences
The pacBioToCA pipeline expects PacBio RS sequences in fastq format (with sanger (PHRED32) quality values). The pbh5tools package from Pacific Biosciences converts h5 files to fasta or fastq files. The currently recommended parameters for correction are quality = 0.75 and minimum length = 100.
If you already have fasta files, you can download a Java utility to convert the fasta files into pacBioToCA compatible fastq files:
tar xvzf convertFastaAndQualToFastq.tar.gz convertFastaAndQualToFastq.class convertFastaAndQualToFastq.java Utils$Pair.class Utils$ToProtein.class Utils$Translate.class Utils.class Utils.java java convertFastaAndQualToFastq myfile.fasta > myfile.fastq
NOTE: If you are using the SMRT-portal version 1.3, please see the known issues if you encounter errors or no corrected sequences are output.
Self-Correction With C2 Sequences (or newer)
Starting with the C2 chemistry (and the more recent C2-XL and XL-XL) chemistries, you can self-correct PacBio RS sequences without any high-identity technology. Using this pipeline requires the latest CA built from source including SAMtools. You must also have SMRTportal installed and available in your path. SMRTportal is freely available from the PacBio RS DevNet website. Approximately 100X (or more) of C2 sequencing is recommended. If you have less PacBio RS sequencing than this, then follow the instructions for correction with complementary high-identity sequences.
To run the pipeline, follow the usage instructions below.
Self-correction Example with E. coli Reads
This example is applicable for sufficient coverage of C2 or newer sequencing chemistry sequences. In this case, self-correction can be used, eliminating the requirement for complementary high-accuracy short-read sequences. Download the sample data from http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz
tar xvzf selfSampleData.tar.gz x selfSampleData/ x selfSampleData/asm.spec x selfSampleData/ecoli.fasta x selfSampleData/pacbio_filtered.fastq.bz2 x selfSampleData/pacbio.spec x selfSampleData/README x selfSampleData/asm.SGE.spec x selfSampleData/pacbio.SGE.spec ls selfSampleData/ README instructions on running the correction asm.spec CA spec file for assembly. Assumes a high-memory multi-core machine. asm.SGE.spec CA spec file for assembly. Assumes an SGE environment. ecoli.fasta The reference E. coli K-12 MG1655 sequence pacbio_filtered.fastq.bz2 The PacBio RS sequences (with high-error) after filtering through the secondary pipeline (100X coverage). pacbio.spec CA spec file to be used for correction. Assumes a high-memory multi-core machine. pacbio.SGE.spec CA spec file to be used for correction. Assumes an SGE environment.
Correct PacBio RS single-pass sequences
Run the correction to generate corrected pacbio sequences.
<wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio_filtered.fastq.bz2 longReads=1 genomeSize=4650000 > run.out 2>&1
Assemble corrected sequences
After the correction pipeline runs, you will have a pacbio.frg file in the selfSampleData directory. There is also be pacbio.fasta and a pacbio.qual files. You can use the pacbio.frg file for a CA assembly. Note, if you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results. To select the longest 25X, given a genome size of 4.65Mbp, we need 116Mbp in PBcR sequences.
<wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > pacbio.25X.frg
Finally, assemble the longest 25X of corrected data
<wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.25X.frg > asm.out 2>&1
Substitute your CA location for <wgs> and your machine type for <Linux-amd64>. After this you should have an assembled ecoli in asm/9-terminator/asm.ctg.fasta.
head -n 50 asm/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=4641790 MeanBasesInScaffolds=4641790 MinBasesInScaffolds=4641790 MaxBasesInScaffolds=4641790 N25ScaffoldBases=4641790 N50ScaffoldBases=4641790 N75ScaffoldBases=4641790 ScaffoldAt1000000=4641790 TotalSpanOfScaffolds=4641790 MeanSpanOfScaffolds=4641790 MinScaffoldSpan=4641790 MaxScaffoldSpan=4641790 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=4641790 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,4641790,4641790,4641790,0,7180000000002 total=1,4641790,4641790,4641790,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=4641790 TotalVarRecords=16815 MeanContigLength=4641790 MinContigLength=4641790 MaxContigLength=4641790 N25ContigBases=4641790 N50ContigBases=4641790 N75ContigBases=4641790 ContigAt1000000=4641790 [BigContigs_greater_10000] TotalBigContigs=1 BigContigLength=4641790 MeanBigContigLength=4641790 MinBigContig=4641790 MaxBigContig=4641790 BigContigsPercentBases=100.00
The assembly is comprised of one long contig (the full E. coli chromosome). We can check the number of sequences within each contig:
cat asm/9-terminator/asm.posmap.frgctg |awk '{print $2}' |sort |uniq -c
22548 7180000000001
The first column gives the sequence count in each contig. Generally, contigs comprised of a few sequences (less than 50) can safely be ignored in downstream analysis.
Correction Using Complementary High-Identity Sequences
Inputting High-Identity Sequences
The below example is applicable if you have less than 50X coverage of PacBio RS data or you have C1 or pre-release C2 chemistry. In this case, high-identity sequences are required to correct PacBio RS sequence. You can correct the PacBio RS sequences using any high-identity next-generation technology. This include Illumina, 454, and PacBio CCS sequences. The pacBioToCA pipeline accepts an arbitrary number of frg files at the end of the command.
- For Illumina data, use the fastqToCA command. For examples, see the phage example below and the usage page.
- For 454 data, use sffToCA.
- For PacBio RS CCS sequences, use the convert-fasta-to-v2 command. CCS sequences must be input as FastaA, not FastQ. For more information, see the detailed usage. To convert a CCS fasta file named ccs.fasta (with corresponding qualities in ccs.qual) to a frg file run:
convert-fasta-to-v2.pl -pacbio -s ccs.fna -q ccs.qual -l pacBioCCS > css.frg
If you CCS data is in fastq format, you can download a Java utility. To convert a fastq file run:
java convertFastqToFastaAndQual ccs.fastq ccs.fna ccs.qual
then convert the fasta/qual files as above.
Download the phage sample data available http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz
tar xvzf sampleData.tar.gz x sampleData/ x sampleData/asm.spec x sampleData/illumina.fastq x sampleData/lambda.fasta x sampleData/pacbio.filtered_subreads.fasta x sampleData/pacbio.spec x sampleData/README x sampleData/asm.SGE.spec x sampleData/pacbio.SGE.spec ls sampleData/ README instructions on running the correction asm.spec CA spec file for assembly. Assumes a high-memory multi-core machine. asm.SGE.spec CA spec file for assembly. Assumes an SGE environment. illumina.fastq Illumina unpaired sequences to be used for correction lambda.fasta The reference phage sequence pacbio.filtered_subreads.fasta The PacBio RS sequences (with high-error) after filtering through the secondary pipeline. pacbio.spec CA spec file to be used for correction. Assumes a high-memory multi-core machine. pacbio.SGE.spec CA spec file to be used for correction. Assumes an SGE environment.
Processing complementary data
First, convert illumina.fastq to CA frg format:
cd sampleData/ <wgs>/<Linux-amd64>/bin/fastqToCA -libraryname illumina -technology illumina -type sanger -innie -reads illumina.fastq > illumina.frg
Correct PacBio RS single-pass sequences
Next, run the correction to generate corrected pacbio sequences. This example uses the Java utility above to convert the PacBio RS fasta file to fastq.
java convertFastaAndQualToFastq pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fastq <wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1
Assemble corrected sequences
After the correction pipeline runs, you will have a pacbio.frg file in the sampleData directory. There is also be pacbio.fasta and a pacbio.qual files. You can use the pacbio.frg file for a CA assembly. Note, if you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results.
To subset the longest 25X of sequences (only supported when CA is built from source). Given a genome size of 50Kbp, we need 1Mbp in PBcR sequences.
<wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 1000000 asm.gkpStore > pacbio.25X.frg
Assemble the data (if you are using the longest 25X)
<wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.25X.frg > asm.out 2>&1
or
<wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.frg > asm.out 2>&1
otherwise.
Substitute your CA location for <wgs> and your machine type for <Linux-amd64>. After this you should have an assembled phage in asm/9-terminator/asm.ctg.fasta.
head -n 40 asm/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=48452 MeanBasesInScaffolds=48452 MinBasesInScaffolds=48452 MaxBasesInScaffolds=48452 N25ScaffoldBases=48452 N50ScaffoldBases=48452 N75ScaffoldBases=48452 TotalSpanOfScaffolds=48452 MeanSpanOfScaffolds=48452 MinScaffoldSpan=48452 MaxScaffoldSpan=48452 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=48452 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,48452,48452,48452,0,7180000000002 total=1,48452,48452,48452,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=48452 TotalVarRecords=11 MeanContigLength=48452 MinContigLength=48452 MaxContigLength=48452 N25ContigBases=48452 N50ContigBases=48452 N75ContigBases=48452 [BigContigs_greater_10000]
Illumina-only assembly comparison
For comparison, you can assemble the Illumina data alone:
<wgs>/<Linux-amd64>/bin/runCA -p asm -d illuminaASM -s asm.spec illumina.frg > asm.out 2>&1
head -n 40 illuminaASM/9-terminator/asm.qc [Scaffolds] TotalScaffolds=3 TotalContigsInScaffolds=3 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=48635 MeanBasesInScaffolds=16212 MinBasesInScaffolds=6606 MaxBasesInScaffolds=27998 N25ScaffoldBases=27998 N50ScaffoldBases=27998 N75ScaffoldBases=14031 TotalSpanOfScaffolds=48635 MeanSpanOfScaffolds=16212 MinScaffoldSpan=6606 MaxScaffoldSpan=27998 IntraScaffoldGaps=0 2KbScaffolds=3 2KbScaffoldSpan=48635 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,27998,27998,27998,0,7180000000010 1=1,14031,14031,14031,0,7180000000011 2=1,6606,6606,6606,0,7180000000012 total=3,48635,48635,16212,0 [Contigs] TotalContigsInScaffolds=3 TotalBasesInScaffolds=48635 TotalVarRecords=0 MeanContigLength=16212 MinContigLength=6606 MaxContigLength=27998 N25ContigBases=27998 N50ContigBases=27998 N75ContigBases=14031
Assembly of Corrected Sequences
Quality Trimming
By default, pacBioToCA will trim the corrected sequences to a quality value of 54.5. In case there is low coverage of the genome or the sequences are short (especially when using self-correction), it may be beneficial to try assembling with alternate QV trim values. Assuming pacBioToCA was run with library name <libraryname>, the following command allows custom QV trimming
<wgs>/<Linux-amd64>/bin/trimFastqByQVWindow temp<libraryname>/corrected.fasta temp<libraryname>/corrected.qual trimmedQV.fasta trimmedQV.qual <QV value> > trimmedQV.fastq <wgs>/<Linux-amd64>/bin/fastqToCA -libraryname trimmedQV -technology pacbio -type sanger -reads trimmedQV.fastq > trimmedQV.frg
The resulting trimmedQV.frg file can now be used for assembly. If the assembly using the default setting is poor quality, the recommended alternate values to to try are 52.5, 56.5, and 59.5.
Celera Assembler Parameters
Celera Assembler, like many OLC assemblers, is sensitive to high coverage and noise, especially near the ends of sequences. To remove some of this noise, we recommend using the longest 25X of sequence to assemble. Given an E. coli K12 genome (4.65Mbp genome size), 25X coverage is approximately 116MBp. The commands below will subset an frg file to only 25X of the longest sequences.
<wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > pacbio.25X.frg
Finally, some Celera Assembler parameters can be adjusted to make overlap detection more stringent and eliminate false overlaps when assembly only PacBio corrected long sequences. The <ovl value> parameter should be set to approximately 40% of your average corrected sequence lengths. As a general rule, if the average corrected length is less than 2.5Kbp, set it to 1000, if it is less than 3Kbp, set it to 1500, if it is less than 5.5Kbp, set it to 2000, if it is greater than 5.5Kbp, set it to 2500, and if it is greater than 6.5Kbp, set it to 3000. The currently recommended parameters are
<wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 utgMergeErrorRate=0.03 utgMergeErrorLimit=0
When you are combining PacBio corrected sequences with other technologies (454, etc), the alternate parameters are
<wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14
Polishing Consensus
The final assembly consensus is normally QV45 of higher. It can be improved by post-processing using the Quiver consenus algorithm (the default for the RS_Resequencing protocol in SMRT®Portal 1.4). Import the assembly into the SMRTPortal and run the RS_Resequencing protocol to generate a final polished assembly.
Additional Options
Replacing Celera Assembler Overlapper
Instead of using CA's built-in overlapper, it is possible to use BLASR, Bowtie 2, or an arbitrary aligner to find overlaps between PacBio RS sequences and high-identity sequences when CA is built from source. Note that BLASR is automatically enabled if you are using self-correction.
To enable BLASR add:
"blasr=-advanceHalf -noRefineAlign -ignoreQuality -minMatch 10 -maxLCPLength 15 -minPctIdentity 70.0"
to the spec file or as a parameter to pacBioToCA.
To enable Bowtie 2, specify
"bowtie = -N 0 -L 10 --rfg 2,2 --rdg 2,2 -R 3 -i S,1,1.25"
to the spec file or as a parameter to pacBioToCA.
To use an arbitrary aligner, generate alignments with PacBio RS sequences as the reference and high-identity sequences as the reads in SAM format. Then, given sam files named sam0001.out, sam002.out, sam0003.out,
find -L ./ -name \\*.sam -print | sort > fofn echo "samFOFN=fofn" >> pacbio.spec
Filling Coverage Gaps With PacBio RS Sequences
In older versions of pacBioToCA, when there was a coverage gap in the short-read coverage, the PacBio RS sequence was split. This caused issues when the short-read data had coverage gaps due to sequencing bias. In the latest version available when CA is built from source, short-reads can be used to recruit PacBio RS sequences to coverage gaps. This option has only been tested on genomes less than 10Mbp in size. The corrected sequences still have a high-quality consensus above 99% accuracy. On a test run of E. coli K-12, adding the option increased the mean corrected sequence length to 4,187 from 2,493 and improved the assembly N50 to 4.65Mbp from 3.32Mbp. To enable this option specify
"maxUncorrectedGap=1500"
in the spec file or as a parameter to pacBioToCA.
Known Issues
Celera Assembler Known Issues
For the latest issues found in the Celera Assembler release, please see version 7.0 known issues page. You can also check the Celera Assembler bugs page for any reported issues.
Correcting Large (> 100Mbp) Genomes
pacBioToCA and Celera Assembler require adjusted parameters for optimal performance on large genomes. The default parameters in the pacbio.spec file are configured to fully utilize a machine with 16 cores and 48GB of RAM. For large genomes (i.e. 1GB), an SGE grid or a high-memory machine (256GB) of RAM are recommended for optimal performance. The recommended parameters for large genomes are:
ovlHashBits = 25 ovlHashBlockLength = 1000000000 ovlRefBlockSize = 100000000 doOverlapBasedTrimming = 0
This will make each overlapper job use approximately 18GB of RAM (up from the default of 6GB of ram). You should set the ovlConcurrency to be:
<total memory> / 18
For example, for a 256GB machine, ovlConcurrency would be 14. The ovlThreads parameter should be set to:
<total cores> / ovlConcurrency
So for a 256GB machine with 32 cores, ovlThreads would be 2 (assuming ovlConcurrency was set to 14 above). When using an SGE grid, the sgeOverlap parameter should also be set to request the appropriate threads and memory:
sgeOverlap = -pe threads 2 -l mem=10GB
For full details on tuning these parameters, please see the Celera Assembler runCA wiki page.
If you already have a correction running, you can tell the number of overlap jobs created by running
wc -l temp<your library name>/0-overlaptrim-overlapper/ovlopt
If the number of overlap jobs is large (over 400), then the parameters above are recommended over the default settings.
Low Number of Output Corrected Sequences
Correcting a combination of multiple PacBio libraries has sometimes resulted in poor throughput, with only a few long corrected PacBio RS reads remaining after correction. The current recommendation is to correct multiple PacBio RS libraries separately and combine the corrected reads for subsequence analysis. Note that is it OK to combine multiple PacBio RS cells run from the same library for correction.
Short-read sequences below 64bp
If you have short sequences for correction (< 64bp), the pipeline will by default not load them and will give a no-overlaps error instead of outputting corrected sequences. In order to use short Illumina sequences, you have to specify the-shortReadsoption to the pipeline. Also, add the following parameters at the end of the pacbio.spec file
frgMinLen = 40 # smaller than your read length ovlMinLen = 30 # about 75% of your frgMinLen merSize = 10
This is currently only recommended for small (<10Mbp) genomes as it will increase runtime.
If you have Illumina sequences longer than 64bp but shorter than 100bp, have to specify the-shortReadsoption to the pipeline along with the following parameter to pacbio.spec:
merSize = 10
No Overlaps Found
The PacBio SMRT-portal version 1.3.0 outputs a non-standard fastq file from their filtering pipeline. This causes the pacBioToCA pipeline to either stop with an error or output no corrected sequences. You can check if your fastq file is the cause of the problem if it splits the sequence and quality on multiple lines:
@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG TCANCCCTGCATTATAAGAAGTACCGCAGGCGACG ATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC + EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFID HHIGHIHHGHIHIGHGIGHGGIC @HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG TAANGGGCTGACAACGCATTACACAAAACTCAAAC ACAACAACCGTAACCACCGCGGTTCAATGGGACTG G + EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGG GIEIIHIDHIIHIFHEGGHIF=
versus the expected format of:
@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG TCANCCCTGCATTATAAGAAGTACCGCAGGCGACGATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC + EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFIDHHIGHIHHGHIHIGHGIGHGGIC @HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG TAANGGGCTGACAACGCATTACACAAAACTCAAACACAACAACCGTAACCACCGCGGTTCAATGGGACTGG + EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGGGIEIIHIDHIIHIFHEGGHIF=
As a workaround, you can use the fasta file and convert it to fastq following the instructions above. Alternatively, you can use a perl script to convert non-standard fastq files to ones pacBioToCA expects. Thank you to Lee Katz for providing the script. To convert a non-standard fastq file named nonstandard.fastq to the correct format run the command:
perl run_assembly_convertMultiFastqToStandard.pl -i nonstandard.fastq -o standard.fastq
Run the script with no parameters to get help.
Error in runCorrection.sh Step
The CA 7.0 release has a bug in the correction pipeline that may sometimes lead to a crash. If your pipeline fails or hangs on the runCorrection.sh step:
----------------------------------------END Tue Jan 17 11:04:28 2012 (1 seconds) Failed to execute temppacbio/runCorrection.sh
The following guide will let you continue correction without losing any computation already completed:
cd temppacbio sh runCorrection.sh
The pipeline will recompute the failed step. If it fails again, edit the runCorrection.sh script and replace the line
-t <16> \
where <16> corresponds to the number of threads you used for correction (-t parameter) with
-t 1 \
and run the correction again. Now, you can resume the pipeline using the original command
<wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio.filtered_subreads.fastq illumina.frg
This bug is fixed in the CVS source code and will be fixed in the next release of CA. You can download the patched executable compiled for Linux-amd64. Replace the correctPacBio executable with the downloaded one. For other platforms, you can patch the bug in CA 7.0 by downloading the CA 7.0 source. Modify the file AS_global.h in wgs-assembler/src (line #206) as above and then modify wgs-assembler/src/AS_PBR/CorrectPacBio.cc line #155 from:
}
to
} else {
frgToLib[fr.gkFragment_getReadIID()] = FALSE;
}
Contact
Koren S, Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D., and Phillippy, A. M. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012)
If you encounter issues or have questions, please contact the authors of the pipeline, Sergey Koren (sergek AT umd.edu) or Adam M. Phillippy (aphillippy AT gmail.com)
