Hello,
I've attempted to use PBcR to error-correct PacBio reads (170,000 reads total) with Illumina short reads (35M reads, read length 100 bp, single end), and the results seem puzzling.
I've plotted the read-length distribution of the PacBio reads before and after error correction. It appears that during correction PBcR removes many of the long reads and error-corrects only the short ones. Also, after correction the number of reads has increased and the mean read length has dropped from 10 kb to 350 bp. Could you advise why this may have happened, and whether there are configuration changes I missed on my end? It looks as if long reads are being split into shorter reads, and reads longer than 5 kb are discarded rather than corrected.
My understanding is that the long reads are error-corrected using the shorter reads, so the PacBio read lengths, and nearly the same number of reads, should be preserved. My concern is that many long reads have been deleted during correction. Given the much shorter read lengths, I don't think it makes sense for me to move forward with an assembly based on these reads.
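For reference, here is a minimal sketch (plain Python; the file names and the corrected-read output name are assumptions, substitute your actual paths) of how the before/after length distributions can be summarized directly from the FASTQ files, which makes the shift described above easy to quantify:

```python
def read_lengths(fastq_lines):
    """Lengths of all reads in a FASTQ stream (4 lines per record)."""
    return [len(line.rstrip("\n"))
            for i, line in enumerate(fastq_lines)
            if i % 4 == 1]  # the sequence is the 2nd line of each record

def summarize(name, lengths):
    mean = sum(lengths) / len(lengths)
    print(f"{name}: n={len(lengths)}, mean={mean:.0f} bp, max={max(lengths)} bp")

# Hypothetical usage -- substitute your input file and PBcR's corrected output:
# with open("chr1_longreads_unmapped_longreads.fastq") as fq:
#     summarize("before", read_lengths(fq))
# with open("corrected.fastq") as fq:  # output name is an assumption
#     summarize("after", read_lengths(fq))
```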
Below are the command lines I used and my pacbio.spec.
stopAfter=overlapper
utgErrorRate = 0.25
utgErrorLimit = 4.5
cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25
merSize=14
merylMemory = 150000
merylThreads = 20
ovlStoreMemory = 150000
useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0
sge = -V -A assembly
sgeScript = -pe threads 20
sgeConsensus = -pe threads 20
sgeOverlap = -pe threads 20
sgeFragmentCorrection = -pe threads 20
sgeOverlapCorrection = -pe threads 20
ovlHashBits = 26
ovlThreads = 20
merCompression = 1
merOverlapperSeedBatchSize = 200000
merOverlapperExtendBatchSize = 750000
frgCorrThreads = 20
frgCorrBatchSize = 200000
ovlCorrBatchSize = 500000
ovlConcurrency = 20
cnsConcurrency = 20
merOverlapperThreads = 20
merOverlapperSeedConcurrency = 20
merOverlapperExtendConcurrency = 20
frgCorrConcurrency = 20
ovlCorrConcurrency = 20
nohup fastqToCA -libraryname MATED -reads /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &
nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &
Best Regards,
R
Hi,
I replied to your previous email from June. Basically, if there is no short read coverage of the long reads, they will be split because the hybrid correction relies on the short reads to generate the corrected read consensus. It seems from your results that most PacBio reads do not have short reads mapping to them. If you’re not already, I’d use at least CA 8.1 and follow the instructions here:
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29
This uses BLASR instead of the default overlapper, which works better for hybrid correction (you have to have BLASR/smrtportal installed and in your path). That said, the hybrid pipeline is quite slow and has not been updated since CA 8.1. I would instead recommend using just the PacBio data for correction and assembly. The default settings will work fine if you have at least 30X, but it looks like you have closer to 10X (based on your number of reads, average length, and genome size: 170000 * 10000 / 175000000). You can try the recommended parameters from the wiki page:
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
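The coverage estimate above is simple arithmetic; spelled out with the numbers from this thread:

```python
# All figures are taken from the thread itself.
n_reads = 170_000          # total PacBio reads
mean_length = 10_000       # mean read length before correction (~10 kb)
genome_size = 175_000_000  # genomeSize passed to PBcR

coverage = n_reads * mean_length / genome_size
print(f"~{coverage:.1f}X PacBio coverage")  # ~9.7X, well short of the 30X default
```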
However, having < 15X will significantly impact the quality of your assembly due to coverage limitations, though you should still end up with usable corrected sequences even without assembling them. This mode also has the benefit that the pipeline will tailor its memory/resource usage to your system, so you can specify an empty spec file (your spec file is relatively old, and some of the options no longer need to be specified explicitly).
You can also try one of the alternate hybrid correction methods (ECTools, LoRDEC, proovread); with 10X you would likely not be able to generate a good assembly from the PacBio data alone.
Sergey
Related Support Requests: #21

Thanks Sergey, appreciate your response. I'll give other assemblers a try and see what I get.
Regards,
RP
On Mon, Jul 20, 2015 at 3:41 PM, Sergey Koren skoren@users.sf.net wrote: