Whole-Genome Shotgun Assembler / Support Requests / #21 PBCR reads split after error correction

Sergey Koren - 2015-07-20

Hi,

I replied to your previous email from June. Basically, if there is no short read coverage of the long reads, they will be split because the hybrid correction relies on the short reads to generate the corrected read consensus. It seems from your results that most PacBio reads do not have short reads mapping to them. If you’re not already, I’d use at least CA 8.1 and follow the instructions here:
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29 http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29

This uses BLASR instead of the default overlapper which will work better for hybrid correction (you have to have BLASR/smrtportal installed and in your path). That said, the hybrid pipeline is quite slow and has not been updated since CA 8.1. I would instead recommend using just PacBio data for correction and assembly. The default settings will work fine if you have at least 30X but it looks like you have closer to 10X (based on your #reads, average length, and genome size 170000*10000 / 175000000). You try the recommended parameters from the wiki page:
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
However, having < 15X will significantly impact the quality of your assembly due to coverage limitations but you should still end up with usable corrected sequences without assembly. This mode also has the benefit that the pipeline will tailor it’s memory/resource usage to your system so you can specify an empty spec file (your spec file is relatively old and some of the options are no longer need to be explicitly specified).

You can also try one of the alternate hybrid correction methods (ECTools, LORDEC, proovread) but with 10X you would likely not be able to generate a good assembly from just the PacBio data.

Sergey

On Jul 15, 2015, at 10:47 PM, RAKESH rakeshponn@users.sf.net wrote:

[support-requests:#21] http://sourceforge.net/p/wgs-assembler/support-requests/21 PBCR reads split after error correction

Status: unread
Group: v1.0_(example)
Created: Thu Jul 16, 2015 02:46 AM UTC by RAKESH
Last Updated: Thu Jul 16, 2015 02:46 AM UTC
Owner: nobody

Hello,

I've attempted to use PBCR to error correct pacbio reads ( total reads = 170,000) with illumina short reads (35M, read length-100, single end). the results seem puzzling.

I’ve plotted read length for Pacbio reads before and after error correction.It appears that during the error correction process, PBCR removes a lot of long reads and error correct only short reads. Also after error correction the number of reads have increased and the mean read length have dropped from 10kb to 350bp? Could you advise why would this may have happened. Any configuration changes I missed on my end. It appears that long reads are split in to shorter reads and reads longer than 5 kb are discarded and not error corrected.

My understanding is long reads are error corrected by shorter reads. So the read length of pacbio reads should be the same with nearly the same read numbers? My concern is many long reads have been deleted after error correction. I don’t think it makes sense to move forward for me with assembly using long reads just because of shorter read length.

Below is the command line I have used and pacbio.spec.

stopAfter=overlapper

original asm settings

utgErrorRate = 0.25
utgErrorLimit = 4.5

cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25

merSize=14

merylMemory = 150000
merylThreads = 20

ovlStoreMemory = 150000

grid info

useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -V -A assembly
sgeScript = -pe threads 20
sgeConsensus = -pe threads 20
sgeOverlap = -pe threads 20
sgeFragmentCorrection = -pe threads 20
sgeOverlapCorrection = -pe threads 20

ovlMemory=8GB --hashload 0.7

ovlHashBits = 26
ovlThreads = 20

for mer overlapper

merCompression = 1
merOverlapperSeedBatchSize = 200000
merOverlapperExtendBatchSize = 750000

frgCorrThreads = 20
frgCorrBatchSize = 200000
ovlCorrBatchSize = 500000

non-Grid settings, if you set useGrid to 0 above these will be used

merylMemory = 150000
merylThreads = 20

ovlConcurrency = 20

cnsConcurrency = 20

merOverlapperThreads = 20
merOverlapperSeedConcurrency = 20
merOverlapperExtendConcurrency = 20

frgCorrConcurrency = 20
ovlCorrConcurrency = 20
cnsConcurrency = 20

nohup fastqToCA -libraryname MATED -reads /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &

nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &

Best Regards,
R

Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/wgs-assembler/support-requests/21/ https://sourceforge.net/p/wgs-assembler/support-requests/21
To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/ https://sourceforge.net/auth/subscriptions

Related

Support Requests: ~~#21~~

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- RAKESH - 2015-07-21
  
  Thanks Sergey. Appreciate your response. Would give a try with other
  assemblers and see what I get.
  Regards,
  RP
  
  On Mon, Jul 20, 2015 at 3:41 PM, Sergey Koren skoren@users.sf.net wrote:
  
  Hi,
  
  I replied to your previous email from June. Basically, if there is no
  short read coverage of the long reads, they will be split because the
  hybrid correction relies on the short reads to generate the corrected read
  consensus. It seems from your results that most PacBio reads do not have
  short reads mapping to them. If you’re not already, I’d use at least CA 8.1
  and follow the instructions here:
  
  http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29
  http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29
  
  This uses BLASR instead of the default overlapper which will work better
  for hybrid correction (you have to have BLASR/smrtportal installed and in
  your path). That said, the hybrid pipeline is quite slow and has not been
  updated since CA 8.1. I would instead recommend using just PacBio data for
  correction and assembly. The default settings will work fine if you have at
  least 30X but it looks like you have closer to 10X (based on your #reads,
  average length, and genome size 170000*10000 / 175000000). You try the
  recommended parameters from the wiki page:
  
  http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
  http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
  However, having < 15X will significantly impact the quality of your
  assembly due to coverage limitations but you should still end up with
  usable corrected sequences without assembly. This mode also has the benefit
  that the pipeline will tailor it’s memory/resource usage to your system so
  you can specify an empty spec file (your spec file is relatively old and
  some of the options are no longer need to be explicitly specified).
  
  You can also try one of the alternate hybrid correction methods (ECTools,
  LORDEC, proovread) but with 10X you would likely not be able to generate a
  good assembly from just the PacBio data.
  
  Sergey
  
  On Jul 15, 2015, at 10:47 PM, RAKESH rakeshponn@users.sf.net wrote:
  
  [support-requests:#21]
  http://sourceforge.net/p/brlcad/support-requests/21
  http://sourceforge.net/p/wgs-assembler/support-requests/21 PBCR reads
  split after error correction
  
  Status: unread
  Group: v1.0_(example)
  Created: Thu Jul 16, 2015 02:46 AM UTC by RAKESH
  Last Updated: Thu Jul 16, 2015 02:46 AM UTC
  Owner: nobody
  
  Hello,
  
  I've attempted to use PBCR to error correct pacbio reads ( total reads =
  170,000) with illumina short reads (35M, read length-100, single end). the
  results seem puzzling.
  
  I’ve plotted read length for Pacbio reads before and after error
  correction.It appears that during the error correction process, PBCR
  removes a lot of long reads and error correct only short reads. Also after
  error correction the number of reads have increased and the mean read
  length have dropped from 10kb to 350bp? Could you advise why would this may
  have happened. Any configuration changes I missed on my end. It appears
  that long reads are split in to shorter reads and reads longer than 5 kb
  are discarded and not error corrected.
  
  My understanding is long reads are error corrected by shorter reads. So
  the read length of pacbio reads should be the same with nearly the same
  read numbers? My concern is many long reads have been deleted after error
  correction. I don’t think it makes sense to move forward for me with
  assembly using long reads just because of shorter read length.
  
  Below is the command line I have used and pacbio.spec.
  
  stopAfter=overlapper
  
  original asm settings
  
  utgErrorRate = 0.25
  utgErrorLimit = 4.5
  
  cnsErrorRate = 0.25
  cgwErrorRate = 0.25
  ovlErrorRate = 0.25
  
  merSize=14
  
  merylMemory = 150000
  merylThreads = 20
  
  ovlStoreMemory = 150000
  
  grid info
  
  useGrid = 0
  scriptOnGrid = 0
  frgCorrOnGrid = 0
  ovlCorrOnGrid = 0
  
  sge = -V -A assembly
  sgeScript = -pe threads 20
  sgeConsensus = -pe threads 20
  sgeOverlap = -pe threads 20
  sgeFragmentCorrection = -pe threads 20
  sgeOverlapCorrection = -pe threads 20
  
  ovlMemory=8GB --hashload 0.7
  
  ovlHashBits = 26
  ovlThreads = 20
  
  for mer overlapper
  
  merCompression = 1
  merOverlapperSeedBatchSize = 200000
  merOverlapperExtendBatchSize = 750000
  
  frgCorrThreads = 20
  frgCorrBatchSize = 200000
  ovlCorrBatchSize = 500000
  
  non-Grid settings, if you set useGrid to 0 above these will be used
  
  merylMemory = 150000
  merylThreads = 20
  
  ovlConcurrency = 20
  
  cnsConcurrency = 20
  
  merOverlapperThreads = 20
  merOverlapperSeedConcurrency = 20
  merOverlapperExtendConcurrency = 20
  
  frgCorrConcurrency = 20
  ovlCorrConcurrency = 20
  cnsConcurrency = 20
  
  nohup fastqToCA -libraryname MATED -reads
  /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &
  
  nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq
  chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &
  
  Best Regards,
  R
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/wgs-assembler/support-requests/21/
  https://sourceforge.net/p/wgs-assembler/support-requests/21
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  https://sourceforge.net/auth/subscriptions
  
  [support-requests:#21]
  http://sourceforge.net/p/brlcad/support-requests/21 PBCR reads split
  after error correction*
  
  Status: unread
  Group: v1.0_(example)
  Created: Thu Jul 16, 2015 02:46 AM UTC by RAKESH
  Last Updated: Thu Jul 16, 2015 02:47 AM UTC
  Owner: nobody
  
  Hello,
  
  I've attempted to use PBCR to error correct pacbio reads ( total reads =
  170,000) with illumina short reads (35M, read length-100, single end). the
  results seem puzzling.
  
  I’ve plotted read length for Pacbio reads before and after error
  correction.It appears that during the error correction process, PBCR
  removes a lot of long reads and error correct only short reads. Also after
  error correction the number of reads have increased and the mean read
  length have dropped from 10kb to 350bp? Could you advise why would this may
  have happened. Any configuration changes I missed on my end. It appears
  that long reads are split in to shorter reads and reads longer than 5 kb
  are discarded and not error corrected.
  
  My understanding is long reads are error corrected by shorter reads. So
  the read length of pacbio reads should be the same with nearly the same
  read numbers? My concern is many long reads have been deleted after error
  correction. I don’t think it makes sense to move forward for me with
  assembly using long reads just because of shorter read length.
  
  Below is the command line I have used and pacbio.spec.
  
  stopAfter=overlapper
  original asm settings
  
  utgErrorRate = 0.25
  utgErrorLimit = 4.5
  
  cnsErrorRate = 0.25
  cgwErrorRate = 0.25
  ovlErrorRate = 0.25
  
  merSize=14
  
  merylMemory = 150000
  merylThreads = 20
  
  ovlStoreMemory = 150000
  grid info
  
  useGrid = 0
  scriptOnGrid = 0
  frgCorrOnGrid = 0
  ovlCorrOnGrid = 0
  
  sge = -V -A assembly
  sgeScript = -pe threads 20
  sgeConsensus = -pe threads 20
  sgeOverlap = -pe threads 20
  sgeFragmentCorrection = -pe threads 20
  sgeOverlapCorrection = -pe threads 20
  ovlMemory=8GB --hashload 0.7
  
  ovlHashBits = 26
  ovlThreads = 20
  for mer overlapper
  
  merCompression = 1
  merOverlapperSeedBatchSize = 200000
  merOverlapperExtendBatchSize = 750000
  
  frgCorrThreads = 20
  frgCorrBatchSize = 200000
  ovlCorrBatchSize = 500000
  non-Grid settings, if you set useGrid to 0 above these will be used
  
  merylMemory = 150000
  merylThreads = 20
  
  ovlConcurrency = 20
  
  cnsConcurrency = 20
  
  merOverlapperThreads = 20
  merOverlapperSeedConcurrency = 20
  merOverlapperExtendConcurrency = 20
  
  frgCorrConcurrency = 20
  ovlCorrConcurrency = 20
  cnsConcurrency = 20
  
  nohup fastqToCA -libraryname MATED -reads
  /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &
  
  nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq
  chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &
  
  Best Regards,
  R
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/wgs-assembler/support-requests/21/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  Related
  
  Support Requests: ~~#21~~
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sergey Koren - 2015-11-16

status: unread --> closed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

PBCR reads split after error correction

Group

Searches

Help

#21 PBCR reads split after error correction

original asm settings

grid info

ovlMemory=8GB --hashload 0.7

for mer overlapper

non-Grid settings, if you set useGrid to 0 above these will be used

Related

Discussion

Related

Related