Menu

#21 PBCR reads split after error correction

v1.0_(example)
closed
nobody
None
5
2015-11-16
2015-07-16
RAKESH
No

Hello,

I've attempted to use PBCR to error correct pacbio reads ( total reads = 170,000) with illumina short reads (35M, read length-100, single end). the results seem puzzling.

I’ve plotted read length for Pacbio reads before and after error correction.It appears that during the error correction process, PBCR removes a lot of long reads and error correct only short reads. Also after error correction the number of reads have increased and the mean read length have dropped from 10kb to 350bp? Could you advise why would this may have happened. Any configuration changes I missed on my end. It appears that long reads are split in to shorter reads and reads longer than 5 kb are discarded and not error corrected.

My understanding is long reads are error corrected by shorter reads. So the read length of pacbio reads should be the same with nearly the same read numbers? My concern is many long reads have been deleted after error correction. I don’t think it makes sense to move forward for me with assembly using long reads just because of shorter read length.

Below is the command line I have used and pacbio.spec.

stopAfter=overlapper

original asm settings

utgErrorRate = 0.25
utgErrorLimit = 4.5

cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25

merSize=14

merylMemory = 150000
merylThreads = 20

ovlStoreMemory = 150000

grid info

useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -V -A assembly
sgeScript = -pe threads 20
sgeConsensus = -pe threads 20
sgeOverlap = -pe threads 20
sgeFragmentCorrection = -pe threads 20
sgeOverlapCorrection = -pe threads 20

ovlMemory=8GB --hashload 0.7

ovlHashBits = 26
ovlThreads = 20

for mer overlapper

merCompression = 1
merOverlapperSeedBatchSize = 200000
merOverlapperExtendBatchSize = 750000

frgCorrThreads = 20
frgCorrBatchSize = 200000
ovlCorrBatchSize = 500000

non-Grid settings, if you set useGrid to 0 above these will be used

merylMemory = 150000
merylThreads = 20

ovlConcurrency = 20

cnsConcurrency = 20

merOverlapperThreads = 20
merOverlapperSeedConcurrency = 20
merOverlapperExtendConcurrency = 20

frgCorrConcurrency = 20
ovlCorrConcurrency = 20
cnsConcurrency = 20

nohup fastqToCA -libraryname MATED -reads /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &

nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &

Best Regards,
R

Related

Support Requests: #21

Discussion

  • Sergey Koren

    Sergey Koren - 2015-07-20

    Hi,

    I replied to your previous email from June. Basically, if there is no short read coverage of the long reads, they will be split because the hybrid correction relies on the short reads to generate the corrected read consensus. It seems from your results that most PacBio reads do not have short reads mapping to them. If you’re not already, I’d use at least CA 8.1 and follow the instructions here:
    http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29 http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29

    This uses BLASR instead of the default overlapper which will work better for hybrid correction (you have to have BLASR/smrtportal installed and in your path). That said, the hybrid pipeline is quite slow and has not been updated since CA 8.1. I would instead recommend using just PacBio data for correction and assembly. The default settings will work fine if you have at least 30X but it looks like you have closer to 10X (based on your #reads, average length, and genome size 170000*10000 / 175000000). You try the recommended parameters from the wiki page:
    http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
    However, having < 15X will significantly impact the quality of your assembly due to coverage limitations but you should still end up with usable corrected sequences without assembly. This mode also has the benefit that the pipeline will tailor it’s memory/resource usage to your system so you can specify an empty spec file (your spec file is relatively old and some of the options are no longer need to be explicitly specified).

    You can also try one of the alternate hybrid correction methods (ECTools, LORDEC, proovread) but with 10X you would likely not be able to generate a good assembly from just the PacBio data.

    Sergey

    On Jul 15, 2015, at 10:47 PM, RAKESH rakeshponn@users.sf.net wrote:

    [support-requests:#21] http://sourceforge.net/p/wgs-assembler/support-requests/21 PBCR reads split after error correction

    Status: unread
    Group: v1.0_(example)
    Created: Thu Jul 16, 2015 02:46 AM UTC by RAKESH
    Last Updated: Thu Jul 16, 2015 02:46 AM UTC
    Owner: nobody

    Hello,

    I've attempted to use PBCR to error correct pacbio reads ( total reads = 170,000) with illumina short reads (35M, read length-100, single end). the results seem puzzling.

    I’ve plotted read length for Pacbio reads before and after error correction.It appears that during the error correction process, PBCR removes a lot of long reads and error correct only short reads. Also after error correction the number of reads have increased and the mean read length have dropped from 10kb to 350bp? Could you advise why would this may have happened. Any configuration changes I missed on my end. It appears that long reads are split in to shorter reads and reads longer than 5 kb are discarded and not error corrected.

    My understanding is long reads are error corrected by shorter reads. So the read length of pacbio reads should be the same with nearly the same read numbers? My concern is many long reads have been deleted after error correction. I don’t think it makes sense to move forward for me with assembly using long reads just because of shorter read length.

    Below is the command line I have used and pacbio.spec.

    stopAfter=overlapper

    original asm settings

    utgErrorRate = 0.25
    utgErrorLimit = 4.5

    cnsErrorRate = 0.25
    cgwErrorRate = 0.25
    ovlErrorRate = 0.25

    merSize=14

    merylMemory = 150000
    merylThreads = 20

    ovlStoreMemory = 150000

    grid info

    useGrid = 0
    scriptOnGrid = 0
    frgCorrOnGrid = 0
    ovlCorrOnGrid = 0

    sge = -V -A assembly
    sgeScript = -pe threads 20
    sgeConsensus = -pe threads 20
    sgeOverlap = -pe threads 20
    sgeFragmentCorrection = -pe threads 20
    sgeOverlapCorrection = -pe threads 20

    ovlMemory=8GB --hashload 0.7

    ovlHashBits = 26
    ovlThreads = 20

    for mer overlapper

    merCompression = 1
    merOverlapperSeedBatchSize = 200000
    merOverlapperExtendBatchSize = 750000

    frgCorrThreads = 20
    frgCorrBatchSize = 200000
    ovlCorrBatchSize = 500000

    non-Grid settings, if you set useGrid to 0 above these will be used

    merylMemory = 150000
    merylThreads = 20

    ovlConcurrency = 20

    cnsConcurrency = 20

    merOverlapperThreads = 20
    merOverlapperSeedConcurrency = 20
    merOverlapperExtendConcurrency = 20

    frgCorrConcurrency = 20
    ovlCorrConcurrency = 20
    cnsConcurrency = 20

    nohup fastqToCA -libraryname MATED -reads /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &

    nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &

    Best Regards,
    R

    Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/wgs-assembler/support-requests/21/ https://sourceforge.net/p/wgs-assembler/support-requests/21
    To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/ https://sourceforge.net/auth/subscriptions

     

    Related

    Support Requests: #21

    • RAKESH

      RAKESH - 2015-07-21

      Thanks Sergey. Appreciate your response. Would give a try with other
      assemblers and see what I get.
      Regards,
      RP

      On Mon, Jul 20, 2015 at 3:41 PM, Sergey Koren skoren@users.sf.net wrote:

      Hi,

      I replied to your previous email from June. Basically, if there is no
      short read coverage of the long reads, they will be split because the
      hybrid correction relies on the short reads to generate the corrected read
      consensus. It seems from your results that most PacBio reads do not have
      short reads mapping to them. If you’re not already, I’d use at least CA 8.1
      and follow the instructions here:

      http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29
      http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Correcting_Large_.28.3E_100Mbp.29_Genomes_.28Using_high-identity_data_or_CA_8.1.29

      This uses BLASR instead of the default overlapper which will work better
      for hybrid correction (you have to have BLASR/smrtportal installed and in
      your path). That said, the hybrid pipeline is quite slow and has not been
      updated since CA 8.1. I would instead recommend using just PacBio data for
      correction and assembly. The default settings will work fine if you have at
      least 30X but it looks like you have closer to 10X (based on your #reads,
      average length, and genome size 170000*10000 / 175000000). You try the
      recommended parameters from the wiki page:

      http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
      http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Low_Coverage_Assembly
      However, having < 15X will significantly impact the quality of your
      assembly due to coverage limitations but you should still end up with
      usable corrected sequences without assembly. This mode also has the benefit
      that the pipeline will tailor it’s memory/resource usage to your system so
      you can specify an empty spec file (your spec file is relatively old and
      some of the options are no longer need to be explicitly specified).

      You can also try one of the alternate hybrid correction methods (ECTools,
      LORDEC, proovread) but with 10X you would likely not be able to generate a
      good assembly from just the PacBio data.

      Sergey

      On Jul 15, 2015, at 10:47 PM, RAKESH rakeshponn@users.sf.net wrote:

      [support-requests:#21]
      http://sourceforge.net/p/brlcad/support-requests/21
      http://sourceforge.net/p/wgs-assembler/support-requests/21 PBCR reads
      split after error correction

      Status: unread
      Group: v1.0_(example)
      Created: Thu Jul 16, 2015 02:46 AM UTC by RAKESH
      Last Updated: Thu Jul 16, 2015 02:46 AM UTC
      Owner: nobody

      Hello,

      I've attempted to use PBCR to error correct pacbio reads ( total reads =
      170,000) with illumina short reads (35M, read length-100, single end). the
      results seem puzzling.

      I’ve plotted read length for Pacbio reads before and after error
      correction.It appears that during the error correction process, PBCR
      removes a lot of long reads and error correct only short reads. Also after
      error correction the number of reads have increased and the mean read
      length have dropped from 10kb to 350bp? Could you advise why would this may
      have happened. Any configuration changes I missed on my end. It appears
      that long reads are split in to shorter reads and reads longer than 5 kb
      are discarded and not error corrected.

      My understanding is long reads are error corrected by shorter reads. So
      the read length of pacbio reads should be the same with nearly the same
      read numbers? My concern is many long reads have been deleted after error
      correction. I don’t think it makes sense to move forward for me with
      assembly using long reads just because of shorter read length.

      Below is the command line I have used and pacbio.spec.

      stopAfter=overlapper

      original asm settings

      utgErrorRate = 0.25
      utgErrorLimit = 4.5

      cnsErrorRate = 0.25
      cgwErrorRate = 0.25
      ovlErrorRate = 0.25

      merSize=14

      merylMemory = 150000
      merylThreads = 20

      ovlStoreMemory = 150000

      grid info

      useGrid = 0
      scriptOnGrid = 0
      frgCorrOnGrid = 0
      ovlCorrOnGrid = 0

      sge = -V -A assembly
      sgeScript = -pe threads 20
      sgeConsensus = -pe threads 20
      sgeOverlap = -pe threads 20
      sgeFragmentCorrection = -pe threads 20
      sgeOverlapCorrection = -pe threads 20

      ovlMemory=8GB --hashload 0.7

      ovlHashBits = 26
      ovlThreads = 20

      for mer overlapper

      merCompression = 1
      merOverlapperSeedBatchSize = 200000
      merOverlapperExtendBatchSize = 750000

      frgCorrThreads = 20
      frgCorrBatchSize = 200000
      ovlCorrBatchSize = 500000

      non-Grid settings, if you set useGrid to 0 above these will be used

      merylMemory = 150000
      merylThreads = 20

      ovlConcurrency = 20

      cnsConcurrency = 20

      merOverlapperThreads = 20
      merOverlapperSeedConcurrency = 20
      merOverlapperExtendConcurrency = 20

      frgCorrConcurrency = 20
      ovlCorrConcurrency = 20
      cnsConcurrency = 20

      nohup fastqToCA -libraryname MATED -reads
      /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &

      nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq
      chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &

      Best Regards,
      R

      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/wgs-assembler/support-requests/21/
      https://sourceforge.net/p/wgs-assembler/support-requests/21
      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/
      https://sourceforge.net/auth/subscriptions


      Status: unread
      Group: v1.0_(example)
      Created: Thu Jul 16, 2015 02:46 AM UTC by RAKESH
      Last Updated: Thu Jul 16, 2015 02:47 AM UTC
      Owner: nobody

      Hello,

      I've attempted to use PBCR to error correct pacbio reads ( total reads =
      170,000) with illumina short reads (35M, read length-100, single end). the
      results seem puzzling.

      I’ve plotted read length for Pacbio reads before and after error
      correction.It appears that during the error correction process, PBCR
      removes a lot of long reads and error correct only short reads. Also after
      error correction the number of reads have increased and the mean read
      length have dropped from 10kb to 350bp? Could you advise why would this may
      have happened. Any configuration changes I missed on my end. It appears
      that long reads are split in to shorter reads and reads longer than 5 kb
      are discarded and not error corrected.

      My understanding is long reads are error corrected by shorter reads. So
      the read length of pacbio reads should be the same with nearly the same
      read numbers? My concern is many long reads have been deleted after error
      correction. I don’t think it makes sense to move forward for me with
      assembly using long reads just because of shorter read length.

      Below is the command line I have used and pacbio.spec.

      stopAfter=overlapper
      original asm settings

      utgErrorRate = 0.25
      utgErrorLimit = 4.5

      cnsErrorRate = 0.25
      cgwErrorRate = 0.25
      ovlErrorRate = 0.25

      merSize=14

      merylMemory = 150000
      merylThreads = 20

      ovlStoreMemory = 150000
      grid info

      useGrid = 0
      scriptOnGrid = 0
      frgCorrOnGrid = 0
      ovlCorrOnGrid = 0

      sge = -V -A assembly
      sgeScript = -pe threads 20
      sgeConsensus = -pe threads 20
      sgeOverlap = -pe threads 20
      sgeFragmentCorrection = -pe threads 20
      sgeOverlapCorrection = -pe threads 20
      ovlMemory=8GB --hashload 0.7

      ovlHashBits = 26
      ovlThreads = 20
      for mer overlapper

      merCompression = 1
      merOverlapperSeedBatchSize = 200000
      merOverlapperExtendBatchSize = 750000

      frgCorrThreads = 20
      frgCorrBatchSize = 200000
      ovlCorrBatchSize = 500000
      non-Grid settings, if you set useGrid to 0 above these will be used

      merylMemory = 150000
      merylThreads = 20

      ovlConcurrency = 20

      cnsConcurrency = 20

      merOverlapperThreads = 20
      merOverlapperSeedConcurrency = 20
      merOverlapperExtendConcurrency = 20

      frgCorrConcurrency = 20
      ovlCorrConcurrency = 20
      cnsConcurrency = 20

      nohup fastqToCA -libraryname MATED -reads
      /chr1_shortreads_unmapped_shortreads.fastq > illumina.frg &

      nohup PBcR -length 500 -partitions 200 -l test -s pacbio.spec -fastq
      chr1_longreads_unmapped_longreads.fastq genomeSize=175000000 illumina.frg &

      Best Regards,
      R


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/wgs-assembler/support-requests/21/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

       

      Related

      Support Requests: #21

  • Sergey Koren

    Sergey Koren - 2015-11-16
    • status: unread --> closed
     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.