#335 Run PBcR meryl failure

meryl
open
nobody
5
2015-12-10
2015-12-07
No

Hi!

I have an 800 Mb genome to assemble, with 13x coverage in PacBio reads and around 180x coverage in Illumina reads (PE, single and MP). I have a cluster with 32 nodes and 512 GB of RAM, and I was wondering what the best parameters would be for running PBcR with the Illumina reads for correction. It also seems to me that meryl is failing because of memory problems. I set it up to use all the memory I have available, like this:

ovlMemory = 512
ovlStoreMemory= 512000
merylMemory = 512000
merylThreads = 32

I don't know if this is correct, though.

Thanks a lot for the help!

Discussion

  • Marcela Uliano da Silva

    I'm attaching my meryl.err file here. It really looks like a memory error, right? Can you help me with which parameters to adapt in order to get PBcR to complete? My specifications are as follows:

    ovlMemory = 512000
    ovlStoreMemory= 512000
    merylMemory = 512000
    merylThreads = 32
    coverageCutoff = 60

    -genomeSize=800000000

     
  • Brian Walenz

    Brian Walenz - 2015-12-07

    That doesn't look like a memory error, but I'm not sure why it failed. It's just trying to read a sequence from the disk data store. Dropping memory limits by 10% won't hurt performance, and will be nice to the machine.

    Before you get too far into this process, stop. I can't recommend using this algorithm for correction with Illumina data. That aspect hasn't been maintained for several years, and it has trouble with large, complex genomes. Look into ECtools or proovread instead. ECtools assembles the Illumina reads and uses that assembly for correction. Proovread sounds similar, but I haven't looked into it.

    Once you get corrected reads I'd suggest assembling with CA's replacement, canu (https://github.com/marbl/canu).
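
The suggestion above to drop the memory limits by 10% can be worked out with a few lines of Python. This is only a minimal sketch: the parameter names and values are copied from the spec posted earlier in this thread, the units are left exactly as posted, and `scale_limits` is a hypothetical helper, not part of PBcR.

```python
# Hypothetical helper (not part of PBcR): reduce each memory limit by a
# given percentage, using integer arithmetic to avoid rounding surprises.
def scale_limits(limits, reduce_percent=10):
    keep = 100 - reduce_percent
    return {name: value * keep // 100 for name, value in limits.items()}


# Values as posted in this thread; units are whatever the spec file expects.
requested = {
    "ovlMemory": 512000,
    "ovlStoreMemory": 512000,
    "merylMemory": 512000,
}

reduced = scale_limits(requested)

# Print the result in the same "name = value" form used in the spec.
for name, value in reduced.items():
    print(f"{name} = {value}")
```

The printed lines can then be compared against the spec file before any value is changed.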

     
    • Marcela Uliano da Silva

      Hi Brian! Thanks a lot for your answer. OK, I understand your point. I have,
      in fact, been correcting the PacBio data with proovread. But I'm also
      running PBcR for self-correction, although I only have 13x genome
      coverage in PacBio data. I would like your advice on something else, if I
      may: this is 21 Gb of data, around 1.5 million subreads with an average
      length of 6 kb. I have PBcR running on a machine with 24 cores and 72 GB
      of RAM now, and it has been running for 7 days.

      1) Could you send me any information about the temporary files it
      creates?
      2) I know it's hard, but do you have any estimate of how long it's
      going to take to run?

      I would just like to estimate how long it's going to take: it's running
      on a shared cluster and I gave it a total of 25 days to run. I don't
      want it to reach that limit without finishing!

      Right now it's running "runPartition.sh 150" and creating
      .tmp.m5, tmp.cns.fasta and *.tmp.aln.fasta files!

      Thank you so much for your help!!



  • Marcela Uliano da Silva

    Hey Brian, my job finished! Thanks for your help!

    So, as an estimate: PBcR self-correction of 1 million PacBio reads (average length 6 kb) took 7 days on a cluster node with 24 cores and 72 GB of RAM.

    Thank you!
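
For anyone budgeting cluster time from the figure reported above, here is a rough sketch of the arithmetic. It assumes run time scales linearly with read count and core count, which is an assumption made here for illustration and not something measured in this thread; overlap-based correction may well scale worse than linearly with the number of reads. The function name `estimate_days` is hypothetical.

```python
# Figures reported in this thread.
READS = 1_000_000   # PacBio reads corrected
DAYS = 7            # wall-clock days the run took
CORES = 24          # cores on the machine

# Observed throughput per core per day.
reads_per_core_day = READS / (DAYS * CORES)


def estimate_days(n_reads, n_cores, rate=reads_per_core_day):
    """Naive linear estimate of wall-clock days (assumption, see above)."""
    return n_reads / (n_cores * rate)


print(f"Observed: about {reads_per_core_day:.0f} reads per core per day")
print(f"2 million reads on 24 cores: about {estimate_days(2_000_000, 24):.1f} days")
```

Treat the result as a lower bound when requesting wall time, and leave a generous margin.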

     
