Whole-Genome Shotgun Assembler / Bugs / #335 Run PBcR meryl failure

Marcela Uliano da Silva - 2015-12-07

I'm attaching my meryl.err file here. It really looks like a memory error, right? Can you help me work out which parameters to adjust so that PBcR completes? My specifications are as follows:

ovlMemory = 512000
ovlStoreMemory= 512000
merylMemory = 512000
merylThreads = 32
coverageCutoff = 60

-genomeSize=800000000

meryl.err


Brian Walenz - 2015-12-07

That doesn't look like a memory error, but I'm not sure why it failed. It's just trying to read a sequence from the disk data store. Dropping memory limits by 10% won't hurt performance, and will be nice to the machine.

Before you get too far into this process, stop. I can't recommend using this algorithm for correction with Illumina data. That aspect hasn't been maintained for several years, and it has trouble with large, complex genomes. Look into ECtools or proovread instead. ECtools assembles the Illumina reads and uses that assembly for correction. Proovread sounds similar, but I haven't looked into it.

Once you get corrected reads I'd suggest assembling with CA's replacement, canu (https://github.com/marbl/canu).
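
As a rough illustration of the advice above to drop the memory limits by about 10%, the sketch below scales the three memory settings from the posted spec. The helper `scale_memory_limits` is hypothetical and not part of PBcR, and the values are treated as plain numbers with no assumption about their units.

```python
# Hypothetical helper, not part of PBcR: lower the memory settings from the
# posted spec by a given percentage (10% here, as suggested above).
def scale_memory_limits(spec, percent_reduction=10):
    memory_keys = ("ovlMemory", "ovlStoreMemory", "merylMemory")
    scaled = dict(spec)
    for key in memory_keys:
        if key in scaled:
            # Integer arithmetic avoids floating-point rounding surprises.
            scaled[key] = scaled[key] * (100 - percent_reduction) // 100
    return scaled


# Values copied from the spec posted at the top of this ticket.
original = {
    "ovlMemory": 512000,
    "ovlStoreMemory": 512000,
    "merylMemory": 512000,
    "merylThreads": 32,
}

reduced = scale_memory_limits(original)
for key, value in reduced.items():
    print(f"{key} = {value}")
```

Only the memory keys are changed; the thread count is passed through untouched.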

- Marcela Uliano da Silva - 2015-12-09
  
  Hi Brian! Thanks a lot for your answer. OK, I understand your point. I
  have, in fact, been correcting the PacBio data with proovread. But I'm also
  running PBcR for self-correction, although I only have 13x genome
  coverage in PacBio data. I would like your advice on something else, if I
  may. This is 21 Gb of data, around 1.5 million subreads with an average
  size of 6 kb. I have PBcR running on a 24-core, 72 GB RAM machine now, and
  it has been running for 7 days.
  
  1) Could you send me any information about the temporary files it
  creates?
  2) I know it's hard, but do you have any estimate of how long it is going
  to take to run? It is running on a shared cluster and I gave it a total of
  25 days: I don't want it to reach that limit without finishing!
  
  Right now it is running "runPartition.sh 150" and creating
  .tmp.m5, tmp.cns.fasta, and *.tmp.aln.fasta files!
  
  Thank you so much for your help!!
  
  [bugs:#335] http://sourceforge.net/p/wgs-assembler/bugs/335/ Run PBcR
  meryl failure
  
  Status: open
  Group: meryl
  Labels: best configuration
  Created: Mon Dec 07, 2015 10:38 AM UTC by Marcela Uliano da Silva
  Last Updated: Mon Dec 07, 2015 02:19 PM UTC
  Owner: nobody
  
  Hi!
  
  I have an 800 Mb genome to assemble, with 13x coverage in PacBio reads and
  around 180x coverage in Illumina reads (PE, single and MP). I have a
  cluster with 32 nodes and 512 GB of RAM, and I was wondering what the best
  parametrization would be for running PBcR using the Illumina reads for
  correction. It seems to me that meryl is failing due to memory problems
  too. I set it up to use all the memory I have available, like this:
  
  ovlMemory = 512
  ovlStoreMemory= 512000
  merylMemory = 512000
  merylThreads = 32
  
  Don't know if this is correct, though.
  
  Thanks a lot for the help!
  
  --
  Msc Marcela Uliano da Silva
  
  PhD Student at Universidade Federal do Rio de Janeiro - Brazil
  Visiting researcher at Berlin Center for Genomics in Biodiversity Research
  (BeGenDiv)
  Botanischer Garten und Botanisches Museum Berlin-Dahlem
  Berlin - Germany
  CV: http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4261864A3
  website: http://improvisocientifico.blogspot.com.br/
  

Marcela Uliano da Silva - 2015-12-10

Hey Brian, my job finished! Thanks for your help!

So, as an estimate: PBcR self-correction of 1 million PacBio reads (average size 6 kb) took 7 days on a 24-core, 72 GB RAM cluster.
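
For anyone scaling this estimate to a different machine, the figures above can be converted to core-hours. This is a minimal sketch that assumes all 24 cores were busy for the full 7 days, which is an upper bound; the thread does not report actual utilization.

```python
# Figures quoted in the post above.
reads = 1_000_000
days = 7
cores = 24

# Assumption: every core was busy for the whole run (an upper bound).
core_hours = days * 24 * cores
reads_per_core_hour = reads / core_hours

print(f"core-hours used: {core_hours}")
print(f"reads corrected per core-hour: {reads_per_core_hour:.0f}")
```

Under that assumption the run cost about 4,000 core-hours, or roughly 250 reads per core-hour.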

Thank you!
