#323 Fragment correction job 0001 failed.

Component: overlapper
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-09-09
Created: 2015-09-01
Private: No

I am running PBcR (rc2) on SGE. Following the overlapper stage of CA, I get a segfault:

$ head -n 20 3-overlapcorrection/.err

gkpStore = '/pub/mchakrab/A4/pbcr_assembly/mel/asm.gkpStore'

Starting at Tue Sep 1 15:19:04 2015

gkpStore = '/pub/mchakrab/A4/pbcr_assembly/mel/asm.gkpStore'

Starting at Tue Sep 1 15:19:04 2015

Starting Read_Frags ()
Starting Read_Frags ()
Read_Frags - at 0
Read_Frags - at 0
Starting Read_Olaps ()
Before sort 1370123 overlaps
Before Stream_Old_Frags Num_Olaps = 1370123

Using 16 pthreads (new version)

Starting Read_Olaps ()
Before sort 5400317 overlaps
Before Stream_Old_Frags Num_Olaps = 5400317

Using 16 pthreads (new version)

Extracted 99596 of 99596 fragments in iid range 1 .. 100000

Failed with 'Segmentation fault'

Any idea what's causing this? I ran rc2 a couple of days ago with a subset of the same dataset, and it ran fine.

Discussion

  • Mahul Chakraborty

    I don't know if this information will help, but I was able to replicate the same error with an independent dataset that was successfully assembled with rc1.

  • Brian Walenz - 2015-09-04

    That sounds like https://sourceforge.net/p/wgs-assembler/bugs/301/ (also from you) but that was fixed a long time ago, and should be resolved in rc2.

    Do you have the 'stack trace' from the error log?

    • Mahul Chakraborty

      Hi Brian,
      I also thought it was the same issue. After getting the segfault several
      times (I repeated the run to see if the issue was reproducible), I used
      the fix you provided last time. However, this time it didn't work. I am
      attaching the entire .err file from the 3-overlapcorrection folder. Is
      that what you wanted?
      Thanks.
      Mahul


      Last edit: Brian Walenz 2015-09-04
  • Brian Walenz - 2015-09-04

    Dang. That's the signature of exceeding the bounds of an array. Those are hard to find without access to the running program.

    Let's try to step around it. Try decreasing frgCorrBatchSize to 75000; alternatively, try increasing it to 150000 (careful of memory usage, though). Remove the 3-overlapcorrection directory first, otherwise runCA may reuse the existing shell script without resetting the batch size.
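
    As a rough sketch of that retry (the spec file and assembly directory are whatever your run already uses; nothing here is specific to this dataset):

        # in the spec file passed to runCA/PBcR, set one of:
        #   frgCorrBatchSize = 75000
        #   frgCorrBatchSize = 150000    # larger batches need more memory
        # then, from the assembly directory, clear the old scripts so the
        # batch size is re-planned on the next run:
        rm -rf 3-overlapcorrection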

    Is this data you can share, and is the gkpStore + ovlStore small enough to share? Happy to try debugging it.

    • Mahul Chakraborty

      Hi Brian,

      Ah, I see. I can certainly share the data with you. Do you want the
      fastq file and the gkpStore + ovlStore?
      Thanks,
      Mahul


      Last edit: Brian Walenz 2015-09-07
      • Mahul Chakraborty

        Hi Brian,

        Setting frgCorrBatchSize=75000 or 150000 did not work. Here is the link
        to the data:

        http://hpc.oit.uci.edu/~mchakrab/for_Brian.tar.gz

        The fastq file is the sequence file. Let me know if you are unable to
        download the file. Hopefully you'll be able to obtain more information
        about the issue.
        Best,
        Mahul


        Last edit: Brian Walenz 2015-09-07
  • Brian Walenz - 2015-09-07

    Great! I set up an FTP site for you to upload the data, then got sidetracked and never sent you a link. Data retrieved!

    I can't (yet) reproduce a crash. I'm also more than a little confused by the 'err' file you posted earlier. It is showing both a seg fault AND successful termination ("Finished" near the end of the file). There seem to be two jobs writing to the same log file.

    Can you post the frgcorr.sh (I think that's what it's called) script that is running these?

    • Mahul Chakraborty

      It's interesting that the pipeline has gone past the 3-overlapcorrection
      stage for you. I have attached the frgcorr script.


      Last edit: Brian Walenz 2015-09-09
  • Brian Walenz - 2015-09-09

    I can't make it crash. I tried both rc2 and the latest code in svn.

    The log shows two runs: one doing 100,000 reads that crashes, and one doing ~25,000 reads that works. The larger run uses about 34 GB of RAM. I'm wondering if you're just running out of memory.

    Options now (a spec sketch follows the list):

    1) Use a batch size of 25000, which uses about 10 GB of memory.

    2) Disable this with doFragmentCorrection=0.

    3) Recompile the assembler with debug symbols and rerun. The debug symbols should annotate the crash report with the line the code fails on; maybe that will give enough of a clue to find the problem.
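
    For options 1 and 2, a sketch of the spec changes (only these two option names come from the list above; the rest of your spec stays as it is):

        # option 1: smaller fragment-correction batches (~10 GB each)
        frgCorrBatchSize = 25000

        # option 2 (alternative): skip fragment correction entirely
        # doFragmentCorrection = 0

        # as before, remove the 3-overlapcorrection directory before
        # restarting so the new setting takes effect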

    • Mahul Chakraborty

      Hi Brian,
      Thanks for the pointers. Our nodes have 256-512 GB of RAM, and these jobs
      were the only ones running on a node.
      1) I am running a job with a 25000 batch size to rule out the memory
      problem.
      2) Will this not affect the quality of the assembly?
      3) Does the makefile for CA already have the -g option added? If not,
      where do I add it (it has to be passed via CFLAGS, right?)?

      PS: Did your run go all the way to 9-terminator? If it did, would you
      mind sharing the asm.ctg.fasta?


      Last edit: Brian Walenz 2015-09-09
      • Mahul Chakraborty

        Quick update: setting the batch size to 25000 seems to have fixed the
        issue. I will keep you posted on how it goes.


        Last edit: Brian Walenz 2015-09-09
        • Mahul Chakraborty

          The pipeline went to completion :) So it seems memory usage was the issue.
          Interesting.


          Last edit: Brian Walenz 2015-09-09
  • Brian Walenz - 2015-09-09

    Based on that, I'd say that something is imposing a memory limit on your jobs. This could be a recent change at your site, or it could be that the latest code is using more memory. Submitting a job with just "ulimit -a" will report the limits (or add this to the start of any of the assembler shell scripts).
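
    A minimal sketch of that probe (the script name checklimits.sh and the SGE directives are just for illustration; add whatever queue or resource requests your site requires):

        #!/bin/sh
        #$ -cwd
        #$ -j y
        # report the per-process limits in effect for batch jobs
        ulimit -a

    Submit it with "qsub checklimits.sh" and compare the reported memory limits against the ~34 GB the 100,000-read batch needs (the exact field names in the ulimit output vary a bit by shell).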

    I wasn't running this under runCA control; I was just running the correct-frags command directly.

    To answer the questions:

    2) Not clear how much assembly quality will be affected. I think not much when long reads and/or deep coverage are used.

    3) gmake BUILDDEBUG=1. The kmer component doesn't need to be recompiled, just the assembler proper (in src/). Be sure to remove all of Linux-amd64 before building!
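
    A sketch of that rebuild, assuming the usual wgs checkout layout (kmer/ and src/ side by side, with the optimized build output in Linux-amd64/ next to them); adjust paths to your tree:

        cd wgs/src                  # assembler sources
        rm -rf ../Linux-amd64       # clear the previous optimized build
        gmake BUILDDEBUG=1          # rebuild with debug symbols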

