#323 Fragment correction job 0001 failed.

Component: overlapper
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-09-09
Created: 2015-09-01
Private: No

I am running PBcR (rc2) on SGE. Following the overlapper stage of CA, I get a segfault:

$ head -n 20 3-overlapcorrection/.err

gkpStore = '/pub/mchakrab/A4/pbcr_assembly/mel/asm.gkpStore'

Starting at Tue Sep 1 15:19:04 2015

gkpStore = '/pub/mchakrab/A4/pbcr_assembly/mel/asm.gkpStore'

Starting at Tue Sep 1 15:19:04 2015

Starting Read_Frags ()
Starting Read_Frags ()
Read_Frags - at 0
Read_Frags - at 0
Starting Read_Olaps ()
Before sort 1370123 overlaps
Before Stream_Old_Frags Num_Olaps = 1370123

Using 16 pthreads (new version)

Starting Read_Olaps ()
Before sort 5400317 overlaps
Before Stream_Old_Frags Num_Olaps = 5400317

Using 16 pthreads (new version)

Extracted 99596 of 99596 fragments in iid range 1 .. 100000

Failed with 'Segmentation fault'

Any idea what's causing this? I ran rc2 a couple of days ago with a subset of the same dataset, and it ran fine.

Discussion

  • Mahul Chakraborty

    I don't know if this information will help, but I was able to replicate the same error with an independent dataset that was successfully assembled with rc1.

  • Brian Walenz - 2015-09-04

    That sounds like https://sourceforge.net/p/wgs-assembler/bugs/301/ (also from you) but that was fixed a long time ago, and should be resolved in rc2.

    Do you have the 'stack trace' from the error log?

    • Mahul Chakraborty

      Hi Brian,
      I also thought it was the same issue. After getting the segfault several
      times (I repeated the run to see if the issue was reproducible), I used
      the fix you provided last time. However, this time it didn't work. I am
      attaching the entire .err file from the 3-overlapcorrection folder. Is
      that what you wanted?
      Thanks.
      Mahul


      Last edit: Brian Walenz 2015-09-04
  • Brian Walenz - 2015-09-04

    Dang. That's the signature of exceeding the bounds of an array. Those are hard to find without access to the running program.

    Let's try to step around it. Try decreasing frgCorrBatchSize to 75000; alternatively, try increasing it to 150000 (careful of memory usage, though). Remove the 3-overlapcorrection directory first, otherwise runCA may reuse the existing shell script without resetting the batch size.
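
    As a rough sketch of that retry (the spec file and assembly directory are whatever your run already uses; nothing here is specific to this dataset):

        # in the spec file passed to runCA/PBcR, set one of:
        #   frgCorrBatchSize = 75000
        #   frgCorrBatchSize = 150000    # larger batches need more memory
        # then, from the assembly directory, clear the old scripts so the
        # batch size is re-planned on the next run:
        rm -rf 3-overlapcorrection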

    Is this data you can share, and is the gkpStore + ovlStore small enough to share? Happy to try debugging it.

    • Mahul Chakraborty

      Hi Brian,

      Ah, I see. I can certainly share the data with you. Do you want the
      fastq file and the gkpStore + ovlStore?
      Thanks,
      Mahul


      Last edit: Brian Walenz 2015-09-07
      • Mahul Chakraborty

        Hi Brian,

        Setting frgCorrBatchSize=75000 or 150000 did not work. Here is the link
        to the data:

        http://hpc.oit.uci.edu/~mchakrab/for_Brian.tar.gz

        The fastq file is the sequence file. Let me know if you are unable to
        download the file. Hopefully you'll be able to obtain more information
        about the issue.
        Best,
        Mahul


        Last edit: Brian Walenz 2015-09-07
  • Brian Walenz - 2015-09-07

    Great! I set up an FTP site for you to upload the data, then got sidetracked and never sent you a link. Data retrieved!

    I can't (yet) reproduce a crash. I'm also more than a little confused by the 'err' file you posted earlier. It is showing both a seg fault AND successful termination ("Finished" near the end of the file). There seem to be two jobs writing to the same log file.

    Can you post the frgcorr.sh (I think that's what it's called) script that is running these?

    • Mahul Chakraborty

      It's interesting that the pipeline has gone past the 3-overlapcorrection
      stage for you. I have attached the frgcorr script.


      Last edit: Brian Walenz 2015-09-09
  • Brian Walenz - 2015-09-09

    I can't make it crash. I tried both rc2 and the latest code in svn.

    The log shows two runs: one doing 100,000 reads that crashes, and one doing ~25,000 reads that works. The larger run uses about 34 GB of RAM. I'm wondering if you're just running out of memory.

    Options now (a spec sketch follows the list):

    1) Use a batch size of 25000, which uses about 10 GB of memory.

    2) Disable this with doFragmentCorrection=0.

    3) Recompile the assembler with debug symbols and rerun. The debug symbols should annotate the crash report with the line the code fails on; maybe that will give enough of a clue to find the problem.
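
    For options 1 and 2, a sketch of the spec changes (only these two option names come from the list above; the rest of your spec stays as it is):

        # option 1: smaller fragment-correction batches (~10 GB each)
        frgCorrBatchSize = 25000

        # option 2 (alternative): skip fragment correction entirely
        # doFragmentCorrection = 0

        # as before, remove the 3-overlapcorrection directory before
        # restarting so the new setting takes effect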

    • Mahul Chakraborty

      Hi Brian,
      Thanks for the pointers. Our nodes have 256-512 GB of RAM, and these jobs
      were the only ones running on a node.
      1) I am running a job with a 25000 batch size to rule out the memory
      problem.
      2) Will this not affect the quality of the assembly?
      3) Does the makefile for CA already have the -g option added? If not,
      where do I add it (it has to be passed via CFLAGS, right?)?

      PS: Did your run go all the way to 9-terminator? If it did, would you
      mind sharing the asm.ctg.fasta?


      Last edit: Brian Walenz 2015-09-09
      • Mahul Chakraborty

        Quick update: setting the batch size to 25000 seems to have fixed the
        issue. I will keep you posted on how it goes.


        Last edit: Brian Walenz 2015-09-09
        • Mahul Chakraborty

          The pipeline went to completion :) So it seems memory usage was the issue.
          Interesting.


          Last edit: Brian Walenz 2015-09-09
  • Brian Walenz - 2015-09-09

    Based on that, I'd say that something is imposing a memory limit on your jobs. This could be a recent change at your site, or it could be that the latest code is using more memory. Submitting a job with just "ulimit -a" will report the limits (or add this to the start of any of the assembler shell scripts).
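
    A minimal sketch of that probe (the script name checklimits.sh and the SGE directives are just for illustration; add whatever queue or resource requests your site requires):

        #!/bin/sh
        #$ -cwd
        #$ -j y
        # report the per-process limits in effect for batch jobs
        ulimit -a

    Submit it with "qsub checklimits.sh" and compare the reported memory limits against the ~34 GB the 100,000-read batch needs (the exact field names in the ulimit output vary a bit by shell).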

    I wasn't running this under runCA control; I was just running the correct-frags command directly.

    To answer the questions:

    2) Not clear how much assembly quality will be affected. I think not much when long reads and/or deep coverage are used.

    3) gmake BUILDDEBUG=1. The kmer component doesn't need to be recompiled, just the assembler proper (in src/). Be sure to remove all of Linux-amd64 before building!
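
    A sketch of that rebuild, assuming the usual wgs checkout layout (kmer/ and src/ side by side, with the optimized build output in Linux-amd64/ next to them); adjust paths to your tree:

        cd wgs/src                  # assembler sources
        rm -rf ../Linux-amd64       # clear the previous optimized build
        gmake BUILDDEBUG=1          # rebuild with debug symbols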

