I am running PBcR (rc2) on SGE. After the overlapper stage of CA, I get a segfault:
$ head -n 20 3-overlapcorrection/.err
gkpStore = '/pub/mchakrab/A4/pbcr_assembly/mel/asm.gkpStore'
gkpStore = '/pub/mchakrab/A4/pbcr_assembly/mel/asm.gkpStore'
Starting Read_Frags ()
Starting Read_Frags ()
Read_Frags - at 0
Read_Frags - at 0
Starting Read_Olaps ()
Before sort 1370123 overlaps
Before Stream_Old_Frags Num_Olaps = 1370123
Starting Read_Olaps ()
Before sort 5400317 overlaps
Before Stream_Old_Frags Num_Olaps = 5400317
Extracted 99596 of 99596 fragments in iid range 1 .. 100000
Failed with 'Segmentation fault'
Any idea what's causing this? I ran rc2 a couple of days ago with a subset of the same dataset, and it ran fine.
I don't know if this information will help, but I was able to reproduce the same error with an independent dataset that was successfully assembled with rc1.
That sounds like https://sourceforge.net/p/wgs-assembler/bugs/301/ (also from you), but that was fixed a long time ago and should be resolved in rc2.
Do you have the 'stack trace' from the error log?
Hi Brian,
I also thought that it was the same issue. After getting the segfault
several times (I repeated the run to see if the issue was reproducible), I
used the fix you had provided last time. However, this time it didn't work.
I am attaching the entire .err file from the 3-overlapcorrection folder. Is that what you wanted?
Thanks.
Mahul
Dang. That's the signature of exceeding the bounds of an array. Those are hard to find without access to the running program.
Let's try to step around it. Try decreasing frgCorrBatchSize to 75000. Alternatively, try increasing it to 150000 (careful of memory usage, though). Remove the 3-overlapcorrection directory first; otherwise runCA may reuse the existing shell script without resetting the batch size.
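Concretely, something like this (a sketch; the directory and prefix below are taken from your log, and asm.spec is a placeholder for whatever spec file you are actually using):

$ cd /pub/mchakrab/A4/pbcr_assembly/mel
$ rm -rf 3-overlapcorrection
$ runCA -d . -p asm -s asm.spec frgCorrBatchSize=75000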
Is this data you can share, and is the gkpStore + ovlStore small enough to share? Happy to try debugging it.
Hi Brian,
Ahh, I see. I can certainly share the data with you. Do you want the fastq file and the gkpStore + ovlStore?
Thanks,
Mahul
Hi Brian,
Setting frgCorrBatchSize=75000 or 150000 did not work. Here is the link to the data:
http://hpc.oit.uci.edu/~mchakrab/for_Brian.tar.gz
The fastq file is the sequence file. Let me know if you are unable to download the file. Hopefully you'll be able to obtain more information about the issue.
Best,
Mahul
Great! I set up an FTP site for you to upload the data, then got sidetracked and never sent you a link. Data retrieved!
I can't (yet) reproduce a crash. I'm also more than a little confused by the 'err' file you posted earlier. It is showing both a seg fault AND successful termination ("Finished" near the end of the file). There seem to be two jobs writing to the same log file.
Can you post the frgcorr.sh (I think that's what it's called) script that is running these?
It's interesting that the pipeline has gone past the 3-overlapcorrection stage for you. I have attached the frgcorr script.
I can't make it crash. I tried both rc2 and the latest code in svn.
The log shows two runs, one doing 100,000 reads that crashes, and one doing ~25,000 reads that works. The larger run is using about 34gb of ram. I'm wondering if you're just running out of memory.
Options now (spec-file settings for the first two are sketched below):
1) Use a batch size of 25000, which uses about 10gb memory.
2) Disable this with doFragmentCorrection=0.
3) Recompile the assembler with debug symbols, rerun. The debug symbols should annotate the crash report with the line that the code fails on. Maybe this will give enough of a clue to find the problem.
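In spec-file terms, the first two options are one-line settings (a sketch; add one or the other to the spec file your runCA invocation already reads, and remove 3-overlapcorrection before rerunning, as before):

# Option 1: smaller correction batches, about 10gb of memory per job.
frgCorrBatchSize=25000

# Option 2: skip fragment correction entirely.
doFragmentCorrection=0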
Hi Brian,
Thanks for the pointers. Our nodes have 256-512g RAM and these jobs were
the only ones running on a node.
1) I am running a job with a batch size of 25000 to rule out the memory problem.
2) Won't this affect the quality of the assembly?
3) Does the makefile for CA already have the -g option added? If not, where do I add it? (It has to be passed via CFLAGS, right?)
PS: Did your run go all the way to 9-terminator? If it did, would you mind sharing the asm.ctg.fasta?
Quick update: setting the batch size to 25000 seems to have fixed the issue. Will keep you posted on how it goes.
The pipeline went to completion :) So it seems memory usage was the issue.
Interesting.
Based on that, I'd say that something is imposing a memory limit on your jobs. This could be a recent change at your site, or it could be that the latest code is using more memory. Submitting a job with just "ulimit -a" will report the limits (or add this to the start of any of the assembler shell scripts).
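For example, a throwaway SGE job like this would show the limits a batch job actually sees ('limits' is just an arbitrary job name; -S /bin/sh makes sure ulimit, a Bourne-shell builtin, is available):

$ echo 'ulimit -a' | qsub -S /bin/sh -cwd -j y -N limits
$ cat limits.o*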
I wasn't running this under runCA control; I was just running the correct-frags command directly.
To answer the questions:
2) Not clear how much assembly quality will be affected. I think not much when long reads and/or deep coverage are used.
3) gmake BUILDDEBUG=1. The kmer component doesn't need to be recompiled, just the assembler proper (in src/). Be sure to remove all of Linux-amd64 before building!
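Roughly (a sketch; 'wgs' stands in for wherever your source checkout lives, with the Linux-amd64 build output next to src/):

$ cd wgs
$ rm -rf Linux-amd64       # remove ALL previous build output
$ cd src
$ gmake BUILDDEBUG=1       # rebuild the assembler with debug symbols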