Hello,
I tried to run the rc2 version on SGE and the pipeline quit (before the overlap jobs were submitted) with the following error -
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000001 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000002 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000003 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000004 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000005 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000006 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000007 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000008 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000009 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000010 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000011 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000012 FAILED.
12 sam conversion jobs failed.
Do you know what caused this?
Thank you.
Mahul
Hi,
Is this a new run or resuming an existing one? What is your command line and the full stdout/err output of the run?
Hi Sergey,
This was a new run. Here is the command line -
PBcR -l dtes_hq -s pacbio.spec -sensitive -noclean -fastq dtes_80_all.fq genomeSize=130000000 localStaging=/jje/tmp/
Here is the link to the entire stderr/stdout until the pipeline quits -
http://hpc.oit.uci.edu/~mchakrab/tes.err
FYI, the same command works fine with rc1.
Thanks,
Mahul
Last edit: Brian Walenz 2015-06-01
I verified that grid submission with local staging works on our local system, so I am not sure what would have changed between rc1 and rc2. One thing I noticed is that you're specifying that overlapper should use 128GB of RAM, but I didn't see a memory request in your qsub options. Is it possible the jobs are getting scheduled on machines with insufficient memory? Do you have the error logs from the 1-overlapper/*.err files? Those should have more information on why the jobs failed.
There are no 1-overlapper/*.err files, presumably because the overlapper did not run (see the last part of the error). Our nodes have 256GB and 512GB RAM, so I didn't think I would need to request 128GB RAM (also because rc1 worked with the same config). However, I can try rerunning the pipeline with an explicit request of 128GB RAM for overlapper.
Last edit: Brian Walenz 2015-06-01
Based on the output the overlapper was submitted to the grid:
qsub -q jje,free64,pub64,abio,bio -pe openmp 32 -ckpt blcr -l kernel=blcr -cwd -N "pBcR_ovl_asm_dtes_hq" -t 1-12 -j y -o /dev/null /share/jje/mchakrab/pacbio/dtes/pacbio_assembly//tempdtes_hq/1-overlapper/overlap.sh
So the err files should get created as soon as an overlap job launches on the grid. They could be missing if there was a scheduling error or another problem running the overlapper. You can try re-running the submission command (as long as you still have your temporary directory), modifying -o /dev/null to -o `pwd`/\$TASK_ID.out (assuming you have SGE) and checking the error output in those files.
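A sketch of that resubmission, assuming SGE and the queue names and paths from the log above (they are specific to this cluster); it only builds and prints the command so it can be reviewed before submitting:

```shell
# Rebuild the overlap array-job submission from the log, but keep
# per-task output instead of sending it to /dev/null.
OVLDIR=/share/jje/mchakrab/pacbio/dtes/pacbio_assembly/tempdtes_hq/1-overlapper
CMD="qsub -q jje,free64,pub64,abio,bio -pe openmp 32 -ckpt blcr -l kernel=blcr \
 -cwd -N pBcR_ovl_asm_dtes_hq -t 1-12 -j y \
 -o $OVLDIR/\$TASK_ID.out $OVLDIR/overlap.sh"
echo "$CMD"   # review, then run it from the assembly directory
```

SGE expands $TASK_ID per array task, so each of the 12 jobs writes its own .out file.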
Last edit: Sergey Koren 2015-06-01
OK, I submitted the job. There is a long queue wait on our cluster right now.
I'll send you an update when I get the results back.
Last edit: Brian Walenz 2015-06-01
A couple of things to verify:
1) Verify that 'qconf -sp openmp' has allocation_rule = $pe_slots.
2) Your memory setting for SGE needs to be 128/32 = 4g, since requests are per slot.
Change "-j y -o /dev/null" to "-j y -o somefile.sgeout" so we can get the full log output, including any issues starting the command itself. You might be able to qalter this and keep the job in queue.
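The per-slot arithmetic above, as a worked sketch (under a parallel environment with allocation_rule $pe_slots, SGE applies -l memory requests per slot; h_vmem is a common but site-specific complex name, so check with your admins):

```shell
# 128 GB total for overlapper, split across the 32 openmp slots:
TOTAL_GB=128
SLOTS=32
PER_SLOT_GB=$((TOTAL_GB / SLOTS))
echo "-l h_vmem=${PER_SLOT_GB}g"   # flag to add to the qsub options
```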
The issue is fixed. The path to java/1.8 was incorrect; the stdout/stderr revealed it.
By the way, I am seeing a new error in Celera Assembler (after the 5-consensus stage). This is repeatable. Here is the relevant stdout/stderr -
/dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/cgw \
 -j 1 -k 5 \
 -r 5 \
 -s 2 \
 -filter 1 \
 -minmergeweight 2 \
 -S 0 \
 -G \
 -z \
 -P 2 \
 -B 2572 \
 -shatter 0 \
 -missingMate 0 \
 -m 100 \
 -g /jje/mchakrab/pacbio/dpse/pse80/asm.gkpStore \
 -t /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore \
 -o /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm \
runCA failed.
Stack trace:
at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 1651
main::caFailure('scaffolder failed', '/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out') called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 5350
main::CGW('6-clonesize', undef, '/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore', 2, undef, 0) called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 5561
main::scaffolder() called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 6566
Last few lines of the relevant log file (/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out):
ERROR: Frag 320689 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 320775 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 321180 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 321628 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 323029 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 323159 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 323455 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 323552 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 324752 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 324969 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 325737 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 326154 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 326484 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 326489 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 327514 has null cid or CIid. Fragment is not in an input unitig!
ERROR: Frag 328743 has null cid or CIid. Fragment is not in an input unitig!
ERROR: 868 fragments are not in unitigs.
cgw: Input_CGW.C:426: void ProcessInput(int, int, char**): Assertion `numErrors == 0' failed.
Do you know what might be causing this? I can send you the 6-clonesize/cgw.out file or any other information that you think would be helpful.
Thanks.
Mahul
This is hinting that at least one unitig failed to generate consensus sequence. Look for errors in 5-consensus. Or, remove the *_???.err files and resubmit the consensus.sh script to the grid. It will only recompute unitigs with no consensus sequence.
I've seen some rare cases with large contigs (i.e. > 35Mbp) where the PacBio unitig consensus can use over 32GB of RAM. What I'd suggest is checking for files named tmp in the 5-consensus directory. If you have them, it indicates at least one job failed, most likely out of memory. You can remove those tmp files as well as the .success files corresponding to those partitions (i.e. if there is an asm_001.tmp.fasta then remove asm_001.success as well as consensus.success). Then, remove 5-consensus-* and 6-* and manually run the failed partitions (cd 5-consensus && consensus.sh <failed id>). Then, you can run the latest runCA*.sh job to resume the assembly.
I'd also suggest modifying your spec to request 64GB of RAM total for the consensus step, which should avoid this error in the future (that has been sufficient for contigs >80Mbp in my experience).
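The cleanup steps above, sketched as a script. The asm_NNN naming and the 5-consensus path follow this thread's run; treat it as a template, not a drop-in:

```shell
# Clear failed consensus partitions so they recompute on the next run.
CNSDIR=5-consensus
mkdir -p "$CNSDIR"                       # no-op on a real assembly
for tmp in "$CNSDIR"/asm_*.tmp*; do
    [ -e "$tmp" ] || continue            # no tmp files -> no failed jobs
    id=$(basename "$tmp" | sed 's/^asm_\([0-9]*\)\..*/\1/')
    echo "partition $id failed; clearing it for a rerun"
    rm -f "$tmp" "$CNSDIR/asm_${id}.success"
done
rm -f "$CNSDIR/consensus.success"
```

After this, rerun the cleared partitions by hand from 5-consensus, then resume with the latest runCA*.sh.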
Also, the java path can be explicitly set with the parameter javaPath= in your spec file. It will then use that java to run all the code, so you avoid having to make sure the right java is in your path.
Last edit: Sergey Koren 2015-06-03
So, indeed a consensus job had failed (but it did not have any contig >1 Mbp). I ran the job manually as Sergey suggested and it completed successfully. Then I restarted the runCA*.sh, but the error came back. I have attached the cgw.out file. Please let me know if you need any other information.
Thanks.
Last edit: Brian Walenz 2015-06-03
You have to follow the steps I listed to remove the 5-consensus/consensus.success file as well as the 5-consensus-* folders and the 6-clonesize folder. Without that, it will not update the latest version of the unitig store, causing the same error.
Last edit: Brian Walenz 2015-06-03
Here is what I did
$ rm -r 5-consensus-
$ rm -r 6-clonesize/
$ rm 5-consensus/consensus.success
$ bash pse80/runCA..sh
The error comes back.
Maybe I am missing something. Should I have deleted something else too?
Last edit: Brian Walenz 2015-06-03
The rm -r 5-consensus- is missing a wildcard on the end, it should be rm -r 5-consensus-*.
Last edit: Brian Walenz 2015-06-03
Oops, sorry! It was actually
$ rm -r 5-consensus-*
When I copied from the terminal, the wildcard probably didn't get selected.
Last edit: Brian Walenz 2015-06-03
Please paste the contents of the tigStore directory (ls -l) and the log from the one utgcns job that found a unitig to compute.
Is recomputing all unitig consensus sequences a possibility?
$ ls -l asm.tigStore/ | head
total 523956
-rw-r--r-- 1 mchakrab jje 20 May 31 11:06 seqDB.v001.ctg
-rw-r--r-- 1 mchakrab jje 93332 May 31 11:06 seqDB.v001.p001.dat
-rw-r--r-- 1 mchakrab jje 3192 May 31 11:06 seqDB.v001.p001.utg
-rw-r--r-- 1 mchakrab jje 358772 May 31 11:06 seqDB.v001.p002.dat
-rw-r--r-- 1 mchakrab jje 72 May 31 11:06 seqDB.v001.p002.utg
-rw-r--r-- 1 mchakrab jje 68216 May 31 11:06 seqDB.v001.p003.dat
-rw-r--r-- 1 mchakrab jje 1164 May 31 11:06 seqDB.v001.p003.utg
-rw-r--r-- 1 mchakrab jje 183900 May 31 11:06 seqDB.v001.p004.dat
-rw-r--r-- 1 mchakrab jje 2048 May 31 11:06 seqDB.v001.p004.utg
I didn't paste the entire list because it is really long. I can attach the list as a file if you want. Are you asking about the one unitig job that failed earlier but completed successfully later? If so, here it is:
Saving fixed unitigs to '/jje/mchakrab/pacbio/dpse/dpse_160/pse80/5-consensus/asm_066.fixes'; store NOT updated.
Loading reads into memory.
Checking unitig consensus for b=0 to e=8179
Consensus finished successfully. Bye.
I could recompute all unitig sequences. To do that, I will have to delete 4-unitigger, 5-consensus, asm.tigStore, and relaunch runCA, right?
Last edit: Brian Walenz 2015-06-03
I was going to say something similar to Brian. The asm.tigStore is versioned, so each step reads a fixed version it expects. The first time that version gets opened it is imported from the previous version. If, for some reason, the restart didn't clean up some files, a step can run and read an existing (outdated) version and not realize your consensus is now fixed. I think that is what's happening with your run. I think the cleanup list I gave missed one file; clean up as:
rm -rf 5-consensus-*
rm -rf 6-*
rm -rf 5-consensus/consensus.success
rm -rf 5-consensus/asm*.fixes (the one I missed)
You can also remove all entries from asm.tigStore that are past v2 (so .v001. and .v002. should remain but greater numbers can be erased).
If that continues to give you trouble, you can remove 5-consensus* and asm.tigStore folders which would force a recompute of all consensus. Make sure you add a memory request flag to your consensus submission command to make sure the jobs don’t run out of memory (since your unitigs are small, a few GB, say 8-10) should do.
Last edit: Brian Walenz 2015-06-03
Wiki markup is eating the asterisks.
I'll second the removal of asm.fixes and v00[345].
To rerun consensus, just remove 5-consensus and all the tigStore/v002 files - NOT the entire tigStore directory. In all cases, remove 6-clonesize (and set computeInsertSize=0).
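In command form, a sketch of that full recompute (run from the assembly directory; note it deletes only the v002 store files, never the whole asm.tigStore directory):

```shell
# Force a full consensus recompute: drop the consensus outputs but keep
# the v001 (unitigger) tigStore files.
rm -rf 5-consensus 5-consensus-* 6-clonesize
rm -f asm.tigStore/seqDB.v002.*
```

Then resume with runCA, with computeInsertSize=0 set as noted above.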
It's fixed now. This is what happened: I tried removing asm.fixes and v00[345] and then relaunching runCA. A new error showed up. So I deleted the 5-consensus folder (and 5-consensus-* and 6-clonesize) and the tigStore/v002 files. I also set computeInsertSize=0 and then reran runCA. Still the same error came back, this time with the 7-cgw folder. So I checked -
$ ls 5-consensus/tmp
asm.083.tmp.cns.in asm.083.tmp.layout
so I ran the consensus 83 job manually and then reran runCA (after deleting 5-consensus-*, 7-0-CGW, and the tigStore/v00[356] files). It went to completion :)
Thank you both for patiently walking me through the troubleshooting.
Yay!
I think the key was deleting v005 from tigStore. Not sure how it made it past consensus, but once v005 exists, runCA assumes consensus is finished. This means that your initial rerun(s) were completely ignored.
Hi,
I have the same problem here. I was running this command to correct PacBio reads with Illumina reads.
The error:
There is no tmp file in the 5-consensus directory, so it may not be a memory problem.
Last edit: medhat 2015-08-20
Hi,
You can try re-computing consensus by following the steps above to remove the results:
rm -rf 5-consensus*
rm -rf 6-clonesize
rm -rf asm.tigStore/seqDB.v002* and greater (leave seqDB.v001*)
Is it possible you have contigs < 500bp? The logs from PBcR should have a runCA command which specifies a frgMinLen and an ovlMinLength. If frgMinLen is < 500bp, it is possible to have contigs shorter than this, which would cause the PacBio consensus module (PBDAGCON) to not output a consensus. You can then re-run the runCA command but set consensus=cns to use CA's built-in consensus. It will be much slower than PBDAGCON but should produce results for all your contigs.
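As a spec-file fragment, the overrides mentioned in this thread would look like this (a sketch; consensus=cns from this post, computeInsertSize=0 from earlier in the thread):

```
# spec-file overrides discussed in this thread
consensus=cns          # use CA's built-in consensus instead of PBDAGCON
computeInsertSize=0    # skip the 6-clonesize insert-size estimation
```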
Last edit: Sergey Koren 2015-08-22