
#309 sam conversion jobs failed

Group: correction
Status: closed
Milestone: None
Priority: 5
Updated: 2015-10-05
Created: 2015-05-29
Private: No

Hello,
I tried to run the rc2 version on SGE and the pipeline quit (before the overlap jobs were submitted) with the following error -

ERROR: Overlap job assembly~//tempdtes_dip/1-overlapper/001/000001 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000002 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000003 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000004 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000005 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000006 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000007 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000008 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000009 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000010 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000011 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000012 FAILED.

12 sam conversion jobs failed.

Do you know what caused this?

Thank you.
Mahul

Related

Bugs: #309

Discussion

  • Sergey Koren

    Sergey Koren - 2015-05-29

    Hi,

    Is this a new run or resuming an existing one? What is your command line and the full stdout/err output of the run?

     
    • Mahul  Chakraborty

      Hi Sergey,

      This was a new run. Here is the command line -

      PBcR -l dtes_hq -s pacbio.spec -sensitive -noclean -fastq dtes_80_all.fq genomeSize=130000000 localStaging=/jje/tmp/

      Here is the link to the entire stderr/stdout until the pipeline quits -
      http://hpc.oit.uci.edu/~mchakrab/tes.err

      FYI, the same command works fine with rc1.
      Thanks,
      Mahul

       
  • Sergey Koren

    Sergey Koren - 2015-06-01

    I verified that grid submission with local staging works on our local system, so I am not sure what would have changed between rc1 and rc2. One thing I noticed is that you're specifying that the overlapper should use 128GB of RAM, but I didn't see a request for memory in your qsub options. Is it possible the jobs are getting scheduled on machines with insufficient memory? Do you have the error logs from the 1-overlapper/*.err files? Those should have more information on why the jobs failed.

     
    • Mahul  Chakraborty

      There are no 1-overlapper/*.err files, presumably because the overlapper
      did not run (see the last part of the error). Our nodes have 256GB and
      512GB RAM, so I didn't think I would need to request 128GB of RAM (also
      because rc1 worked with the same config). However, I can try rerunning the
      pipeline with an explicit request of 128GB RAM for the overlapper.

       

      Last edit: Brian Walenz 2015-06-01
  • Sergey Koren

    Sergey Koren - 2015-06-01

    Based on the output the overlapper was submitted to the grid:
    qsub -q jje,free64,pub64,abio,bio -pe openmp 32 -ckpt blcr -l kernel=blcr -cwd -N "pBcR_ovl_asm_dtes_hq" -t 1-12 -j y -o /dev/null /share/jje/mchakrab/pacbio/dtes/pacbio_assembly//tempdtes_hq/1-overlapper/overlap.sh

    So the err files should get created as soon as an overlap job launches on the grid. They could be missing if there was a scheduling error or another problem running the overlapper. You can try re-running the submission command (as long as you still have your temporary directory) but modifying -o /dev/null to -o `pwd`/\$TASK_ID.out (assuming you have SGE) and checking the error output of those files.
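    Sergey's suggestion can be sketched as a dry run that rebuilds the submission with per-task log files. Everything here (queue names, paths, task range) is copied from the qsub line in the log above; adjust for your site, inspect the command, and only then submit it:

```shell
# Rebuild the overlap array submission with per-task logs instead of
# -o /dev/null (SGE syntax; paths and queues copied from the log above).
OUTDIR=$(pwd)
CMD="qsub -q jje,free64,pub64,abio,bio -pe openmp 32 -ckpt blcr -l kernel=blcr \
 -cwd -N pBcR_ovl_asm_dtes_hq -t 1-12 -j y \
 -o ${OUTDIR}/\$TASK_ID.out \
 /share/jje/mchakrab/pacbio/dtes/pacbio_assembly/tempdtes_hq/1-overlapper/overlap.sh"
echo "$CMD"    # inspect first; submit with: eval "$CMD"
```

    With SGE, `\$TASK_ID` must reach the scheduler unexpanded so each array task writes its own `.out` file.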

     

    Last edit: Sergey Koren 2015-06-01
    • Mahul  Chakraborty

      OK, I submitted the job. There is a long queue wait on our cluster right now.
      I'll send you an update when I get the results back.

       

      Last edit: Brian Walenz 2015-06-01
  • Brian Walenz

    Brian Walenz - 2015-06-01

    Things I'm not sure are correct or are known:

    1) Verify that 'qconf -sp openmp' has allocation_rule = $pe_slots.
    2) Your memory setting for SGE needs to be 128/32 = 4g.

    Change "-j y -o /dev/null" to "-j y -o somefile.sgeout" so we can get the full log output, including any issues starting the command itself. You might be able to qalter this and keep the job in queue.
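    Brian's arithmetic in (2), as a sketch: with `allocation_rule $pe_slots`, SGE multiplies a per-slot memory request by the slot count, so a 128 GB overlapper job on 32 slots should request 4 GB per slot. The `h_vmem` resource name is an assumption; memory resource names vary by site configuration:

```shell
# Per-slot memory for an SGE parallel job: the scheduler multiplies the
# request by the slot count, so ask for total/slots per slot.
TOTAL_GB=128
SLOTS=32
PER_SLOT=$(( TOTAL_GB / SLOTS ))
echo "request: -pe openmp ${SLOTS} -l h_vmem=${PER_SLOT}G"   # 4G per slot
```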

     
  • Mahul  Chakraborty

    The issue is fixed: the path to java/1.8 was incorrect, and the stdout/stderr revealed it.

    By the way, I am seeing a new error in the Celera Assembler (after the 5-consensus stage). This is repeatable. Here is the relevant stdout/stderr -

    /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/cgw \ -j 1 -k 5 \ -r 5 \ -s 2 \ -filter 1 \ -minmergeweight 2 \ -S 0 \ -G \ -z \ -P 2 \ -B 2572 \ -shatter 0 \ -missingMate 0 \ -m 100 \ -g /jje/mchakrab/pacbio/dpse/pse80/asm.gkpStore \ -t /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore \ -o /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm \

    > /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out 2>&1
    sh: line 17: 38470 Aborted (core dumped) /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/cgw -j 1 -k 5 -r 5 -s 2 -filter 1 -minmergeweight 2 -S 0 -
    G -z -P 2 -B 2572 -shatter 0 -missingMate 0 -m 100 -g /jje/mchakrab/pacbio/dpse/pse80/asm.gkpStore -t /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore -o /jje/m
    chakrab/pacbio/dpse/pse80/6-clonesize/asm > /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out 2>&1
    ----------------------------------------END Sun May 31 02:29:28 2015 (8 seconds)
    ERROR: Failed with signal ABRT (6)
    ================================================================================

    runCA failed.

    Stack trace:

    at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 1651
    main::caFailure('scaffolder failed', '/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out') called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA
    line 5350
    main::CGW('6-clonesize', undef, '/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore', 2, undef, 0) called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-am
    d64/bin/runCA line 5561
    main::scaffolder() called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 6566


    Last few lines of the relevant log file (/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out):

    ERROR: Frag 320689 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 320775 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 321180 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 321628 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323029 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323159 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323455 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323552 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 324752 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 324969 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 325737 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 326154 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 326484 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 326489 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 327514 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 328743 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: 868 fragments are not in unitigs.
    cgw: Input_CGW.C:426: void ProcessInput(int, int, char**): Assertion `numErrors == 0' failed.

    Do you know what might be causing this? I can send you the 6-clonesize/cgw.out file or any other information that you think would be helpful.
    Thanks.
    Mahul

     
  • Brian Walenz

    Brian Walenz - 2015-06-03

    This is hinting that at least one unitig failed to generate consensus sequence. Look for errors in 5-consensus. Or, remove the *_???.err files and resubmit the consensus.sh script to the grid. It will only recompute unitigs with no consensus sequence.

     
  • Sergey Koren

    Sergey Koren - 2015-06-03

    I've seen some rare cases with large contigs (i.e. > 35 Mbp) where the PacBio unitig consensus can use over 32GB of RAM. What I'd suggest is checking for files named tmp in the 5-consensus directory. If you have them, it indicates at least one job failed, most likely out of memory. You can remove those tmp files as well as the .success files corresponding to those partitions (i.e. if there is an asm_001.tmp.fasta then remove asm_001.success as well as consensus.success). Then, remove 5-consensus-* and 6-* and manually run the failed partitions (cd 5-consensus && sh consensus.sh <failed id>). Then, you can run the latest runCA*.sh job to resume the assembly.
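    The tmp-file cleanup Sergey describes can be sketched as a small loop. The demo below builds a throwaway 5-consensus directory first so it is runnable anywhere; the file-naming pattern (asm_NNN.tmp.*, asm_NNN.success) is taken from this thread, and to use the loop for real you would point WORK at the actual assembly directory:

```shell
# Demo setup: a scratch 5-consensus with one failed partition (001 left a
# tmp file behind) and one finished partition (002).
WORK=$(mktemp -d)
mkdir "$WORK/5-consensus"
touch "$WORK/5-consensus/asm_001.tmp.fasta" \
      "$WORK/5-consensus/asm_001.success" \
      "$WORK/5-consensus/asm_002.success" \
      "$WORK/5-consensus/consensus.success"

# For every partition that left a tmp file, remove the tmp file, that
# partition's .success marker, and the global consensus.success, so a
# resumed run recomputes only the failed partitions.
cd "$WORK/5-consensus"
for tmp in asm_*.tmp.*; do
  [ -e "$tmp" ] || continue          # glob matched nothing
  rm -f "$tmp" "${tmp%%.tmp.*}.success" consensus.success
done
```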

    I'd also suggest modifying your spec to request 64GB of RAM total for the consensus step, which should avoid this error in the future (that has been sufficient for contigs up to 80 Mbp in my experience).

    Also, the java path can be explicitly set with the javaPath= parameter in your spec file. It will then use that path's java binary to run all the code, so you avoid having to make sure the right java is in your PATH.

     

    Last edit: Sergey Koren 2015-06-03
    • Mahul  Chakraborty

      So, indeed a consensus job had failed (but it did not have any contig > 1
      Mbp). I ran the job manually as Sergey suggested and it completed
      successfully. Then I restarted the runCA*.sh, but the error came back. I
      have attached the cgw.out file. Please let me know if you need any other
      information.
      Thanks.

       

      Last edit: Brian Walenz 2015-06-03
      • Sergey Koren

        Sergey Koren - 2015-06-03

        You have to follow the steps I listed to remove the 5-consensus/consensus.success file as well as the 5-consensus-* folders and the 6-clonesize folder. Without that, it will not update the latest version of the unitig store, causing the same error.

         

        Last edit: Brian Walenz 2015-06-03
        • Mahul  Chakraborty

          Here is what I did
          $ rm -r 5-consensus-
          $ rm -r 6-clonesize/
          $ rm 5-consensus/consensus.success
          $ bash pse80/runCA.
          .sh

          The error comes back.
          Maybe I am missing something. Should I have deleted something else too?

           

          Last edit: Brian Walenz 2015-06-03
          • Sergey Koren

            Sergey Koren - 2015-06-03

            The rm -r 5-consensus- is missing a wildcard on the end, it should be rm -r 5-consensus-*.

             

            Last edit: Brian Walenz 2015-06-03
            • Mahul  Chakraborty

              oops! sorry, it was actually
              $ rm -r 5-consensus-*

              When I copied from the terminal, the wild card probably didn't get selected.

               

              Last edit: Brian Walenz 2015-06-03
  • Brian Walenz

    Brian Walenz - 2015-06-03

    Please paste the contents of the tigStore directory (ls -l) and the log from the one utgcns job that found a unitig to compute.

    Is recomputing all unitig consensus sequences a possibility?

     
    • Mahul  Chakraborty

      $ ls -l asm.tigStore/ | head
      total 523956
      -rw-r--r-- 1 mchakrab jje 20 May 31 11:06 seqDB.v001.ctg
      -rw-r--r-- 1 mchakrab jje 93332 May 31 11:06 seqDB.v001.p001.dat
      -rw-r--r-- 1 mchakrab jje 3192 May 31 11:06 seqDB.v001.p001.utg
      -rw-r--r-- 1 mchakrab jje 358772 May 31 11:06 seqDB.v001.p002.dat
      -rw-r--r-- 1 mchakrab jje 72 May 31 11:06 seqDB.v001.p002.utg
      -rw-r--r-- 1 mchakrab jje 68216 May 31 11:06 seqDB.v001.p003.dat
      -rw-r--r-- 1 mchakrab jje 1164 May 31 11:06 seqDB.v001.p003.utg
      -rw-r--r-- 1 mchakrab jje 183900 May 31 11:06 seqDB.v001.p004.dat
      -rw-r--r-- 1 mchakrab jje 2048 May 31 11:06 seqDB.v001.p004.utg

      I didn't paste the entire list because it is really long. I can attach the
      list as a file if you want. Are you asking about the one unitig job that
      failed earlier but completed successfully later? If so, here it is

      Saving fixed unitigs to
      '/jje/mchakrab/pacbio/dpse/dpse_160/pse80/5-consensus/asm_066.fixes'; store
      NOT updated.
      Loading reads into memory.
      Checking unitig consensus for b=0 to e=8179

      Consensus finished successfully. Bye.

      I could recompute all unitig sequences. To do that, I will have to delete
      4-unitigger, 5-consensus, asm.tigStore, and relaunch runCA, right?

       

      Last edit: Brian Walenz 2015-06-03
      • Sergey Koren

        Sergey Koren - 2015-06-03

        I was going to say something similar to Brian. The asm.tigStore is versioned, so each step reads a fixed version it expects. The first time that version gets opened, it is imported from the previous version. If, for some reason, the restart didn't clean up some files, a step can run and read an existing (outdated) version and not realize your consensus is now fixed. I think that is what's happening with your run. The cleanup list I gave missed one file; clean up as:
        rm -rf 5-consensus-*
        rm -rf 6-*
        rm -rf 5-consensus/consensus.success
        rm -rf 5-consensus/asm.fixes (the one I missed)
        You can also remove all entries from asm.tigStore that are past v2 (so .v001. and .v002. should remain but greater numbers can be erased).

        If that continues to give you trouble, you can remove the 5-consensus* and asm.tigStore folders, which would force a recompute of all consensus. Make sure you add a memory request flag to your consensus submission command so the jobs don't run out of memory (since your unitigs are small, a few GB, say 8-10, should do).
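        The cleanup list above can be sketched as a script. This demo runs against a scratch directory so it is safe to execute anywhere; the directory name 5-consensus-coverage-stat is a hypothetical stand-in for whatever 5-consensus-* step directories your run created, and for real use you would run the rm lines from your actual assembly directory:

```shell
# Scratch assembly dir standing in for the real one.
ASM=$(mktemp -d)
cd "$ASM"
mkdir -p 5-consensus 5-consensus-coverage-stat 6-clonesize asm.tigStore
touch 5-consensus/consensus.success 5-consensus/asm.fixes \
      asm.tigStore/seqDB.v001.ctg asm.tigStore/seqDB.v002.ctg \
      asm.tigStore/seqDB.v003.ctg

rm -rf 5-consensus-*                # step dirs only; 5-consensus itself survives
rm -rf 6-*
rm -f  5-consensus/consensus.success 5-consensus/asm.fixes
rm -f  asm.tigStore/*.v00[3-9].*    # keep v001/v002, drop later versions
```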

         

        Last edit: Brian Walenz 2015-06-03
  • Brian Walenz

    Brian Walenz - 2015-06-03

    Wiki markup is eating the asterisks.

    I'll second the removal of asm.fixes and v00[345].

    To rerun consensus, just remove 5-consensus and all the tigStore/v002 files - NOT the entire tigStore directory. In all cases, remove 6-clonesize (and set computeInsertSize=0).

     
    • Mahul  Chakraborty

      It's fixed now. This is what happened: I tried removing asm.fixes and
      v00[345] and then relaunching runCA. A new error showed up. So I deleted
      the 5-consensus folder (and 5-consensus-* and 6-clonesize) and the
      tigStore/v002 files, set computeInsertSize=0, and reran runCA. Still the
      same error came back, this time with the 7-cgw folder. So I checked -
      $ ls 5-consensus/
      tmp
      asm.083.tmp.cns.in asm.083.tmp.layout
      so I ran the consensus 83 job manually and then reran runCA (after
      deleting 5-consensus-*, 7-0-CGW, and the tigStore/v00[356] files). It went
      to completion :)
      Thank you both for patiently walking me through the troubleshooting.


  • Brian Walenz

    Brian Walenz - 2015-06-05

    Yay!

    I think the key was deleting v005 from tigStore. Not sure how it made it past consensus, but once v005 exists, runCA assumes consensus is finished. This means that your initial rerun(s) were completely ignored.

     
  • Brian Walenz

    Brian Walenz - 2015-06-05
    • status: open --> closed
    • assigned_to: Sergey Koren
     
  • medhat

    medhat - 2015-08-20

    Hi,

    I have the same problem here.
    I was running this command to correct PacBio reads with Illumina reads -

    ~/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/pacBioToCA -length 500 -partitions 200 -l rice_pacbio -t 10 -s /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/pacbio.spec -fastq /data/test_data_from_server_maize/pbcr/rice/rice_pacbio.fa_0001.fastq

    The error -

    /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/cgw \ -j 1 -k 5 \ -r 5 \ -s 2 \ -filter 1 \ -minmergeweight 2 \ -S 0 \ -G \ -z \ -B 2349 \ -shatter 0 \ -missingMate 0 \ -m 100 \ -g /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/asm.gkpStore \ -t /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/asm.tigStore \ -o /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/asm \

    > /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/cgw.out 2>&1
    ----------------------------------------END Thu Aug 20 12:00:32 2015 (2 seconds)
    ERROR: Failed with signal ABRT (6)

    runCA failed.


    Stack trace:

    at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 1628.
    main::caFailure('scaffolder failed', '/data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clo...') called at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 5044
    main::CGW('6-clonesize', undef, '/data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clo...', 2, undef, 0) called at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 5255
    main::scaffolder() called at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 6260


    Last few lines of the relevant log file (/data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/cgw.out):

    ERROR: Frag 299012 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299044 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299095 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299097 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299207 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299275 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299295 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299376 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299574 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299800 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300226 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300335 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300393 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300450 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300613 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: 4937 fragments are not in unitigs.
    cgw: Input_CGW.C:426: void ProcessInput(int, int, char**): Assertion `numErrors == 0' failed.

    Failed with 'Aborted'

    There are no tmp files in the 5-consensus directory, so it may not be a memory problem.

     

    Last edit: medhat 2015-08-20
  • Sergey Koren

    Sergey Koren - 2015-08-22

    Hi,

    You can try re-computing consensus by following the steps above to remove the results:
    rm -rf 5-consensus*
    rm -rf 6-clonesize
    rm -rf asm.tigStore/*.v002.* and greater (leave v001*)

    Is it possible you have contigs < 500bp? The logs from PBcR should have a runCA command which specifies a frgMinLen and an ovlMinLength. If frgMinLen is < 500 bp, it is possible to have contigs shorter than this, which would cause the PacBio consensus module (PBDAGCON) to not output a consensus. You can then re-run the runCA command with consensus=cns set to use CA's built-in consensus. It will be much slower than PBDAGCON but should produce results for all your contigs.
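    One way to check Sergey's frgMinLen hypothesis is to grep the runCA invocation out of the PBcR log. The demo below fabricates a log line so it is self-contained (the value 200 and the file name are made up); for real use, point LOG at your captured stdout:

```shell
# Pull the minimum-fragment-length setting out of a PBcR/runCA log line.
LOG=$(mktemp)
echo 'runCA -s asm.spec frgMinLen=200 ovlMinLen=40 ...' > "$LOG"   # fake log line
grep -o 'frgMinLen=[0-9]*' "$LOG"    # prints frgMinLen=200 for the fake line
```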

     

    Last edit: Sergey Koren 2015-08-22
