
#309 sam conversion jobs failed

Group: correction
Status: closed
Milestone: None
Priority: 5
Updated: 2015-10-05
Created: 2015-05-29
Private: No

Hello,
I tried to run the rc2 version on SGE and the pipeline quit (before the overlap jobs were submitted) with the following error -

ERROR: Overlap job assembly~//tempdtes_dip/1-overlapper/001/000001 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000002 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000003 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000004 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000005 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000006 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000007 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000008 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000009 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000010 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000011 FAILED.
ERROR: Overlap job assembly//tempdtes_dip/1-overlapper/001/000012 FAILED.

12 sam conversion jobs failed.

Do you know what caused this?

Thank you.
Mahul

Related

Bugs: #309

Discussion

  • Sergey Koren

    Sergey Koren - 2015-05-29

    Hi,

    Is this a new run or resuming an existing one? What is your command line and the full stdout/err output of the run?

     
    • Mahul  Chakraborty

      Hi Sergey,

      This was a new run. Here is the command line -

      PBcR -l dtes_hq -s pacbio.spec -sensitive -noclean -fastq dtes_80_all.fq genomeSize=130000000 localStaging=/jje/tmp/

      Here is the link to the entire stderr/stdout until the pipeline quits -
      http://hpc.oit.uci.edu/~mchakrab/tes.err

      FYI, the same command works fine with rc1.
      Thanks,
      Mahul

       
  • Sergey Koren

    Sergey Koren - 2015-06-01

    I verified that grid submission with local staging works on our local system, so I am not sure what would have changed between rc1 and rc2. One thing I noticed is that you're specifying that the overlapper should use 128GB of RAM, but I didn't see a request for memory in your qsub options. Is it possible the jobs are getting scheduled on machines with insufficient memory? Do you have the error logs from the 1-overlapper/*.err files? Those should have more information on why the jobs failed.

     
    • Mahul  Chakraborty

      There are no 1-overlapper/*.err files, presumably because the overlapper
      did not run (see the last part of the error). Our nodes have 256GB and
      512GB RAM, so I didn't think I would need to request 128GB of RAM (also
      because rc1 worked with the same config). However, I can try rerunning the
      pipeline with an explicit request of 128GB RAM for the overlapper.

       

      Last edit: Brian Walenz 2015-06-01
  • Sergey Koren

    Sergey Koren - 2015-06-01

    Based on the output the overlapper was submitted to the grid:
    qsub -q jje,free64,pub64,abio,bio -pe openmp 32 -ckpt blcr -l kernel=blcr -cwd -N "pBcR_ovl_asm_dtes_hq" -t 1-12 -j y -o /dev/null /share/jje/mchakrab/pacbio/dtes/pacbio_assembly//tempdtes_hq/1-overlapper/overlap.sh

    So the err files should get created as soon as an overlap job launches on the grid. They could be missing if there was a scheduling error or another problem running the overlapper. You can try re-running the submission command (as long as you still have your temporary directory) but modifying -o /dev/null to -o `pwd`/\$TASK_ID.out (assuming you have SGE) and checking the error output of those files.
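    Sergey's suggestion can be sketched as a dry run that rebuilds the submission with per-task log files. Everything here (queue names, paths, task range) is copied from the qsub line in the log above; adjust for your site, inspect the command, and only then submit it:

```shell
# Rebuild the overlap array submission with per-task logs instead of
# -o /dev/null (SGE syntax; paths and queues copied from the log above).
OUTDIR=$(pwd)
CMD="qsub -q jje,free64,pub64,abio,bio -pe openmp 32 -ckpt blcr -l kernel=blcr \
 -cwd -N pBcR_ovl_asm_dtes_hq -t 1-12 -j y \
 -o ${OUTDIR}/\$TASK_ID.out \
 /share/jje/mchakrab/pacbio/dtes/pacbio_assembly/tempdtes_hq/1-overlapper/overlap.sh"
echo "$CMD"    # inspect first; submit with: eval "$CMD"
```

    With SGE, `\$TASK_ID` must reach the scheduler unexpanded so each array task writes its own `.out` file.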

     

    Last edit: Sergey Koren 2015-06-01
    • Mahul  Chakraborty

      OK, I submitted the job. There is a long queue wait on our cluster right now.
      I'll send you an update when I get the results back.

       

      Last edit: Brian Walenz 2015-06-01
  • Brian Walenz

    Brian Walenz - 2015-06-01

    Things I'm not sure are correct or are known:

    1) Verify that 'qconf -sp openmp' has allocation_rule = $pe_slots.
    2) Your memory setting for SGE needs to be 128/32 = 4g.

    Change "-j y -o /dev/null" to "-j y -o somefile.sgeout" so we can get the full log output, including any issues starting the command itself. You might be able to qalter this and keep the job in queue.
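    Brian's arithmetic in (2), as a sketch: with `allocation_rule $pe_slots`, SGE multiplies a per-slot memory request by the slot count, so a 128 GB overlapper job on 32 slots should request 4 GB per slot. The `h_vmem` resource name is an assumption; memory resource names vary by site configuration:

```shell
# Per-slot memory for an SGE parallel job: the scheduler multiplies the
# request by the slot count, so ask for total/slots per slot.
TOTAL_GB=128
SLOTS=32
PER_SLOT=$(( TOTAL_GB / SLOTS ))
echo "request: -pe openmp ${SLOTS} -l h_vmem=${PER_SLOT}G"   # 4G per slot
```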

     
  • Mahul  Chakraborty

    The issue is fixed: the path to java/1.8 was incorrect, and the stdout/stderr revealed it.

    By the way, I am seeing a new error in the Celera Assembler (after the 5-consensus stage). This is repeatable. Here is the relevant stdout/stderr -

    /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/cgw \ -j 1 -k 5 \ -r 5 \ -s 2 \ -filter 1 \ -minmergeweight 2 \ -S 0 \ -G \ -z \ -P 2 \ -B 2572 \ -shatter 0 \ -missingMate 0 \ -m 100 \ -g /jje/mchakrab/pacbio/dpse/pse80/asm.gkpStore \ -t /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore \ -o /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm \

    > /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out 2>&1
    sh: line 17: 38470 Aborted (core dumped) /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/cgw -j 1 -k 5 -r 5 -s 2 -filter 1 -minmergeweight 2 -S 0 -
    G -z -P 2 -B 2572 -shatter 0 -missingMate 0 -m 100 -g /jje/mchakrab/pacbio/dpse/pse80/asm.gkpStore -t /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore -o /jje/m
    chakrab/pacbio/dpse/pse80/6-clonesize/asm > /jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out 2>&1
    ----------------------------------------END Sun May 31 02:29:28 2015 (8 seconds)
    ERROR: Failed with signal ABRT (6)
    ================================================================================

    runCA failed.

    Stack trace:

    at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 1651
    main::caFailure('scaffolder failed', '/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out') called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA
    line 5350
    main::CGW('6-clonesize', undef, '/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/asm.tigStore', 2, undef, 0) called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-am
    d64/bin/runCA line 5561
    main::scaffolder() called at /dfs1/bio/mchakrab/pacbio/wgs-8.3rc1/Linux-amd64/bin/runCA line 6566


    Last few lines of the relevant log file (/jje/mchakrab/pacbio/dpse/pse80/6-clonesize/cgw.out):

    ERROR: Frag 320689 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 320775 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 321180 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 321628 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323029 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323159 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323455 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 323552 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 324752 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 324969 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 325737 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 326154 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 326484 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 326489 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 327514 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 328743 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: 868 fragments are not in unitigs.
    cgw: Input_CGW.C:426: void ProcessInput(int, int, char**): Assertion `numErrors == 0' failed.

    Do you know what might be causing this? I can send you the 6-clonesize/cgw.out file or any other information that you think would be helpful.
    Thanks.
    Mahul

     
  • Brian Walenz

    Brian Walenz - 2015-06-03

    This is hinting that at least one unitig failed to generate consensus sequence. Look for errors in 5-consensus. Or, remove the *_???.err files and resubmit the consensus.sh script to the grid. It will only recompute unitigs with no consensus sequence.

     
  • Sergey Koren

    Sergey Koren - 2015-06-03

    I've seen some rare cases with large contigs (i.e. > 35 Mbp) where the PacBio unitig consensus can use over 32GB of RAM. What I'd suggest is checking for files named tmp in the 5-consensus directory. If you have them, it indicates at least one job failed, most likely out of memory. You can remove those tmp files as well as the .success files corresponding to those partitions (i.e. if there is an asm_001.tmp.fasta then remove asm_001.success as well as consensus.success). Then, remove 5-consensus-* and 6-* and manually run the failed partitions (cd 5-consensus && sh consensus.sh <failed id>). Then, you can run the latest runCA*.sh job to resume the assembly.
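    The tmp-file cleanup Sergey describes can be sketched as a small loop. The demo below builds a throwaway 5-consensus directory first so it is runnable anywhere; the file-naming pattern (asm_NNN.tmp.*, asm_NNN.success) is taken from this thread, and to use the loop for real you would point WORK at the actual assembly directory:

```shell
# Demo setup: a scratch 5-consensus with one failed partition (001 left a
# tmp file behind) and one finished partition (002).
WORK=$(mktemp -d)
mkdir "$WORK/5-consensus"
touch "$WORK/5-consensus/asm_001.tmp.fasta" \
      "$WORK/5-consensus/asm_001.success" \
      "$WORK/5-consensus/asm_002.success" \
      "$WORK/5-consensus/consensus.success"

# For every partition that left a tmp file, remove the tmp file, that
# partition's .success marker, and the global consensus.success, so a
# resumed run recomputes only the failed partitions.
cd "$WORK/5-consensus"
for tmp in asm_*.tmp.*; do
  [ -e "$tmp" ] || continue          # glob matched nothing
  rm -f "$tmp" "${tmp%%.tmp.*}.success" consensus.success
done
```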

    I'd also suggest modifying your spec to request 64GB of RAM total for the consensus step, which should avoid this error in the future (that has been sufficient for contigs up to 80 Mbp in my experience).

    Also, the java path can be explicitly set with the javaPath= parameter in your spec file. It will then use that path's java binary to run all the code, so you avoid having to make sure the right java is in your PATH.

     

    Last edit: Sergey Koren 2015-06-03
    • Mahul  Chakraborty

      So, indeed a consensus job had failed (but it did not have any contig > 1
      Mbp). I ran the job manually as Sergey suggested and it completed
      successfully. Then I restarted the runCA*.sh, but the error came back. I
      have attached the cgw.out file. Please let me know if you need any other
      information.
      Thanks.

       

      Last edit: Brian Walenz 2015-06-03
      • Sergey Koren

        Sergey Koren - 2015-06-03

        You have to follow the steps I listed to remove the 5-consensus/consensus.success file as well as the 5-consensus-* folders and the 6-clonesize folder. Without that, it will not update the latest version of the unitig store, causing the same error.

         

        Last edit: Brian Walenz 2015-06-03
        • Mahul  Chakraborty

          Here is what I did
          $ rm -r 5-consensus-
          $ rm -r 6-clonesize/
          $ rm 5-consensus/consensus.success
          $ bash pse80/runCA.
          .sh

          The error comes back.
          Maybe I am missing something. Should I have deleted something else too?

           

          Last edit: Brian Walenz 2015-06-03
          • Sergey Koren

            Sergey Koren - 2015-06-03

            The rm -r 5-consensus- is missing a wildcard on the end, it should be rm -r 5-consensus-*.

             

            Last edit: Brian Walenz 2015-06-03
            • Mahul  Chakraborty

              oops! sorry, it was actually
              $ rm -r 5-consensus-*

              When I copied from the terminal, the wild card probably didn't get selected.

               

              Last edit: Brian Walenz 2015-06-03
  • Brian Walenz

    Brian Walenz - 2015-06-03

    Please paste the contents of the tigStore directory (ls -l) and the log from the one utgcns job that found a unitig to compute.

    Is recomputing all unitig consensus sequences a possibility?

     
    • Mahul  Chakraborty

      $ ls -l asm.tigStore/ | head
      total 523956
      -rw-r--r-- 1 mchakrab jje 20 May 31 11:06 seqDB.v001.ctg
      -rw-r--r-- 1 mchakrab jje 93332 May 31 11:06 seqDB.v001.p001.dat
      -rw-r--r-- 1 mchakrab jje 3192 May 31 11:06 seqDB.v001.p001.utg
      -rw-r--r-- 1 mchakrab jje 358772 May 31 11:06 seqDB.v001.p002.dat
      -rw-r--r-- 1 mchakrab jje 72 May 31 11:06 seqDB.v001.p002.utg
      -rw-r--r-- 1 mchakrab jje 68216 May 31 11:06 seqDB.v001.p003.dat
      -rw-r--r-- 1 mchakrab jje 1164 May 31 11:06 seqDB.v001.p003.utg
      -rw-r--r-- 1 mchakrab jje 183900 May 31 11:06 seqDB.v001.p004.dat
      -rw-r--r-- 1 mchakrab jje 2048 May 31 11:06 seqDB.v001.p004.utg

      I didn't paste the entire list because it is really long. I can attach the
      list as a file if you want. Are you asking about the one unitig job that
      failed earlier but completed successfully later? If so, here it is

      Saving fixed unitigs to
      '/jje/mchakrab/pacbio/dpse/dpse_160/pse80/5-consensus/asm_066.fixes'; store
      NOT updated.
      Loading reads into memory.
      Checking unitig consensus for b=0 to e=8179

      Consensus finished successfully. Bye.

      I could recompute all unitig sequences. To do that, I will have to delete
      4-unitigger, 5-consensus, asm.tigStore, and relaunch runCA, right?

       

      Last edit: Brian Walenz 2015-06-03
      • Sergey Koren

        Sergey Koren - 2015-06-03

        I was going to say something similar to Brian. The asm.tigStore is versioned, so each step reads a fixed version it expects. The first time that version gets opened, it is imported from the previous version. If, for some reason, the restart didn't clean up some files, a step can run and read an existing (outdated) version and not realize your consensus is now fixed. I think that is what's happening with your run. The cleanup list I gave missed one file; clean up as:
        rm -rf 5-consensus-*
        rm -rf 6-*
        rm -rf 5-consensus/consensus.success
        rm -rf 5-consensus/asm.fixes (the one I missed)
        You can also remove all entries from asm.tigStore that are past v2 (so .v001. and .v002. should remain but greater numbers can be erased).

        If that continues to give you trouble, you can remove the 5-consensus* and asm.tigStore folders, which would force a recompute of all consensus. Make sure you add a memory request flag to your consensus submission command so the jobs don't run out of memory (since your unitigs are small, a few GB, say 8-10, should do).
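        The cleanup list above can be sketched as a script. This demo runs against a scratch directory so it is safe to execute anywhere; the directory name 5-consensus-coverage-stat is a hypothetical stand-in for whatever 5-consensus-* step directories your run created, and for real use you would run the rm lines from your actual assembly directory:

```shell
# Scratch assembly dir standing in for the real one.
ASM=$(mktemp -d)
cd "$ASM"
mkdir -p 5-consensus 5-consensus-coverage-stat 6-clonesize asm.tigStore
touch 5-consensus/consensus.success 5-consensus/asm.fixes \
      asm.tigStore/seqDB.v001.ctg asm.tigStore/seqDB.v002.ctg \
      asm.tigStore/seqDB.v003.ctg

rm -rf 5-consensus-*                # step dirs only; 5-consensus itself survives
rm -rf 6-*
rm -f  5-consensus/consensus.success 5-consensus/asm.fixes
rm -f  asm.tigStore/*.v00[3-9].*    # keep v001/v002, drop later versions
```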

         

        Last edit: Brian Walenz 2015-06-03
  • Brian Walenz

    Brian Walenz - 2015-06-03

    Wiki markup is eating the asterisks.

    I'll second the removal of asm.fixes and v00[345].

    To rerun consensus, just remove 5-consensus and all the tigStore/v002 files - NOT the entire tigStore directory. In all cases, remove 6-clonesize (and set computeInsertSize=0).

     
    • Mahul  Chakraborty

      It's fixed now. This is what happened: I tried removing asm.fixes and
      v00[345] and then relaunching runCA. A new error showed up. So I deleted
      the 5-consensus folder (and 5-consensus-* and 6-clonesize) and the
      tigStore/v002 files, set computeInsertSize=0, and reran runCA. Still the
      same error came back, this time with the 7-cgw folder. So I checked -
      $ ls 5-consensus/
      tmp
      asm.083.tmp.cns.in asm.083.tmp.layout
      so I ran the consensus 83 job manually and then reran runCA (after
      deleting 5-consensus-*, 7-0-CGW, and the tigStore/v00[356] files). It went
      to completion :)
      Thank you both for patiently walking me through the troubleshooting.


  • Brian Walenz

    Brian Walenz - 2015-06-05

    Yay!

    I think the key was deleting v005 from tigStore. Not sure how it made it past consensus, but once v005 exists, runCA assumes consensus is finished. This means that your initial rerun(s) were completely ignored.

     
  • Brian Walenz

    Brian Walenz - 2015-06-05
    • status: open --> closed
    • assigned_to: Sergey Koren
     
  • medhat

    medhat - 2015-08-20

    Hi,

    I have the same problem here.
    I was running this command to correct PacBio reads with Illumina reads -

    ~/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/pacBioToCA -length 500 -partitions 200 -l rice_pacbio -t 10 -s /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/pacbio.spec -fastq /data/test_data_from_server_maize/pbcr/rice/rice_pacbio.fa_0001.fastq

    The error -

    /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/cgw \ -j 1 -k 5 \ -r 5 \ -s 2 \ -filter 1 \ -minmergeweight 2 \ -S 0 \ -G \ -z \ -B 2349 \ -shatter 0 \ -missingMate 0 \ -m 100 \ -g /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/asm.gkpStore \ -t /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/asm.tigStore \ -o /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/asm \

    > /data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/cgw.out 2>&1
    ----------------------------------------END Thu Aug 20 12:00:32 2015 (2 seconds)
    ERROR: Failed with signal ABRT (6)

    runCA failed.


    Stack trace:

    at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 1628.
    main::caFailure('scaffolder failed', '/data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clo...') called at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 5044
    main::CGW('6-clonesize', undef, '/data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clo...', 2, undef, 0) called at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 5255
    main::scaffolder() called at /home/medhat/source/wgs-8.3rc2/wgs-8.3rc2/Linux-amd64/bin/runCA line 6260


    Last few lines of the relevant log file (/data/test_data_from_server_maize/pbcr/rice/rice_pacbio/6-clonesize/cgw.out):

    ERROR: Frag 299012 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299044 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299095 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299097 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299207 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299275 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299295 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299376 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299574 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 299800 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300226 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300335 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300393 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300450 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: Frag 300613 has null cid or CIid. Fragment is not in an input unitig!
    ERROR: 4937 fragments are not in unitigs.
    cgw: Input_CGW.C:426: void ProcessInput(int, int, char**): Assertion `numErrors == 0' failed.

    Failed with 'Aborted'

    There are no tmp files in the 5-consensus directory, so it may not be a memory problem.

     

    Last edit: medhat 2015-08-20
  • Sergey Koren

    Sergey Koren - 2015-08-22

    Hi,

    You can try re-computing consensus by following the steps above to remove the results:
    rm -rf 5-consensus*
    rm -rf 6-clonesize
    rm -rf asm.tigStore/*.v002.* and greater (leave v001*)

    Is it possible you have contigs < 500bp? The logs from PBcR should have a runCA command which specifies a frgMinLen and an ovlMinLength. If frgMinLen is < 500 bp, it is possible to have contigs shorter than this, which would cause the PacBio consensus module (PBDAGCON) to not output a consensus. You can then re-run the runCA command with consensus=cns set to use CA's built-in consensus. It will be much slower than PBDAGCON but should produce results for all your contigs.
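    One way to check Sergey's frgMinLen hypothesis is to grep the runCA invocation out of the PBcR log. The demo below fabricates a log line so it is self-contained (the value 200 and the file name are made up); for real use, point LOG at your captured stdout:

```shell
# Pull the minimum-fragment-length setting out of a PBcR/runCA log line.
LOG=$(mktemp)
echo 'runCA -s asm.spec frgMinLen=200 ovlMinLen=40 ...' > "$LOG"   # fake log line
grep -o 'frgMinLen=[0-9]*' "$LOG"    # prints frgMinLen=200 for the fake line
```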

     

    Last edit: Sergey Koren 2015-08-22
