#271 Unreliable behavior of LSF gridHoldOption

Milestone: consensus
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-12-15
Created: 2014-04-18
Private: No

In the current svn version, the gridHoldOption for LSF is set on the job (or job array) name using the "done" dependency condition. In our environment this condition does not work reliably: a single finished job in a job array appears to be enough to satisfy the dependency. As a result, the pipeline often resumes while, for example, overlapper jobs are still running. A second drawback of the current setup is that runCA does not run properly on the grid when a failed run is resubmitted with the same 'sgeName', since the dependencies are already satisfied by the jobs from the earlier run.

I've experimented with several alternative submission options, and what seems to work best is to set inter-job dependencies using 'numended' instead of 'done' in the LSF gridHoldOption. 'numended' has to be given a job ID rather than a job name, which also solves the dependency problem when the same 'sgeName' is reused for a resubmitted run. The diff below outlines one possible implementation, though it could probably use some cleanup. Lines marked '<' are the proposed version; lines marked '>' are the current svn code:

197c197
<         setGlobal("gridHoldOption",         "-w \"numended\(\"WAIT_TAG\", \*\)\"");
---
>         setGlobal("gridHoldOption",         "-w \"done\(\"WAIT_TAG\"\)\"");
204c204
<        setGlobal("gridNameToJobIDCommand", "bjobs -A -J \"WAIT_TAG\" | grep -v JOBID");
---
>         setGlobal("gridNameToJobIDCommand", "bjobs -J \"WAIT_TAG\" | grep -v JOBID");
1438,1454c1438
<         if (getGlobal("gridEngine") eq "LSF"){
<             my $tcmd = getGlobal("gridNameToJobIDCommand");
<             $tcmd =~ s/WAIT_TAG/$waitTag/g;
<             my $propJobCount = `$tcmd |wc -l`;
<             my $list = `$tcmd`;
<             print STDERR $list;
<             chomp $propJobCount;
<             if ($propJobCount != 1) {
<                 print STDERR "Warning: multiple IDs for job $sgePropHold got $propJobCount and should have been 1.\n";
<             }
<             my $jobID = `$tcmd |tail -n 1 |awk '{print \$1}'`;
<             chomp $jobID;
<             print STDERR "$tcmd\nTranslated $waitTag to be job $jobID\n";
<             $hold =~ s/WAIT_TAG/$jobID/g;
<         } else{
<             $hold =~ s/WAIT_TAG/$waitTag/g;
<         }
---
>         $hold =~ s/WAIT_TAG/$waitTag/g;
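
As a possible cleanup of the name-to-ID translation in the diff above, the listing could be run once and parsed in Perl instead of calling the command three times through wc, tail and awk. The following is only a sketch under assumptions: the helper names parseJobIDs and buildHoldOption are hypothetical and not part of runCA, the sample listing (job ID, array name, owner, counts) is made up for illustration, and it takes the last listed ID in the same way the diff does with 'tail -n 1'.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the numeric job IDs from the text printed by
# a job-listing command such as gridNameToJobIDCommand, skipping the header.
sub parseJobIDs {
    my ($output) = @_;
    my @ids;
    foreach my $line (split /\n/, $output) {
        next if $line =~ m/^\s*JOBID/;         # header line
        next unless $line =~ m/^\s*(\d+)\s/;   # first column must be a numeric ID
        push @ids, $1;
    }
    return @ids;
}

# Hypothetical helper: replace WAIT_TAG in the hold option template with the
# last job ID found in the listing.
sub buildHoldOption {
    my ($template, $waitTag, $output) = @_;
    my @ids = parseJobIDs($output);
    die "No job ID found for '$waitTag'\n" if (scalar(@ids) == 0);
    if (scalar(@ids) > 1) {
        print STDERR "Warning: '$waitTag' matched " . scalar(@ids) . " jobs, using the last one.\n";
    }
    my $jobID = $ids[-1];
    (my $hold = $template) =~ s/WAIT_TAG/$jobID/g;
    return $hold;
}

# Illustrative listing only; every value below is invented.
my $sample = <<'EOF';
JOBID     ARRAY_SPEC  OWNER   NJOBS PEND DONE  RUN EXIT SSUSP USUSP PSUSP
123456    ovl_asm[1-  user1      40    0   12   28    0     0     0     0
EOF

# Same value as the proposed gridHoldOption, written with single quotes.
my $template = '-w "numended("WAIT_TAG", *)"';
print buildHoldOption($template, "ovl_asm", $sample), "\n";
# prints: -w "numended("123456", *)"
```

In runCA itself the listing would come from running the gridNameToJobIDCommand once (with WAIT_TAG replaced by the job name) and passing its output to the helper. Keeping the parsing separate from the command call also makes it possible to check the translation without access to an LSF cluster.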

Discussion

  • Sergey Koren - 2014-07-30

    Hi,

    I've updated the code in the repo based on your suggestions. It worked on the one LSF system I have access to and should hopefully work for your system.

    Sergey

     
  • Sergey Koren - 2014-07-30
    • status: open --> closed-fixed
     
