Our HPC resources at Princeton are moving from SGE to SLURM for resource management, and I was wondering if anyone has put any work towards getting runCA to work with SLURM and/or DRMAA (http://www.drmaa.org). I haven't done extensive work with Celera, but as our researchers start to use PacBio sequencing more, usage has picked up. Even if someone could give pointers on where to start to properly support such systems, that would be helpful.
I am also interested in runCA with SLURM. Any news?
Thanks
There was a user at Rutgers who was looking into this as well and was planning to contribute changes back to the project, but I'm not sure what the status of that effort is. We have no plans to do it ourselves and don't have access to a SLURM system for testing. I can give general guidance on how CA supports the various grid engines. It's not the cleanest implementation, but it's hard to keep it generic when the clusters vary so much in how they submit/schedule jobs. The basic requirements a scheduler must meet to run CA are that it supports array jobs, allows individual nodes to submit jobs (i.e. all nodes on the cluster have to be able to submit a job), supports a job holding for other jobs, and supports altering a running job to change its hold/other status. There are three blocks of code in runCA that configure a set of options for each grid. For example, here is the PBS block:
if (($var eq "gridEngine") && ($val eq "PBS")) {
    setGlobal("gridEngineSubmitCommand", "qsub");
    setGlobal("gridEngineHoldOption", "-W depend=afteranyarray:\"WAIT_TAG\"");
    setGlobal("gridEngineHoldOptionNoArray", "-W depend=afterany:\"WAIT_TAG\"");
    setGlobal("gridEngineSyncOption", "");
    setGlobal("gridEngineNameOption", "-d `pwd` -N");
    setGlobal("gridEngineArrayOption", "-t ARRAY_JOBS");
    setGlobal("gridEngineArrayName", "ARRAY_NAME[ARRAY_JOBS]");
    setGlobal("gridEngineOutputOption", "-j oe -o");
    setGlobal("gridEnginePropagateCommand", "qalter -W depend=afterany:\"WAIT_TAG\"");
    setGlobal("gridEngineNameToJobIDCommand", "qstat -f |grep -F -B 1 WAIT_TAG | grep Id: | grep -F [] |awk '{print \$NF}'");
    setGlobal("gridEngineNameToJobIDCommandNoArray", "qstat -f |grep -F -B 1 WAIT_TAG | grep Id: |awk '{print \$NF}'");
    setGlobal("gridEngineTaskID", "PBS_ARRAYID");
    setGlobal("gridEngineArraySubmitID", "\\$PBS_ARRAYID");
    setGlobal("gridEngineJobID", "PBS_JOBID");
}
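Filled in for an array job that must wait on a previously submitted array job, those templates produce a qsub call roughly like the following (the job name, dependency ID, array range, and script are invented here purely for illustration):

qsub -t 1-100 -d `pwd` -N ovl_000001 \
     -W depend=afteranyarray:12345[] \
     -j oe -o ovl_000001.out overlap.sh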
Basically, it has options that tell it how to submit jobs, how to hold for array and non-array jobs, and how to specify an array job, along with how to get identifiers for a running job (the grep/awk command) on systems that do not support holds based on job names. You can look at the code to see the options for LSF and SGE as well. I've never used SLURM, so I don't know how similar it is to any of the other engines and whether you can fit it into the above framework. If you can, then it should be relatively straightforward to customize the above parameters for it. Not all options must be defined. It's possible you can significantly simplify your options if SLURM supports jobs holding for other jobs by name (SGE supports this, but neither PBS nor LSF does). That is what the gridEngineNameToJobIDCommand* options do, and they are undefined on SGE. SGE also doesn't differentiate holds for array jobs from holds for regular jobs (LSF and PBS both do), so if SLURM is more like SGE in that sense, you may also only need to define gridEngineHoldOption, not gridEngineHoldOptionNoArray. Here is the SGE chunk of code; you can see several undefined variables:
if (($var eq "gridEngine") && ($val eq "SGE")) {
    setGlobal("gridEngineSubmitCommand", "qsub");
    setGlobal("gridEngineHoldOption", "-hold_jid \"WAIT_TAG\"");
    setGlobal("gridEngineHoldOptionNoArray", undef);
    setGlobal("gridEngineSyncOption", "-sync y");
    setGlobal("gridEngineNameOption", "-cwd -N");
    setGlobal("gridEngineArrayOption", "-t ARRAY_JOBS");
    setGlobal("gridEngineArrayName", "ARRAY_NAME");
    setGlobal("gridEngineOutputOption", "-j y -o");
    setGlobal("gridEnginePropagateCommand", "qalter -hold_jid \"WAIT_TAG\"");
    setGlobal("gridEngineNameToJobIDCommand", undef);
    setGlobal("gridEngineNameToJobIDCommandNoArray", undef);
    setGlobal("gridEngineTaskID", "SGE_TASK_ID");
    setGlobal("gridEngineArraySubmitID", "\\$TASK_ID");
    setGlobal("gridEngineJobID", "JOB_ID");
}
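Because SGE can hold on the upstream job by name, WAIT_TAG stays a job name and no qstat lookup is needed, which is why the gridEngineNameToJobIDCommand* options are undefined here. A filled-in SGE submission looks roughly like this (again, names, range, and script are just for illustration):

qsub -cwd -N cns_asm -t 1-50 -hold_jid "ovl_asm" -j y -o cns_asm.out consensus.sh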
Sergey
Hi Sergey et al.,
I noticed a tweet about this, and it was the first time I'd heard of SLURM, so I thought it would be a fun exercise to look into modifying 'runCA.pl' to support it (based on your suggestions).
I've put together a patch script that I've checked into GitHub here:
https://github.com/brettwhitty/bw-ca-tools/blob/master/runCA-slurm-patch/do_runCA_slurm_patch.sh
that adds the following SLURM-supporting variables:
+ if (($var eq "gridEngine") && ($val eq "SLURM")) {
+     setGlobal("gridEngineSubmitCommand", "sbatch");
+     setGlobal("gridEngineHoldOption", "--depend=afterany:\"WAIT_TAG\"");
+     setGlobal("gridEngineHoldOptionNoArray", "--depend=afterany:\"WAIT_TAG\"");
+     setGlobal("gridEngineSyncOption", ""); ## TODO: SLURM may not support w/out wrapper; See LSF bsub manpage to compare
+     setGlobal("gridEngineNameOption", "-D `pwd` -J");
+     setGlobal("gridEngineArrayOption", "-a ARRAY_JOBS");
+     setGlobal("gridEngineArrayName", "ARRAY_NAME[ARRAY_JOBS]");
+     setGlobal("gridEngineOutputOption", "-o"); ## NB: SLURM default joins STDERR & STDOUT if no -e specified
+     setGlobal("gridEnginePropagateCommand", "scontrol update job=\"WAIT_TAG\""); ## TODO: manually verify this in all cases
+     setGlobal("gridEngineNameToJobIDCommand", "squeue -h -o\%F_* -n \"WAIT_TAG\" | uniq"); ## TODO: manually verify this in all cases
+     setGlobal("gridEngineNameToJobIDCommandNoArray", "squeue -h -o\%i -n \"WAIT_TAG\""); ## TODO: manually verify this in all cases
+     setGlobal("gridEngineTaskID", "SLURM_ARRAY_TASK_ID");
+     setGlobal("gridEngineArraySubmitID", "%A_%a");
+     setGlobal("gridEngineJobID", "SLURM_JOB_ID");
+ }
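For anyone who wants to sanity-check the mapping, a filled-in submission from these templates should come out looking roughly like the following (the job name, dependency ID, array range, and script are made up here just to show how the pieces land on the sbatch command line):

sbatch -D `pwd` -J ovl_asm -a 1-100 \
       --depend=afterany:12345 \
       -o ovl_asm_%A_%a.out overlap.sh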
I initially worked from the SLURM man pages; N1/S/OGE is old hat for me, so I started from the SGE (and PBS/Torque) examples to map over the behavior the code seemed to be expecting. Then I set up a small SLURM virtual cluster and did a few test runs.
I was away the last few days and haven't gotten back to testing yet this week, but as far as I can tell it works OK.
Array submissions seem to work, holds on array jobs seem to work, output file naming seems OK.
The only thing in the code that seemed a bit sketchy to me is the nested variable replacement that happens with 'WAIT_TAG', especially as it relates to 'gridEnginePropagateCommand'. The intent behind the naming of that variable wasn't clear to me, but following through the code, everything does seem to work as it should; I haven't tested enough, though, to understand the special cases that may be buried in those deeply nested code blocks.
I'd be happy to polish this off should anyone have feedback or errors from their own testing; otherwise, after I do a couple more tests and am satisfied, I'll consider it a completed exercise. Hope this is useful to a few people.
Regards,
Brett
I have an opportunity to apply for resources (in the form of a competent person's time) to do work towards this. Should I go ahead or has somebody already started?
Several users have had success with the SLURM patch on GitHub, so I would recommend that route for now. The only change I've seen users have to make is:
Changing
setGlobal("gridEngineNameToJobIDCommand", "squeue -h -o\%F_* -n \"WAIT_TAG\" | uniq");
to
setGlobal("gridEngineNameToJobIDCommand", "squeue -h -o\%F -n \"WAIT_TAG\" | uniq");
That is, removing the * from the option.
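If you want to confirm what that command returns on your SLURM version, a quick test along these lines shows the ID that WAIT_TAG resolves to (the job name and wrapped command are invented for illustration):

sbatch -J idtest -a 1-4 --wrap="sleep 60"
squeue -h -o %F -n idtest | uniq    # should print a single array job ID, e.g. 12345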