Re: [BProc] Re: is this a good candidate for bproc?

> I have a simple python batch/queuing system that up until now has worked for
> me. I looked at sge+bproc - but as far as I can tell you have to manually
> reconfigure sge when nodes become unavailable. It can probably be set up to
> automatically recognize cluster reconfigurations but it's not obvious to me
> how to do it.

It should be easy to reconfigure SGE dynamically when nodes become
available or unavailable. With my approach, where each node looks like
a queue on the master node, one just has to call

  qmod -e $N

to enable the queue when node N becomes available, so this command
should probably go at the end of bproc's node_up script (where
N=$1), and

  qmod -d $N

to disable the queue when the node becomes unavailable. I am not sure
whether bproc has anything like a node_down script (I thought it did,
but I do not see one on my cluster just now); if it does, that script
should start with "qmod -d $N". We could also test the node's sanity in
SGE's prolog and epilog scripts (run before and after the job) and call
"qmod -d $N" there when needed. (The epilog script could even
reschedule the job if the node died while running it, provided the job
is re-runnable.)
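
As a rough sketch of the hook idea above, the enable/disable calls could
be wrapped in one small shell function shared by node_up and by a
node_down script, if bproc provides one. The function name
toggle_node_queue and the QMOD override variable are made up for this
illustration; only "qmod -e" and "qmod -d" come from the description
above, and like the rest of this it is untested on a real cluster.

```shell
#!/bin/sh
# QMOD is a hypothetical override so the function can be dry-run with
# QMOD=echo; by default it calls the real qmod.
QMOD="${QMOD:-qmod}"

# toggle_node_queue EVENT N
# Enable queue N on "up", disable it on "down" (one queue per node assumed).
toggle_node_queue() {
    event="$1"
    node="$2"
    # Accept only plain node numbers, as passed in $1 to node_up.
    case "$node" in
        ''|*[!0-9]*) echo "bad node number: $node" >&2; return 2 ;;
    esac
    case "$event" in
        up)   $QMOD -e "$node" ;;
        down) $QMOD -d "$node" ;;
        *)    echo "unknown event: $event" >&2; return 2 ;;
    esac
}
```

The last line of node_up would then be something like
"toggle_node_queue up $1", and a prolog or epilog sanity check could
call "toggle_node_queue down $N" when it finds the node unhealthy.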

Another simple approach is to run a script on the master that calls
"bpstat" and then "qmod -d ..." for each unavailable node, every 30
seconds or so.

If all the jobs are written to be re-runnable (they can be aborted at
any moment and run again on a different node, which usually means that
the job does not change any of its input files), it should be easy to
create a node-fault-tolerant system.

All this is untested; please let me know if you try it.

Best Regards

Vaclav Hanzl