From: <ha...@no...> - 2003-06-27 08:59:26
> I have a simple python batch/queuing system that up until now has worked for
> me. I looked at sge+bproc - but as far as I can tell you have to manually
> reconfigure sge when nodes become unavailable. It can probably be set up to
> automatically recognize cluster reconfigurations but it's not obvious to me
> how to do it.

It should be easy to reconfigure SGE dynamically as nodes become available or unavailable. With my approach, where each node looks like a queue on the master node, one just has to call "qmod -e $N" to enable the queue when node N becomes available, and "qmod -d $N" to disable the queue when the node becomes unavailable. The enable command should probably go at the end of bproc's node_up script (where N=$1). I am not sure there is anything like a node_down script in bproc (I thought there was, but I do not see it on my cluster just now); if there is, it should start with "qmod -d $N".

We could also test the node's sanity in SGE's prolog and epilog scripts (run before and after the job) and call "qmod -d $N" there when needed. (The epilog script could even re-schedule the job if the node died while running it, provided the job is re-runnable.)

Another simple approach is to run a script on the master that does "bpstat" and then "qmod -d ..." every 30 seconds or so.

If all the jobs are written as re-runnable (they can be aborted at any moment and run again on a different node, which usually means that the job does not change any of its input files), it should be easy to create a node-fault-tolerant system.

All this is untested, please let me know if you try it.

Best Regards

Vaclav Hanzl
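
The enable/disable logic described above could be sketched roughly as follows. This is only an illustration and, like the rest of the message, untested against a real cluster. The function names sync_queue and sync_all, the QMOD variable, and the "NODE STATE" input format are hypothetical. It assumes each node N has an SGE queue named N on the master, and it deliberately does not parse bpstat output itself.

```shell
#!/bin/sh
# Sketch only: map a node state to the matching SGE queue command.
# Assumption: node N is represented by an SGE queue named N on the master.
# QMOD defaults to "echo qmod" (a dry run that just prints the command);
# set QMOD=qmod to act on a real SGE installation.
QMOD=${QMOD:-echo qmod}

# sync_queue NODE STATE
# Enable the queue when STATE is "up"; disable it for any other state.
sync_queue() {
    case "$2" in
        up) $QMOD -e "$1" ;;
        *)  $QMOD -d "$1" ;;
    esac
}

# sync_all
# Read "NODE STATE" pairs from stdin, one per line (a hypothetical format;
# converting bpstat output into this form is left to the caller).
sync_all() {
    while read -r node state; do
        if [ -n "$node" ]; then
            sync_queue "$node" "$state"
        fi
    done
}
```

At the end of node_up one would then call something like sync_queue "$1" up, and the periodic check on the master could feed the current node states to sync_all every 30 seconds or so.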