From: <er...@he...> - 2004-03-04 18:17:48
On Thu, Mar 04, 2004 at 11:54:36AM -0500, Daniel Gruner wrote:
> Hi
>
> I have experienced a strange phenomenon on my alpha cluster. It is running
> bproc 3.2.6, on alpha UX machines. For the most part the cluster behaves
> quite normally, allowing me to run jobs, and all the normal stuff.
>
> However, I am testing a fairly short, highly cpu-intensive job, simply to
> have a way of submitting many jobs using bjs and learn its functioning,
> and it appears that the job is so cpu-intensive that the node appears
> dead to the master (i.e. it does not respond or something like that),
> and it dies before the job is completed. Well, actually the job manages
> to complete, but the node is reset anyway. If I do "ps" on the master I
> don't see the job (actually, it says it has not used up any time), nor do
> I see it appear in "top". Is it possible that the node gets TOO busy with
> the computation? I append two files: The program itself (waster.cpp),
> and its output (junk). The command line to run the program was:
>
> bpsh 1 -I /dev/null ./waster >& junk &
>
> The program ends with:
> [1] Exit 255 bpsh 1 -I /dev/null ./waster >& junk
> and the node is reset. From the /var/log/messages file I get:
> Mar 4 11:29:57 racaille bpmaster: ping timeout on slave 1
>
> It looks like the node is too busy computing to even respond to pings...

That shouldn't be possible, but maybe something is going wrong with
priorities. The slave daemon is supposed to run with an elevated
priority to avoid these starvation issues. I saw this kind of behavior
once, when somebody decided to start 1500 cpu-intensive processes on a
slave node. In any case, sharing with one other process shouldn't be a
problem.

A possible problem could come up if the slave daemon failed to reset
priorities for the child processes it created. I'm not seeing this
problem on our systems here, so I suspect that's not it.

First, a few questions:

What kernel version?
Are you starting with a kernel.org kernel?
Any patches other than BProc?
How many cpus?

Anyway, a few things you can do to try to figure out what's going on...
(I tried this on our alphas real quick and it doesn't seem to be
happening here.)

- Comment out this stuff in the slave daemon:

      /* bump our priority to RT to avoid getting hosed by errant
       * stuff that gets run on our node */
      p.sched_priority = 1;
      if (sched_setscheduler(0, SCHED_FIFO, &p))
          syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
                 " slave daemon.\n");

  and rebuild the slave daemon. (Also reinstall libbpslave.a and rebuild
  the phase 2 boot image if you're using the rest of clustermatic.)

- bpsh other stuff (e.g. uptime) to the node while this job is running
  but before the node dies. Is it responsive or is it just completely
  dead? Is it really slow?

- Run something else alongside the waster that does something like:

      while (1) {
          printf("hi\n");
          fflush(stdout);
          sleep(1);
      }

  Does it get starved and stop printing?

- Finally, if everything seems to be working but slow or something like
  that, you can up the ping timeout by adding a line like this to your
  /etc/beowulf/config:

      pingtimeout 120

  120 is the timeout in seconds. The default is 30.

- Erik