Re: [BProc] abnormal dying of nodes

On Thu, Mar 04, 2004 at 11:54:36AM -0500, Daniel Gruner wrote:
> Hi
> 
> I have experienced a strange phenomenon on my alpha cluster.  It is running
> bproc 3.2.6, on alpha UX machines.  For the most part the cluster behaves
> quite normally, allowing me to run jobs, and all the normal stuff.
> 
> However, I am testing a fairly short, highly cpu-intensive job, simply to
> have a way of submitting many jobs using bjs and learn its functioning,
> and it appears that the job is so cpu-intensive that the node appears
> dead to the master (i.e. it does not respond or something like that),
> and it dies before the job is completed.  Well, actually the job manages
> to complete, but the node is reset anyway.  If I do "ps" on the master I
> don't see the job (actually, it says it has not used up any time), nor do
> I see it appear in "top".  Is it possible that the node gets TOO busy with
> the computation?  I append two files:  The program itself (waster.cpp),
> and its output (junk).  The command line to run the program was:
> 
> 	bpsh 1 -I /dev/null ./waster >& junk &
> 
> The program ends with:
> 	[1]    Exit 255                      bpsh 1 -I /dev/null ./waster >& junk
> and the node is reset.  From the /var/log/messages file I get:
> 	Mar  4 11:29:57 racaille bpmaster: ping timeout on slave 1
> 
> It looks like the node is too busy computing to even respond to pings...

That shouldn't be possible but maybe something is going wrong with
priorities or something.  The slave daemon is supposed to run with an
elevated priority to avoid these starvation issues.  I saw this kind
of behavior once when somebody decided to start 1500 cpu intensive
processes on a slave node.  In any case, sharing with one other
process shouldn't be a problem.  A possible problem could come up if
the slave daemon failed to reset priorities for the child processes it
created.  I'm not seeing this problem on our systems here so I suspect
that that's not it.
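
One way to check whether the slave daemon really got its real-time
priority, and whether the processes it starts dropped back to normal
scheduling, is to query the scheduling policy of a pid on the node.
The following is only a minimal sketch, assuming Linux and the
standard sched_getscheduler()/sched_getparam() calls.  The names
policy_name, report_sched and the SCHEDCHECK_MAIN macro are made up
for this example and are not part of BProc.

```c
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Map a scheduling policy number to a readable name. */
static const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_OTHER: return "SCHED_OTHER";
    default:          return "unknown";
    }
}

/* Print the policy and real-time priority of one process
 * (pid 0 means the calling process).
 * Returns 0 on success, -1 if the process could not be queried. */
static int report_sched(pid_t pid)
{
    struct sched_param p;
    int policy = sched_getscheduler(pid);

    if (policy < 0 || sched_getparam(pid, &p) != 0) {
        perror("sched_getscheduler/sched_getparam");
        return -1;
    }
    printf("pid %ld: %s, priority %d\n",
           (long)pid, policy_name(policy), p.sched_priority);
    return 0;
}

/* Hypothetical command-line wrapper: build with -DSCHEDCHECK_MAIN
 * and pass the pid to inspect as the first argument. */
#ifdef SCHEDCHECK_MAIN
int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atol(argv[1]) : 0;
    return report_sched(pid) == 0 ? 0 : 1;
}
#endif
```

If the priority setup is working as intended, the slave daemon's pid
should report SCHED_FIFO with priority 1, and the waster process
should report SCHED_OTHER.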

First a few questions:

What kernel version?
Are you starting with a kernel.org kernel?
Any patches other than BProc?
How many cpus?

Anyway, a few things you can do to try to figure out what's going
on...  (I tried this on our alphas real quick and it doesn't seem to
be happening here.)

  - Comment out this stuff in the slave daemon:

    /* bump our priority to RT to avoid getting hosed by errant
     * stuff that gets run on our node */
    p.sched_priority = 1;
    if (sched_setscheduler(0, SCHED_FIFO, &p))
        syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
               " slave daemon.\n");

    and rebuild the slave daemon.  (also reinstall libbpslave.a and
    rebuild the phase 2 boot image if you're using the rest of
    clustermatic)

  - bpsh other stuff (e.g. uptime) to the node while this job is
    running but before the node dies.  Is it responsive or is it just
    completely dead?  Is it really slow?

  - run something else alongside the waster that does something like:

    while(1) { printf("hi\n"); fflush(stdout); sleep(1); }

    does it get starved and stop printing?

  - Finally, if everything seems to be working but slow or something
    like that, you can up the ping timeout by adding a line like this
    to your /etc/beowulf/config:

    pingtimeout 120

    120 is the timeout in seconds.  The default is 30.

- Erik