Re: [BProc] abnormal dying of nodes


On Thu, Mar 04, 2004 at 02:16:05PM -0500, Daniel Gruner wrote:
> > Anyway, here are a few things you can do to figure out whether
> > that's what's going on...  (I tried this on our alphas real quick
> > and it doesn't seem to be happening here.)
> > 
> >   - Comment out this stuff in the slave daemon:
> > 
> >     /* bump our priority to RT to avoid getting hosed by errant
> >      * stuff that gets run on our node */
> >     p.sched_priority = 1;
> >     if (sched_setscheduler(0, SCHED_FIFO, &p))
> >         syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
> >                " slave daemon.\n");
> > 
> >     and rebuild the slave daemon.  (also reinstall libbpslave.a and
> >     rebuild the phase 2 boot image if you're using the rest of
> >     clustermatic)
> 
> I will try, if I get some time...
> 
> > 
> >   - bpsh other stuff (e.g. uptime) to the node while this job is
> >     running but before the node dies.  Is it responsive or is it just
> >     completely dead?  Is it really slow?
> 
> Nothing else runs.  "bpsh 1 uptime" just hangs.  The process table on the
> master does not get updated, and the waster does not appear on "top" at
> all.  The node dies (is killed, actually, because it times out) although the
> waster job manages to finish.  I guess even the bpctl does not get through...

Well, it sounds like the starvation of the slave daemon is more or
less complete in that case.  It's pretty much got to be a scheduler
problem.  If commenting out the bit of code above and this tidbit in
kernel/slave.c fixes it, then that will confirm it.

    /* Knock our priority back to default */
    current->policy      = SCHED_OTHER;
    current->nice        = DEF_NICE;
    current->rt_priority = 0;

If that fixes it, I'm pretty sure that either the kernel you're using
has some scheduler-related patch in it or my code is subtly buggy in a
way that doesn't crop up on the kernels that I've built.

If you're running as root, you can also call
sched_setscheduler(0, SCHED_OTHER, ...) at the start of your program
to confirm it.
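
Something along these lines should do for that test.  This is only a
rough sketch: the helper name reset_to_sched_other() is made up for
illustration and is not part of BProc.  It assumes the standard
<sched.h> interface, where sched_priority has to be 0 for SCHED_OTHER.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of BProc): put the calling process
 * back on the default time-sharing scheduler.
 * Returns 0 on success, -1 on failure. */
static int reset_to_sched_other(void)
{
    struct sched_param p;

    memset(&p, 0, sizeof(p));
    p.sched_priority = 0;   /* must be 0 for SCHED_OTHER */

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_OTHER, &p) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

Call it first thing in main(), before the compute loop starts.  If the
node stays responsive with that call in place, that points at the
scheduling policy rather than at the network.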

> >   - Finally, if everything seems to be working but slow or something
> >     like that you can up the ping timeout by adding a line like this
> >     to your /etc/beowulf/config.
> > 
> >     pingtimeout 120
> > 
> >     120 is the timeout in seconds.  The default is 30.
> 
> Not a solution yet...
> 
> Could it be something to do with the network driver?  I am using eepro100
> cards.  Here is the list of loaded modules on the nodes:

I seriously doubt it.  You're still getting output from the remote
process which means the network and TCP are all still working fine.

- Erik