Re: [BProc] abnormal dying of nodes

Erik,

Thanks for the comments.  I will do my best to work on the testing,
but I have another question:  Have you tried either clustermatic
3 or 4 with a machine like mine?  They are UX (ruffian) boards, made
by Samsung, with EV56 @600 MHz.

Perhaps somebody else on the list has tried these?

Thanks,
Daniel

On Thu, Mar 04, 2004 at 06:13:15PM -0700, er...@he... wrote:
> On Thu, Mar 04, 2004 at 02:16:05PM -0500, Daniel Gruner wrote:
> > > Anyway, a few things you can do to try to figure out if that's what's
> > > going on...  (I tried this on our alphas real quick and it doesn't
> > > seem to be happening here.)
> > > 
> > >   - Comment out this stuff in the slave daemon:
> > > 
> > >     /* bump our priority to RT to avoid getting hosed by errant
> > >      * stuff that gets run on our node */
> > >     p.sched_priority = 1;
> > >     if (sched_setscheduler(0, SCHED_FIFO, &p))
> > >         syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
> > >                " slave daemon.\n");
> > > 
> > >     and rebuild the slave daemon.  (also reinstall libbpslave.a and
> > >     rebuild the phase 2 boot image if you're using the rest of
> > >     clustermatic)
> > 
> > I will try, if I get some time...
> > 
> > > 
> > >   - bpsh other stuff (e.g. uptime) to the node while this job is
> > >     running but before the node dies.  Is it responsive or is it just
> > >     completely dead?  Is it really slow?
> > 
> > Nothing else runs.  "bpsh 1 uptime" just hangs.  The process table on the
> > master does not get updated, and the waster does not appear on "top" at
> > all.  The node dies (is killed, actually, because it times out) although the
> > waster job manages to finish.  I guess even the bpctl does not get through...
> 
> Well, it sounds like the starvation of the slave daemon is more or
> less complete in that case.  It's pretty much got to be a scheduler
> problem.  If commenting out the bit of code above and this tidbit in
> kernel/slave.c fixes it, then that will confirm it.
> 
>     /* Knock our priority back to default */
>     current->policy      = SCHED_OTHER;
>     current->nice        = DEF_NICE;
>     current->rt_priority = 0;
> 
> If that fixes it I'm pretty sure that either the kernel you're using
> has some scheduler related patch in it or my code is subtly buggy in a
> way that doesn't crop up on the kernels that I've built.
> 
> If you're running as root, you can also try to do
> sched_setscheduler(0, SCHED_OTHER, ...) in your program to try and
> confirm it.
> 
> > >   - Finally, if everything seems to be working but slow or something
> > >     like that you can up the ping timeout by adding a line like this
> > >     to your /etc/beowulf/config.
> > > 
> > >     pingtimeout 120
> > > 
> > >     120 is the timeout in seconds.  The default is 30.
> > 
> > Not a solution yet...
> > 
> > Could it be something to do with the network driver?  I am using eepro100
> > cards.  Here is the list of loaded modules on the nodes:
> 
> I seriously doubt it.  You're still getting output from the remote
> process which means the network and TCP are all still working fine.
> 
> - Erik
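
For reference, the sched_setscheduler(0, SCHED_OTHER, ...) test suggested
above might look roughly like the sketch below.  It assumes a Linux/glibc
system; the helper name drop_to_sched_other() is made up for illustration
and is not part of BProc or clustermatic.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from BProc): put the calling process back on
 * the default time-sharing scheduler, in case it inherited a real-time
 * policy.  Returns 0 on success, -1 on failure. */
static int drop_to_sched_other(void)
{
    struct sched_param p;

    memset(&p, 0, sizeof(p));
    p.sched_priority = 0;   /* SCHED_OTHER only accepts priority 0 */

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_OTHER, &p) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

Calling drop_to_sched_other() as the first thing in main() of the job and
then rerunning it would be one way to do the check: if the node stays
responsive to "bpsh 1 uptime" afterwards, that points at the scheduling
policy.  sched_getscheduler(0) can be printed before and after the call to
see which policy the process actually started with.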

-- 

Dr. Daniel Gruner                        dg...@ti...
Dept. of Chemistry                       dan...@ut...
University of Toronto                    phone:  (416)-978-8689
80 St. George Street                     fax:    (416)-978-5325
Toronto, ON  M5S 3H6, Canada             finger for PGP public key