From: Daniel G. <dg...@ti...> - 2004-03-04 19:22:25
Thanks for the quick response, Erik.

On Thu, Mar 04, 2004 at 10:51:09AM -0700, er...@he... wrote:
> On Thu, Mar 04, 2004 at 11:54:36AM -0500, Daniel Gruner wrote:
> > Hi
> >
> > I have experienced a strange phenomenon on my alpha cluster. It is running
> > bproc 3.2.6, on alpha UX machines. For the most part the cluster behaves
> > quite normally, allowing me to run jobs, and all the normal stuff.
> >
> > However, I am testing a fairly short, highly cpu-intensive job, simply to
> > have a way of submitting many jobs using bjs and learning how it works,
> > and it appears that the job is so cpu-intensive that the node appears
> > dead to the master (i.e. it does not respond, or something like that),
> > and it dies before the job is completed. Well, actually the job manages
> > to complete, but the node is reset anyway. If I do "ps" on the master I
> > don't see the job (actually, it says it has not used up any time), nor
> > does it appear in "top". Is it possible that the node gets TOO busy with
> > the computation? I append two files: the program itself (waster.cpp)
> > and its output (junk). The command line to run the program was:
> >
> >     bpsh 1 -I /dev/null ./waster >& junk &
> >
> > The program ends with:
> >
> >     [1]  Exit 255    bpsh 1 -I /dev/null ./waster >& junk
> >
> > and the node is reset. From the /var/log/messages file I get:
> >
> >     Mar  4 11:29:57 racaille bpmaster: ping timeout on slave 1
> >
> > It looks like the node is too busy computing to even respond to pings...
>
> That shouldn't be possible, but maybe something is going wrong with
> priorities. The slave daemon is supposed to run with an elevated
> priority to avoid exactly these starvation issues. I saw this kind of
> behavior once, when somebody decided to start 1500 cpu-intensive
> processes on a slave node. In any case, sharing with one other process
> shouldn't be a problem.
> A possible problem could come up if the slave daemon failed to reset
> priorities for the child processes it created. I'm not seeing this
> problem on our systems here, so I suspect that that's not it.
>
> First, a few questions:
>
>   What kernel version?
>   Are you starting with a kernel.org kernel?
>   Any patches other than BProc?
>   How many cpus?

The kernel is 2.4.18-27.7hdbp.ux.0. It is basically a stock kernel that
has been patched for BProc and made to run on the UX (ruffian) board.
These are single-cpu machines, with an EV56 at 600 MHz. Here is the list
of packages installed:

    beoboot-cm.1.5-hddcs.2.alpha.rpm
    beoboot-modules-cm.1.5-22.4.18_27.7hdbp.ux.0.alpha.rpm
    beonss-1.0.12-lanl.2.1.alpha.rpm
    bjs-1.2-hd.3.alpha.rpm
    bproc-3.2.6-hddcs.2.alpha.rpm
    bproc-devel-3.2.6-hddcs.2.alpha.rpm
    bproc-libs-3.2.6-hddcs.2.alpha.rpm
    bproc-modules-3.2.6-2.k2.4.18_27.7hdbp.ux.0.alpha.rpm
    cmtools-1.1-1.alpha.rpm
    cmtools-devel-1.1-1.alpha.rpm
    kernel-2.4.18-27.7hdbp.ux.0.alpha.rpm
    kernel-beoboot-2.4.18-27.7hdbp.ux.0.alpha.rpm
    kernel-doc-2.4.18-27.7hdbp.ux.0.alpha.rpm
    kernel-source-2.4.18-27.7hdbp.ux.0.alpha.rpm
    supermon-1.4-hddcs.3.alpha.rpm
    supermon-modules-1.4-3.k2.4.18_27.7hdbp.ux.0.alpha.rpm

It was built by the good folks at HardData (Michal Jaegermann).

> Anyway, a few things you can do to try to figure out what's going
> on... (I tried this on our alphas real quick and it doesn't seem to
> be happening here.)
>
> - Comment out this stuff in the slave daemon:
>
>       /* bump our priority to RT to avoid getting hosed by errant
>        * stuff that gets run on our node */
>       p.sched_priority = 1;
>       if (sched_setscheduler(0, SCHED_FIFO, &p))
>           syslog(LOG_NOTICE, "Failed to set real-time scheduling for"
>                  " slave daemon.\n");
>
>   and rebuild the slave daemon. (Also reinstall libbpslave.a and
>   rebuild the phase 2 boot image if you're using the rest of
>   clustermatic.)

I will try, if I get some time...

> - bpsh other stuff (e.g. uptime) to the node while this job is
>   running but before the node dies. Is it responsive or is it just
>   completely dead? Is it really slow?

Nothing else runs. "bpsh 1 uptime" just hangs. The process table on the
master does not get updated, and the waster does not appear in "top" at
all. The node dies (is killed, actually, because it times out), although
the waster job manages to finish. I guess even bpctl does not get
through...

> - Run something else alongside the waster that does something like:
>
>       while (1) { printf("hi\n"); fflush(stdout); sleep(1); }
>
>   Does it get starved and stop printing?

In this case both jobs seem to run, but the node dies anyway, since at
least the network stuff does not respond. Even when the job that just
prints "hi" is running alone on the node, I don't see it updating the
process table. You'd think it would work, if only because it spends most
of its time sleep()ing...

> - Finally, if everything seems to be working, but slowly or something
>   like that, you can raise the ping timeout by adding a line like
>   this to your /etc/beowulf/config:
>
>       pingtimeout 120
>
>   120 is the timeout in seconds. The default is 30.

Not a solution yet... Could it be something to do with the network
driver? I am using eepro100 cards. Here is the list of loaded modules on
the nodes:

    racaille:dgruner{116}> bpsh 1 /sbin/lsmod
    Module                  Size  Used by    Not tainted
    vfat                   18816   0  (unused)
    fat                    52176   0  [vfat]
    ext3                   94152   0  (unused)
    jbd                    74840   0  [ext3]
    nfs                   140328   2
    lockd                  84488   0  [nfs]
    sunrpc                110736   1  [nfs lockd]
    sym53c8xx_2           101438   0  (unused)
    de4x5                  66452   0  (unused)
    eepro100               29688   1
    bproc                  99736   2
    vmadump                19320   0  [bproc]

Daniel

--
Dr. Daniel Gruner                        dg...@ti...
Dept. of Chemistry                       dan...@ut...
University of Toronto                    phone: (416)-978-8689
80 St. George Street                     fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada              finger for PGP public key