Re: [BProc] bpmaster: lost connection to slave


On Mon, Nov 11, 2002 at 02:31:00PM -0500, gor...@ph... wrote:
> 
> Twice in the past 30 days, bpmaster has lost connections to most of the
> slave nodes, at exactly the same instant.  On October 20 we were running
> BProc 3.1.9.  Last night we were running BProc 3.2.2.  On Oct 20, some of
> the nodes had bproc jobs running, last night there were no bproc jobs
> running.  On Oct 20th all but 2 of 102 node connections were lost, last
> night all but 5 connections were lost.
> 
> I suspected a network outage, but the client nodes didn't log any
> complaints about NFS services not being available, in either instance.  I'm
> at a loss as to how to further trace this problem, as it doesn't happen
> frequently enough to allow running bpmaster with debugging turned on.  Any
> suggestions would be appreciated.
> 
> Sample log messages are:
> 
> Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 11
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 13
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 21
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 22
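
To see how tightly the drops cluster, the log can be grouped by
timestamp.  This is only a rough sketch: the regular expression is my
guess at the syslog layout based on the sample lines above, and
group_lost is a made-up helper name, not anything shipped with BProc.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the sample messages:
#   "Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0"
# An optional "[pid]" after "bpmaster" is allowed in case syslog adds one.
LOST = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ "
    r"bpmaster(?:\[\d+\])?: lost connection to slave (\d+)"
)

def group_lost(lines):
    """Map each timestamp to the list of slave numbers lost at that second."""
    by_time = defaultdict(list)
    for line in lines:
        m = LOST.match(line.strip())
        if m:
            by_time[m.group(1)].append(int(m.group(2)))
    return dict(by_time)

sample = [
    "Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 11",
]
for stamp, slaves in group_lost(sample).items():
    print(stamp, len(slaves), slaves)
```

If nearly every node shows up within the same second or two, that points
at something on the front end rather than at individual slaves.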

Do the slave nodes log any messages like ping timeout?  This would be
something printed to the console if you're using beoboot.

I had a problem where a lot of system activity on the front end caused
a bunch of slaves to disappear at the same time.  I "solved" (worked
around, actually) this by turning up the ping timeout.  You can try
that by adding a line that looks like "pingtimeout ###" to
/etc/beowulf/config.  The default is about 30 seconds.  (Pings are sent
at 15 sec intervals; if there is no response by the next time to ping,
the remote end is deemed dead.)  Note that both the master and the
slaves send pings to each other.
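
To illustrate why a busy front end can take out nearly every slave at
once, here is a toy model of the timeout check.  It is not the bpmaster
code; the function names are invented for the example, and treating the
timeout as a simple "silent for longer than N seconds" test is a
simplifying assumption.

```python
def silent_too_long(last_heard, now, ping_timeout):
    """True if nothing has been heard from the remote end within ping_timeout."""
    return (now - last_heard) > ping_timeout

def lost_slaves(last_heard_by_slave, now, ping_timeout):
    """Return the slave numbers that would be declared dead at time `now`."""
    return sorted(
        node
        for node, last_heard in last_heard_by_slave.items()
        if silent_too_long(last_heard, now, ping_timeout)
    )

# Hypothetical scenario: the front end last serviced these slaves between
# t=0 and t=14 seconds, then stalled under load until t=45.
last_heard = {0: 0.0, 8: 5.0, 11: 14.0}

print(lost_slaves(last_heard, now=45.0, ping_timeout=30.0))  # [0, 8, 11]
print(lost_slaves(last_heard, now=45.0, ping_timeout=60.0))  # []
```

With a timeout of about 30 seconds, one stall longer than that drops
every slave together; a larger pingtimeout lets the same stall pass
unnoticed, which is why it is a workaround rather than a fix.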

- Erik