From: <er...@he...> - 2002-11-12 19:27:54
On Mon, Nov 11, 2002 at 02:31:00PM -0500, gor...@ph... wrote:
>
> Twice in the past 30 days, bpmaster has lost connections to most of the
> slave nodes, at exactly the same instant. On October 20 we were running
> BProc 3.1.9. Last night we were running BProc 3.2.2. On Oct 20, some of
> the nodes had bproc jobs running, last night there were no bproc jobs
> running. On Oct 20th all but 2 of 102 node connections were lost, last
> night all but 5 connections were lost.
>
> I suspected a network outage, but the client nodes didn't log any
> complaints about NFS services not being available, in either instance. I'm
> at a loss as to how to further trace this problem, as it doesn't happen
> frequently enough to allow running bpmaster with debugging turned on. Any
> suggestions would be appreciated.
>
> Sample log messages are:
>
> Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 11
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 13
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 21
> Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 22

Do the slave nodes log any messages like ping timeout? This would be
something printed to the console if you're using beoboot.

I had a problem where a lot of system activity on the front end caused
a bunch of slaves to disappear at the same time. I "solved" (worked
around, actually) this by turning up the ping timeout. You can try that
by adding a line that looks like "pingtimeout ###" to
/etc/beowulf/config. The default is about 30 seconds. (pings @ 15 sec
intervals - if no response by the next time to ping, then the remote end
is deemed dead.) Note that both the master and the slaves send pings to
each other.

- Erik
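
One way to check whether the drops really cluster in a single burst, as a simultaneous ping timeout would suggest, is to count the "lost connection" messages per syslog timestamp. The following is only a rough sketch and not part of BProc: the regular expression assumes the log format shown in the quoted sample, and the helper name lost_by_second is made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the quoted sample lines:
#   Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8
LOST = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+bpmaster: "
    r"lost connection to slave (?P<node>\d+)"
)

def lost_by_second(lines):
    """Count 'lost connection' messages per syslog timestamp (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = LOST.match(line)
        if m:
            counts[m.group("stamp")] += 1
    return counts

# The sample messages quoted in the original report.
sample = [
    "Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 11",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 13",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 21",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 22",
]

counts = lost_by_second(sample)
for stamp, n in sorted(counts.items()):
    print(stamp, n)
```

If nearly all of the drops fall within a second or two of each other, that fits the master being too busy to answer pings in time better than it fits nodes failing one by one. In practice the lines would come from the master's syslog file rather than a hard-coded list.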