From: <gor...@ph...> - 2002-11-11 19:30:30
Twice in the past 30 days, bpmaster has lost its connections to most of the slave nodes, all at the same instant. On October 20 we were running BProc 3.1.9; last night we were running BProc 3.2.2. On Oct 20 some of the nodes had bproc jobs running; last night no bproc jobs were running. On Oct 20 all but 2 of 102 node connections were lost; last night all but 5 were lost.

I suspected a network outage, but in neither instance did the client nodes log any complaints about NFS services being unavailable. I'm at a loss as to how to trace this problem further, since it doesn't happen often enough to justify running bpmaster with debugging turned on. Any suggestions would be appreciated.

Sample log messages are:

Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0
Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8
Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 11
Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 13
Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 21
Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 22
. . .
Nov 4 09:17:36 lxsrvr kernel: Kernel command line: auto BOOT_IMAGE=k2419-bp322 ro root=301 BOOT_FILE=/boot/bzImage-2419-bp322
Nov 4 09:17:46 lxsrvr kernel: bproc: Beowulf Distributed Process Space Version 3.2.2
Nov 4 09:17:46 lxsrvr kernel: bproc: (C) 1999-2002 Erik Hendriks <er...@he...>
Nov 4 09:17:46 lxsrvr bproc: succeeded
Nov 4 09:17:46 lxsrvr bpmaster: machine contains 103 nodes
Nov 4 09:17:46 lxsrvr bpmaster: IO daemon started; pid=2301
Nov 4 09:17:46 lxsrvr bproc: bpmaster startup succeeded
Nov 4 09:23:59 lxsrvr bpmaster: Setting status of node 0 to 4.
Nov 4 09:23:59 lxsrvr in.rshd[2767]: root@lxsrvr0 as root: cmd ='/usr/sbin/bpctl -S `bpstat | awk '/162.86.130.26/ { print $1 }'` -u any -s up'
Nov 4 09:45:04 lxsrvr bpmaster: Setting status of node 1 to 4.
. . .
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 10
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 14
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 40
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 0
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 1
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 2
Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 3
. . .

On both occasions, cron.daily was coming to the end of running "slocate" on the head node. However, slocate runs most nights without problems. Here is a record of the cron jobs run last night:

Mon Nov 11 04:02:00 EST 2002 Start =======================
Mon Nov 11 04:02:00 EST 2002 /etc/cron.daily/00-logwatch
Mon Nov 11 04:02:01 EST 2002 /etc/cron.daily/0anacron
Mon Nov 11 04:02:01 EST 2002 /etc/cron.daily/cleanscratch.cron
Mon Nov 11 04:02:01 EST 2002 /etc/cron.daily/hwclock.sh
Mon Nov 11 04:02:03 EST 2002 /etc/cron.daily/logrotate
Mon Nov 11 04:02:03 EST 2002 /etc/cron.daily/makewhatis.cron
Mon Nov 11 04:02:07 EST 2002 /etc/cron.daily/rpm
Mon Nov 11 04:02:08 EST 2002 /etc/cron.daily/slocate.cron
Mon Nov 11 04:14:56 EST 2002 /etc/cron.daily/sysstat
Mon Nov 11 04:14:56 EST 2002 /etc/cron.daily/tmpwatch
Mon Nov 11 04:14:56 EST 2002 /etc/cron.daily/~reblast
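
In case it helps anyone check their own logs for the same pattern, here is a minimal Python sketch that groups the "lost connection to slave N" messages by timestamp, so a burst of disconnects in the same second stands out. It is not part of BProc. The function name group_lost_connections and the regular expression are my own assumptions, based only on the syslog line format shown in the samples above.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the sample messages in this post:
#   Nov 11 04:14:41 lxsrvr bpmaster: lost connection to slave 10
LOST_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+bpmaster:\s+"
    r"lost connection to slave (?P<node>\d+)"
)


def group_lost_connections(lines):
    """Map each syslog timestamp to the slave numbers lost in that second."""
    events = defaultdict(list)
    for line in lines:
        match = LOST_RE.match(line.strip())
        if match:
            # Collapse runs of spaces so "Nov  4" and "Nov 4" compare equal.
            stamp = " ".join(match.group("stamp").split())
            events[stamp].append(int(match.group("node")))
    return dict(events)


# Sample input copied from the Oct 20 messages, plus one unrelated line
# that the pattern should ignore.
sample = [
    "Oct 20 04:13:21 lxsrvr bpmaster: lost connection to slave 0",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 8",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 11",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 13",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 21",
    "Oct 20 04:13:22 lxsrvr bpmaster: lost connection to slave 22",
    "Nov 4 09:17:46 lxsrvr bpmaster: machine contains 103 nodes",
]

for stamp, nodes in group_lost_connections(sample).items():
    print(stamp, len(nodes), "connections lost:", nodes)
```

To run it over a real log, pass an open file instead of the sample list, for example group_lost_connections(open(path)) with path pointing at wherever syslog writes on the head node. The per-second counts can then be compared against the cron.daily start times listed above.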