From: <gor...@ph...> - 2003-06-11 13:46:17
On a couple of occasions, bproc has abruptly lost a majority (but not all) of the connections to its slaves.

The first time this happened was with bproc 3.2.0 on kernel 2.4.19. None of the connections dropped at the time had jobs running on them. Only 4 jobs were running at the time.

The second time was with bproc version 3.2.5 on kernel 2.4.20. This time most, but not all, of the dropped connections had jobs running across them. There were about 75 jobs running. Connections to all but two nodes were dropped. Unlike the first occurrence, this one was preceded by error messages from NTP.

system log:

Jun 11 00:23:42 lxsrvr ntpd[559]: too many recvbufs allocated (40)
Jun 11 00:23:42 lxsrvr last message repeated 3 times
Jun 11 00:23:43 lxsrvr bpmaster: lost connection to slave 4
Jun 11 00:23:43 lxsrvr bpmaster: lost connection to slave 13
Jun 11 00:23:43 lxsrvr bpmaster: lost connection to slave 6

I have the usual suspects disabled under cron (makewhatis, locate.db).

Does anyone have any idea what's taking place here? Could this be some sort of network storm?

Goran