From: <er...@he...> - 2003-08-29 15:39:43
On Thu, Aug 21, 2003 at 01:10:47PM -0400, Nicholas Henke wrote:
> Hey Erik,
> I am running the 2.4.18 with bproc 3.2.5 ( and additional patch ) here.
>
> Here is a link to the syslog from the head node -- it seems we get a ton
> of bproc: connect errors. There are also unknown FD errors -- below is a
> sampling of what we are seeing.
>
> Is this some sort of odd network problem, or is this a bproc problem?

Looks like maybe half and half to me - or rather two problems at once.

> Aug 20 03:22:44 alpha kernel: bproc: connect: connect to
> 192.168.0.36:34642 failed; errno=11
> Aug 20 03:22:44 alpha kernel: bproc: connect: connect to
> 192.168.0.57:37692 failed; errno=11
> Aug 20 03:22:44 alpha kernel: bproc: connect: connect to
> 192.168.0.35:34859 failed; errno=11

errno 11 is EAGAIN on Linux.  According to the connect man page, this
might be caused by:

       EAGAIN No more free local ports or insufficient entries in the
              routing cache.  For PF_INET see the
              net.ipv4.ip_local_port_range sysctl in ip(7) on how to
              increase the number of local ports.

Are there tons of open or recently closed (in TIME_WAIT) connections on
that machine?  (Check with netstat -t.)  You might try expanding the
local port range (it's pretty small by default) and see if the problem
goes away.  I often find myself having to do that when something that
makes a lot of connections quickly runs for a long time.

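To get a feel for how small the local port range is on a given machine,
something like the following sketch can help.  It reads the two numbers
exposed in /proc/sys/net/ipv4/ip_local_port_range and reports how many
local ports the range covers.  The function names are made up for this
example and are not part of bproc; the sample ranges mentioned in the
comments are only illustrative values.

```c
#include <stdio.h>

/* Hypothetical helper: parse the "low high" pair as it appears in
 * /proc/sys/net/ipv4/ip_local_port_range and return the number of
 * local ports in that range, or -1 if the text cannot be parsed. */
int port_range_size(const char *text)
{
    int low, high;

    if (sscanf(text, "%d %d", &low, &high) != 2)
        return -1;
    if (low < 0 || high < low)
        return -1;
    return high - low + 1;   /* the range is inclusive at both ends */
}

/* Hypothetical helper: read the live setting from procfs.
 * Returns -1 if the file is missing or unreadable. */
int current_port_range_size(void)
{
    char buf[64];
    int size = -1;
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        size = port_range_size(buf);
    fclose(f);
    return size;
}
```

Calling current_port_range_size() from a small main() and comparing the
result with the number of sockets that netstat -t shows in TIME_WAIT
gives a rough idea of whether port exhaustion is plausible.  A range of
"1024 4999", for instance, covers 3976 ports.
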
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: read(slave): Connection reset
> by peer
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: lost connection to slave 80
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: read(slave): Connection reset
> by peer
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: lost connection to slave 98
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: replacing slave 83
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 37
> (192.168.0.47)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 113
> (192.168.0.123)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 81
> (192.168.0.91)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 66
> (192.168.0.76)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 27
> (192.168.0.37)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: Internal error: unknown FD in
> write set: 120
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 38
> (192.168.0.48)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: Internal error: unknown FD in
> read set: 6
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 53
> (192.168.0.63)): Connection reset by peer

Connection reset by peer means the TCP socket went away for some
reason.  You'd probably have to look at tcpdump output to really figure
out who's responsible for the connection going away.

As far as the unknown FD stuff goes, that sounds like a screw-up on the
master daemon's part.  I think the unknown FD in the read set is a
benign error caused by "fd_rmap" getting cleared as part of the write.
It can be safely ignored (although it obviously shouldn't complain so
much).

The error in the write set is a bit more puzzling.  It could be a
similar sort of problem.  Are you sending SIGHUP to bpmaster at any
point?  It looks like there might be a little flub in the
reconfiguration code that would leave a stray bit in the write set.

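The stray-bit scenario described above can be sketched in isolation.
This is not bpmaster's code: struct fd_state, its owner[] table (a
stand-in for something like fd_rmap) and all function names are
hypothetical, chosen only to show how an fd that is dropped from the
read set and the map, but not from the write set, ends up looking like
an "unknown FD" afterwards.

```c
#include <string.h>
#include <sys/select.h>

#define MAX_FD 64   /* hypothetical table size for this sketch */

/* Hypothetical bookkeeping: the persistent select() sets plus a map
 * from fd to its owner (0 means "nobody knows this fd"). */
struct fd_state {
    fd_set rset;
    fd_set wset;
    int    owner[MAX_FD];
};

void fd_state_init(struct fd_state *s)
{
    FD_ZERO(&s->rset);
    FD_ZERO(&s->wset);
    memset(s->owner, 0, sizeof(s->owner));
}

/* Track an fd for reading, and optionally for writing as well. */
void fd_state_add(struct fd_state *s, int fd, int owner, int want_write)
{
    FD_SET(fd, &s->rset);
    if (want_write)
        FD_SET(fd, &s->wset);
    s->owner[fd] = owner;
}

/* Incomplete cleanup: the write-set bit is left behind. */
void fd_state_forget_buggy(struct fd_state *s, int fd)
{
    FD_CLR(fd, &s->rset);
    s->owner[fd] = 0;
}

/* Complete cleanup: clear the fd from both sets and from the map. */
void fd_state_forget(struct fd_state *s, int fd)
{
    FD_CLR(fd, &s->rset);
    FD_CLR(fd, &s->wset);
    s->owner[fd] = 0;
}

/* Count fds still present in either set that have no owner; these are
 * the ones a daemon would end up reporting as unknown. */
int fd_state_count_unknown(struct fd_state *s)
{
    int fd, n = 0;

    for (fd = 0; fd < MAX_FD; fd++) {
        if (s->owner[fd] == 0 &&
            (FD_ISSET(fd, &s->rset) || FD_ISSET(fd, &s->wset)))
            n++;
    }
    return n;
}
```

After fd_state_forget_buggy() on an fd that was registered for writing,
fd_state_count_unknown() reports one leftover; after fd_state_forget()
it reports none.
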
Here's a little patch to fix that:

diff -u -r1.130 master.c
--- daemons/master.c    30 May 2003 02:52:51 -0000      1.130
+++ daemons/master.c    29 Aug 2003 15:05:34 -0000
@@ -759,6 +759,7 @@
     for (i=0; i < c->if_list_size; i++) {
        FD_CLR(c->if_list[i].fd, rset_in);
+       FD_CLR(c->if_list[i].fd, wset_in);
        fd_rmap[c->if_list[i].fd] = 0;
        close(c->if_list[i].fd);
     }

- Erik