Re: [BProc] bproc: connect... error=11

On Thu, Aug 21, 2003 at 01:10:47PM -0400, Nicholas Henke wrote:
> Hey Erik,
> 	I am running the 2.4.18 with bproc 3.2.5 ( and additional patch ) here.
> 
> Here is a link to the syslog from the head node -- it seems we get a ton
> of bproc: connect errors. There are also unknown FD errors -- below is a
> sampling of what we are seeing.
> 
>  Is this some sort of odd network problem, or is this a bproc problem?

Looks like maybe half and half to me - or rather two problems at once.

> Aug 20 03:22:44 alpha kernel: bproc: connect: connect to
> 192.168.0.36:34642 failed; errno=11
> Aug 20 03:22:44 alpha kernel: bproc: connect: connect to
> 192.168.0.57:37692 failed; errno=11
> Aug 20 03:22:44 alpha kernel: bproc: connect: connect to
> 192.168.0.35:34859 failed; errno=11

According to the connect man page, this might be caused by:

EAGAIN No  more free local ports or insufficient entries in the routing
       cache. For PF_INET see the  net.ipv4.ip_local_port_range  sysctl
       in ip(7) on how to increase the number of local ports.

Are there tons of open or recently closed (in TIME_WAIT) connections
on that machine?  (check netstat -t)  You might try expanding the local
port range (it's pretty small by default) and see if the problem goes
away.  I often find myself having to do that when something that opens
a lot of connections quickly runs for a long time.
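
For what it's worth, here is a rough sketch of how the two checks above
could be done from C on a Linux box, assuming the usual text formats of
/proc/sys/net/ipv4/ip_local_port_range and /proc/net/tcp.  The helper
names (local_port_count, count_tcp_state) are made up for illustration
and are not part of bproc.

```c
/* Sketch only: parse the Linux /proc text that describes the local
 * (ephemeral) port range and the TCP socket table.  Helper names are
 * hypothetical. */
#include <assert.h>
#include <stdio.h>

#define TCP_STATE_TIME_WAIT 0x06  /* hex state code in /proc/net/tcp */

/* Parse "low high" as found in ip_local_port_range.
 * Returns the number of ports in the range, or -1 on malformed input. */
int local_port_count(const char *text)
{
    int lo, hi;

    assert(text != NULL);
    if (sscanf(text, "%d %d", &lo, &hi) != 2 || lo > hi)
        return -1;
    return hi - lo + 1;
}

/* Count sockets in state `wanted` in a stream laid out like
 * /proc/net/tcp.  The fourth whitespace-separated field of each data
 * line is the hex state; the header line fails to parse and is
 * skipped. */
int count_tcp_state(FILE *fp, unsigned wanted)
{
    char line[512];
    int n = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp)) {
        unsigned state;
        if (sscanf(line, " %*s %*s %*s %x", &state) == 1 && state == wanted)
            n++;
    }
    return n;
}
```

To use it, read the first line of the port-range file into a buffer
for local_port_count(), and pass fopen("/proc/net/tcp", "r") to
count_tcp_state() with TCP_STATE_TIME_WAIT.  If the second number is
anywhere near the first, running out of local ports is a plausible
explanation for the EAGAIN.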

> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: read(slave): Connection reset
> by peer
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: lost connection to slave 80
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: read(slave): Connection reset
> by peer
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: lost connection to slave 98
> Aug 20 16:52:17 alpha /usr/sbin/bpmaster: replacing slave 83
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 37
> (192.168.0.47)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 113
> (192.168.0.123)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 81
> (192.168.0.91)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 66
> (192.168.0.76)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 27
> (192.168.0.37)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: Internal error: unknown FD in
> write set: 120 
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 38
> (192.168.0.48)): Connection reset by peer
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: Internal error: unknown FD in
> read set: 6 
> Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 53
> (192.168.0.63)): Connection reset by peer

Connection reset by peer means the TCP connection was torn down by a
reset (RST) from the other end.  You'd probably have to look at
tcpdump output to really figure out who's responsible for the
connection going away.

As far as the unknown FD stuff, that sounds like a screw-up on the
master daemon's part.  I think the unknown FD in the read set is a
benign error caused by "fd_rmap" getting cleared as part of the write.
It can be safely ignored (although it obviously shouldn't complain so
much).

The error in the write set is a bit more puzzling.  It could be a
similar sort of problem.  Are you sending SIGHUP to bpmaster at any
point?  It looks like there might be a little flub in the
reconfiguration code that would leave a stray bit in the write set.
Here's a little patch to fix that:

diff -u -r1.130 master.c
--- daemons/master.c	30 May 2003 02:52:51 -0000	1.130
+++ daemons/master.c	29 Aug 2003 15:05:34 -0000
@@ -759,6 +759,7 @@
 
     for (i=0; i < c->if_list_size; i++) {
 	FD_CLR(c->if_list[i].fd, rset_in);
+	FD_CLR(c->if_list[i].fd, wset_in);
 	fd_rmap[c->if_list[i].fd] = 0;
 	close(c->if_list[i].fd);
     }
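
To spell out the idea behind the patch, here is a minimal standalone
sketch of the bookkeeping, not bpmaster's actual code.  The helper
forget_fd and the int array standing in for fd_rmap are assumptions
made for illustration.  The point is that a descriptor has to be
cleared from both select() sets before it is closed.  If it is left in
a set, a later select() either fails with EBADF or, once the number
has been reused, reports on an unrelated descriptor.

```c
/* Sketch only: clear a descriptor from every select() set and from the
 * reverse map before closing it.  Names are hypothetical. */
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

void forget_fd(int fd, fd_set *rset, fd_set *wset, int *rmap)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_CLR(fd, rset);
    FD_CLR(fd, wset);  /* the step the patch adds */
    rmap[fd] = 0;      /* stand-in for the fd_rmap entry */
    close(fd);
}
```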

- Erik