From: Nicholas H. <he...@se...> - 2003-08-29 15:54:03
On Fri, 2003-08-29 at 11:21, er...@he... wrote:
> On Thu, Aug 21, 2003 at 01:10:47PM -0400, Nicholas Henke wrote:
> > Hey Erik,
> > I am running the 2.4.18 with bproc 3.2.5 (and an additional patch) here.
> >
> > Here is a link to the syslog from the head node -- it seems we get a ton
> > of bproc: connect errors. There are also unknown FD errors -- below is a
> > sampling of what we are seeing.
> >
> > Is this some sort of odd network problem, or is this a bproc problem?
>
> Looks like maybe half and half to me - or rather two problems at once.
>
> > Aug 20 03:22:44 alpha kernel: bproc: connect: connect to 192.168.0.36:34642 failed; errno=11
> > Aug 20 03:22:44 alpha kernel: bproc: connect: connect to 192.168.0.57:37692 failed; errno=11
> > Aug 20 03:22:44 alpha kernel: bproc: connect: connect to 192.168.0.35:34859 failed; errno=11
>
> According to the connect man page, this might be caused by:
>
> EAGAIN No more free local ports or insufficient entries in the routing
>        cache. For PF_INET see the net.ipv4.ip_local_port_range sysctl
>        in ip(7) on how to increase the number of local ports.
>
> Are there tons of open or recently closed (in TIME_WAIT) connections
> on that machine? (check netstat -t) You might try expanding the local
> port range (it's pretty small by default) and see if the problem goes
> away. I often find myself having to do that if anything which makes a
> lot of connections fast runs for a long time.

Will do -- we have not seen this since I emailed you, but the load on
the cluster is lower now as well. *sigh* I guess if those pesky users
would quit running jobs.. :)

> > Aug 20 16:52:17 alpha /usr/sbin/bpmaster: read(slave): Connection reset by peer
> > Aug 20 16:52:17 alpha /usr/sbin/bpmaster: lost connection to slave 80
> > Aug 20 16:52:17 alpha /usr/sbin/bpmaster: read(slave): Connection reset by peer
> > Aug 20 16:52:17 alpha /usr/sbin/bpmaster: lost connection to slave 98
> > Aug 20 16:52:17 alpha /usr/sbin/bpmaster: replacing slave 83
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 37 (192.168.0.47)): Connection reset by peer
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 113 (192.168.0.123)): Connection reset by peer
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 81 (192.168.0.91)): Connection reset by peer
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 66 (192.168.0.76)): Connection reset by peer
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 27 (192.168.0.37)): Connection reset by peer
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: Internal error: unknown FD in write set: 120
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 38 (192.168.0.48)): Connection reset by peer
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: Internal error: unknown FD in read set: 6
> > Aug 20 16:52:18 alpha /usr/sbin/bpmaster: write(slave 53 (192.168.0.63)): Connection reset by peer
>
> Connection reset by peer means the TCP socket went away for some
> reason. You'd probably have to look at tcpdump to really figure out
> who's responsible for the connection going away.

Hrm - fun. I will take a whack at that.

> As far as the unknown FD stuff, that sounds like a screw-up on the
> master daemon's part. I think the unknown FD in the read set is a
> benign error caused by "fd_rmap" getting cleared as part of the write.
> It can be safely ignored (although it obviously shouldn't complain so
> much).

Whiny little bugger, eh? I knew keeping all of those syslogs was a bad
idea.

> The error in the write set is a bit more puzzling. It could be a
> similar sort of problem. Are you sending SIGHUP to bpmaster at any
> point? It looks like there might be a little flub in the
> reconfiguration code that would leave a stray bit in the write set.
> Here's a little patch to fix that:
>
> diff -u -r1.130 master.c
> --- daemons/master.c 30 May 2003 02:52:51 -0000 1.130
> +++ daemons/master.c 29 Aug 2003 15:05:34 -0000
> @@ -759,6 +759,7 @@
>
>      for (i=0; i < c->if_list_size; i++) {
>          FD_CLR(c->if_list[i].fd, rset_in);
> +        FD_CLR(c->if_list[i].fd, wset_in);
>          fd_rmap[c->if_list[i].fd] = 0;
>          close(c->if_list[i].fd);
>      }
>
> - Erik

I will add that as well -- thanks for the reply on this :)

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
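
For anyone reading this thread later, here is a minimal sketch of why the one-line patch above matters. It is not bpmaster's code: fd_known (a stand-in for fd_rmap), track_fd, untrack_fd, count_stale and MAX_FDS are all hypothetical names made up for illustration, and no real sockets or select() calls are involved. It only shows that clearing an fd from the read set but not the write set leaves a bit behind that no longer maps to anything, which is the kind of state that could plausibly produce "unknown FD in write set".

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical stand-in for bpmaster's fd_rmap: nonzero means "fd is known". */
#define MAX_FDS 64
static int fd_known[MAX_FDS];

/* Register an fd in both long-lived interest sets and in the map. */
static void track_fd(int fd, fd_set *rset_in, fd_set *wset_in)
{
    assert(fd >= 0 && fd < MAX_FDS);
    FD_SET(fd, rset_in);
    FD_SET(fd, wset_in);
    fd_known[fd] = 1;
}

/* Forget an fd. With clear_write == 0 this mimics the pre-patch teardown,
 * which cleared only the read set; with clear_write != 0 it mimics the
 * patched version that clears both sets. */
static void untrack_fd(int fd, fd_set *rset_in, fd_set *wset_in,
                       int clear_write)
{
    assert(fd >= 0 && fd < MAX_FDS);
    FD_CLR(fd, rset_in);
    if (clear_write)
        FD_CLR(fd, wset_in);
    fd_known[fd] = 0;
}

/* Count fds whose bit is still set in `set` but which the map no longer
 * knows about, i.e. the stray bits. */
static int count_stale(fd_set *set)
{
    int fd, stale = 0;

    for (fd = 0; fd < MAX_FDS; fd++)
        if (FD_ISSET(fd, set) && !fd_known[fd])
            stale++;
    return stale;
}
```

To try it, zero two fd_set variables with FD_ZERO, call track_fd on some small fd number, then untrack_fd with clear_write set to 0 and to 1 in turn, and compare what count_stale reports for the write set in each case.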