From: Steve G. <SGi...@nv...> - 2004-06-23 21:36:53
Howdy Gangliati,

I'm having a strange problem that seems to be with multicast, but I'm not really sure. I had a very similar problem in the past and posted here about it. That turned out to be a problem on one of the network switches, but my network team insists that this is not the same issue.

I'm running Ganglia 2.5.4 (need to upgrade, I know) on about 16 different clusters/subnets of ~200 hosts each. Each subnet has a "control" host that runs gmond as well as named, ypserv, dhcpd, etc. I have a central monitoring host dedicated to running gmetad and the webfrontend, and it talks to the 16 different control nodes. Hope that makes sense. We've been running this way with no major trouble for quite a while.

I recently brought a new subnet/cluster online, and now I'm having trouble. The control box on this subnet seems to be isolated from the rest. On the control box, gstat --all only shows itself, not the rest of the subnet. The rest of the subnet sees everything except this control box. I've rebooted all the machines and restarted all the gmonds several times.

When you first start up gmond on the control box, it only sees itself. Then, some random amount of time later, it lists the other nodes in the subnet as dead. Similarly, the other hosts report the control box as dead. I can point my gmetad at a random node in the subnet, and that works fine. I just can't get the control box to be part of the cluster. So it seems to me that they do communicate at some point, at least enough to populate the dead list.

I've done tcpdumps looking for multicast traffic between the control box and the rest, but nothing ever shows up.

The control box is on a different physical network segment. The nodes are plugged into 48-port Cisco switches (100 Mb), and those have a GigE connection back to a big Cisco 6500. The control box has a direct GigE connection to the 6500. Same deal as with all our other subnets.

I'm no network whiz, but I've had our network team beating their heads against this, and they insist there is nothing wrong on their end. Anyone else have any ideas?

Thanks!

Steve Gilbert
Unix Systems Administrator
sgi...@nv...
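
For anyone who wants to repeat the multicast check described above without tcpdump, here is a minimal sketch of a listener that joins a multicast group and records which source addresses it hears from. It is an illustration, not part of the original setup: the group 239.2.11.71 and port 8649 are assumed to be the stock Ganglia defaults (change them if gmond.conf overrides the channel or port), and the helper names `membership_request` and `heard_sources` are made up for this example.

```python
import socket
import time

# Assumption: stock Ganglia multicast channel and port. Adjust these if the
# local gmond configuration uses different values.
MCAST_GROUP = "239.2.11.71"
MCAST_PORT = 8649


def membership_request(group, iface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq value passed to IP_ADD_MEMBERSHIP.

    iface_ip selects the local interface; 0.0.0.0 lets the kernel choose.
    """
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def heard_sources(group=MCAST_GROUP, port=MCAST_PORT,
                  iface_ip="0.0.0.0", duration=30.0):
    """Join the group and return the set of sender IPs seen within duration."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow sharing the port with another listener that also set this option.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface_ip))
    seen = set()
    deadline = time.monotonic() + duration
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                _data, (src_ip, _src_port) = sock.recvfrom(65535)
            except socket.timeout:
                break
            seen.add(src_ip)
    finally:
        sock.close()
    return seen


if __name__ == "__main__":
    for ip in sorted(heard_sources()):
        print(ip)
```

Running this on the control box and on one ordinary node, then comparing the two lists, shows which direction the multicast traffic is being lost in. If the bind fails because the port is busy, stopping gmond for the duration of the test is one workaround. On a multi-homed host, passing the address of the relevant interface as `iface_ip` makes sure the join goes out on the segment being tested.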