From: Kevin J. F. <kf...@uw...> - 2003-07-11 21:28:56
Thanks so much for your help, Steven. We fixed the problem - it turned out
that there was a configuration option in our switch's software that was
munging up multicast traffic. Ganglia is working fine now.

Thanks a lot!

Kevin Flasch

On Wed, 2 Jul 2003, steven wagner wrote:

> Kevin James Flasch wrote:
>
> >>* Check some of the gmond-only nodes' XML port output. How many nodes
> >>do they see? Do they see 289-295 nodes or just their own output?
> >
> > I believe you're referring to the mcast_port (by default 8649). When I telnet
> > to it, I see what appears to be all/most of them.
> > (`telnet localhost 8649 | grep "<HOST " | wc -l` gives me 300).
>
> wc -l's a good start but you should actually check each host's timestamp
> value. If the timestamps are fairly close to one another and close to
> NOW(), then you know that the monitoring core you're polling is
> receiving packets from all 300 hosts often enough for them to be
> considered "up" - the REPORTED attribute is updated every time any
> metric is received from a given host.
>
> > They are not all in the same subnet. There are two subnets that they reside
> > in.
> >
> > They are all physically connected to the same switch.
> >
> > There is no firewalling of the sort that blocks ports, drops packets on the
> > master. The idea that there is something wrong with the network connection
> > seems reasonable. I can't see anything outstanding about it, however, and
> > there have been no network problems with the connection otherwise.
>
> So far so good...
>
> >>* Consider polling a different set of monitoring cores as your gmetad
> >>cluster data source.
> >
> > I'm not sure I follow. Can you explain or give an example, please?
>
> Sure. gmetad has a configuration file, /etc/gmetad.conf by default,
> that specifies data sources. gmetad considers each of these data
> sources to be a different cluster. You can specify a polling frequency
> and a list of IP(:port) combos for each cluster.
> These will be checked from left to right.
>
> Example:
>
> data_source mycluster 15 10.0.0.2 10.0.0.3:2463 10.0.0.4 10.0.0.5
> data_source anothercluster 60 192.168.7.15
>
> In order to debug gmetad, it helps to "see what the killer sees" by
> telnetting to each of these sources in the same order from the node
> running the metadaemon. This should at least point you at the
> misbehaving monitoring core.
>
> It may well be that the local monitoring core on the front-end is the
> one that's misconfigured somehow.
>
> >>* Run a monitoring core in debug mode. You will see what metrics it's
> >>sending and what metrics it's hearing on the multicast channel.
> >
> > Hmm.. I'm not sure what the output of that should look like on a node in a
> > functioning ganglia environment. It seems like it's communicating somewhat
> > with the other nodes, but most of the entries seem to be about itself. One of
> > the entries mentioning another machine looks like this:
> >
> > Is that less data than typical?
>
> On a 300-node Ganglia cluster you should be seeing at least load average
> metrics being multicast from every node every 15-60 seconds, plus the
> various other metrics according to their thresholds. Regardless, you
> should see more than a packet every few seconds.
>
> In fact if you didn't find it necessary to redirect the debug output to
> a file, you're probably not getting all the packets. :)
>
> >>* tcpdump. Limit it to just the multicast IP or port and you should be
> >>able to get all Ganglia-related traffic that the running host can hear.
> >
> > That's what I did before to check the frequency of ganglia traffic. Most of
> > the traffic is the machine itself broadcasting 8 byte (occasionally 12 byte)
> > udp packets on the multicast channel. Once in a while an 8 byte udp packet
> > from another node will come on the multicast channel (after every 5-15
> > originating packets on the multicast channel).
>
> See above, you should be getting them more than once in a while. It
> would be interesting to check two monitoring cores to see if they're
> receiving one another's packets, what the ratio is of dropped packets to
> total packets sent, and if any of the packets that make it through have
> anything in common with one another. Might give you some clues if
> nothing else does.
>
> >>I know, it's not much, but it's something.
> >
> > Thanks so much for your help. I suppose this only makes me think that there is
> > some networking issue, hardware or software, but I have no idea what it is at
> > this point.
>
> Well, the only thing harder than troubleshooting your own hardware is
> troubleshooting someone else's. :)
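
For anyone who finds this thread later: the timestamp check Steven describes
(looking at each host's REPORTED value rather than just counting <HOST lines)
could be scripted roughly as below. This is only a sketch, not part of
Ganglia. The function names fetch_gmond_xml and stale_hosts and the 60-second
max_age cutoff are made up for illustration (60 is taken from the "every
15-60 seconds" figure above), and it assumes the monitoring core dumps its XML
on port 8649 and then closes the connection, as the telnet test in the thread
suggests.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=10.0):
    """Read the whole XML dump a monitoring core writes before it closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # empty read means the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_bytes, max_age=60, now=None):
    """Return (total_hosts, [(name, age_in_seconds), ...]) for hosts whose
    REPORTED timestamp is more than max_age seconds behind `now`.

    max_age=60 is an assumed cutoff, not a Ganglia default.
    """
    now = time.time() if now is None else now
    root = ET.fromstring(xml_bytes)
    total = 0
    stale = []
    for host in root.iter("HOST"):
        total += 1
        # A missing REPORTED attribute is treated as epoch 0, i.e. very stale.
        age = now - int(host.get("REPORTED", "0"))
        if age > max_age:
            stale.append((host.get("NAME"), age))
    return total, stale
```

Calling stale_hosts(fetch_gmond_xml("localhost")) on the gmetad node, and then
again with each address from the data_source line in left-to-right order,
should show which monitoring core has stopped hearing from most of the
cluster. A core that lists 300 hosts but reports nearly all of them as stale
would be consistent with the dropped multicast traffic described above.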