Re: [Ganglia-general] nodes running gmond reporting incorrectly


Thanks so much for your help, Steven.

We fixed the problem - it turned out that there was a configuration option in
our switch's software that was munging up multicast traffic.

Ganglia is working fine now. Thanks a lot!

Kevin Flasch

On Wed, 2 Jul 2003, steven wagner wrote:

> Kevin James Flasch wrote:
> >>*  Check some of the gmond-only nodes' XML port output.  How many nodes
> >>do they see?  Do they see 289-295 nodes or just their own output?
> >
> >
> > I believe you're referring to the mcast_port (by default 8649). When I telnet
> > to it, I see what appears to be all/most of them.
> > (`telnet localhost 8649 | grep "<HOST " | wc -l`  gives me 300).
>
> wc -l's a good start but you should actually check each host's timestamp
> value.  If the timestamps are fairly close to one another and close to
> NOW(), then you know that the monitoring core you're polling is
> receiving packets from all 300 hosts often enough for them to be
> considered "up" - the REPORTED attribute is updated every time any
> metric is received from a given host.
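The per-host timestamp check described above could be scripted roughly as follows. This is a minimal sketch, assuming the XML dump contains `HOST` elements with `NAME` and `REPORTED` attributes, where `REPORTED` is in seconds since the epoch. The function names `fetch_xml` and `stale_hosts` and the 60-second cutoff are illustrative choices, not part of Ganglia.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_xml(host="localhost", port=8649, timeout=10.0):
    """Read the XML dump that gmond writes to a TCP client before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closed the connection: dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_bytes, max_age=60, now=None):
    """Return (name, age_in_seconds) for hosts whose REPORTED is older than max_age."""
    now = time.time() if now is None else now
    root = ET.fromstring(xml_bytes)
    stale = []
    for host in root.iter("HOST"):
        age = now - int(host.get("REPORTED"))
        if age > max_age:
            stale.append((host.get("NAME"), age))
    return stale
```

Running `stale_hosts(fetch_xml())` on one of the gmond-only nodes would list the hosts that node has not heard from recently; an empty list suggests it is hearing from everyone.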
>
> > They are not all in the same subnet. There are two subnets that they reside
> > in.
> >
> > They are all physically connected to the same switch.
> >
> > There is no firewalling of the sort that blocks ports or drops packets on the
> > master. The idea that there is something wrong with the network connection
> > seems reasonable. I can't see anything outstanding about it, however, and
> > there have been no network problems with the connection otherwise.
>
> So far so good...
>
> >>*  Consider polling a different set of monitoring cores as your gmetad
> >>cluster data source.
> >
> >
> > I'm not sure I follow. Can you explain or give an example, please?
>
> Sure.  gmetad has a configuration file, /etc/gmetad.conf by default,
> that specifies data sources.  gmetad considers each of these data
> sources to be a different cluster.  You can specify a polling frequency
> and a list of IP(:port) combos for each cluster.  These will be checked
> from left to right.
>
> Example:
>
> data_source mycluster 15 10.0.0.2 10.0.0.3:2463 10.0.0.4 10.0.0.5
> data_source anothercluster 60 192.168.7.15
>
> In order to debug gmetad, it helps to "see what the killer sees" by
> telnetting to each of these sources in the same order from the node
> running the metadaemon.  This should at least point you at the
> misbehaving monitoring core.
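The left-to-right check described above can be sketched in a few lines. This is only an illustration, assuming the simple `data_source <name> <interval> <host[:port]> ...` form shown in the example, with port 8649 used when none is given. The helper names `parse_data_source` and `first_responding` are hypothetical, and this is not gmetad's actual parser or failover logic.

```python
import socket


def parse_data_source(line, default_port=8649):
    """Split a data_source line into (name, interval, [(host, port), ...])."""
    fields = line.split()
    if len(fields) < 4 or fields[0] != "data_source":
        raise ValueError("expected: data_source <name> <interval> <host[:port]> ...")
    name, interval = fields[1], int(fields[2])
    sources = []
    for item in fields[3:]:
        host, _, port = item.partition(":")
        sources.append((host, int(port) if port else default_port))
    return name, interval, sources


def first_responding(sources, timeout=5.0):
    """Try each (host, port) left to right; return the first that accepts TCP."""
    for host, port in sources:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # unreachable or refused: move on to the next source
    return None
```

Calling `first_responding(parse_data_source(line)[2])` from the node running gmetad shows which monitoring core would answer first for that cluster, which is the one whose XML is worth inspecting.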
>
> It may well be that the local monitoring core on the front-end is the
> one that's misconfigured somehow.
>
> >>*  Run a monitoring core in debug mode.  You will see what metrics it's
> >>sending and what metrics it's hearing on the multicast channel.
> >
> >
> > Hmm.. I'm not sure what the output of that should look like on a node in a
> > functioning ganglia environment. It seems like it's communicating somewhat
> > with the other nodes, but most of the entries seem to be about itself. One of
> > the entries mentioning another machine looks like this:
> >
> > Is that less data than typical?
>
>
> On a 300-node Ganglia cluster you should be seeing at least load average
> metrics being multicast from every node every 15-60 seconds, plus the
> various other metrics according to their thresholds.  Regardless, you
> should see more than a packet every few seconds.
>
> In fact if you didn't find it necessary to redirect the debug output to
> a file, you're probably not getting all the packets.  :)
>
> >>*  tcpdump.  Limit it to just the multicast IP or port and you should be
> >>able to get all Ganglia-related traffic that the running host can hear.
> >
> >
> > That's what I did before to check the frequency of ganglia traffic. Most of
> > the traffic is the machine itself broadcasting 8-byte (occasionally 12-byte)
> > UDP packets on the multicast channel. Once in a while an 8-byte UDP packet
> > from another node will come in on the multicast channel (after every 5-15
> > originating packets on the multicast channel).
>
> See above, you should be getting them more than once in a while.  It
> would be interesting to check two monitoring cores to see if they're
> receiving one another's packets, what the ratio is of dropped packets to
> total packets sent, and if any of the packets that make it through have
> anything in common with one another.  Might give you some clues if
> nothing else does.
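One way to approach the two-node comparison suggested above is to count packets per sender on each node and compare the totals. The sketch below is a rough illustration rather than a Ganglia tool. `count_senders` and `drop_ratio` are hypothetical names, the multicast group and port must be taken from your own gmond configuration, and sharing the port with a running gmond is assumed to work via `SO_REUSEADDR`, which may not hold on every platform.

```python
import socket
import struct
import time
from collections import Counter


def count_senders(group, port, seconds=30.0):
    """Join a multicast group and count UDP packets per source address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Ask the kernel to deliver traffic for this group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(1.0)
    counts = Counter()
    deadline = time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            try:
                _, (addr, _src_port) = sock.recvfrom(65535)
            except socket.timeout:
                continue  # quiet second: keep waiting until the deadline
            counts[addr] += 1
    finally:
        sock.close()
    return counts


def drop_ratio(sent, received):
    """Fraction of packets sent by one node that the other node never saw."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return max(sent - received, 0) / sent
```

If node A counts 120 of its own packets over an interval (for example from tcpdump) and node B's `count_senders` result shows 12 from A over the same interval, `drop_ratio(120, 12)` puts the loss at 90 percent.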
>
> >>I know, it's not much, but it's something.
> >
> >
> > Thanks so much for your help. I suppose this only makes me think that there is
> > some networking issue, hardware or software, but I have no idea what it is at
> > this point.
>
> Well, the only thing harder than troubleshooting your own hardware is
> troubleshooting someone else's.  :)
>
