From: Alex B. <dv...@ac...> - 2006-03-24 00:53:44
A setup which "solves" the update issue while maintaining a level of HA
is to have 2 (or more) unicast send channels from each node to a pair
(or more) of gmond aggregators, and to have a multicast channel set up
between the aggregators themselves. The cost is more network traffic,
but it's pretty insignificant anyway, even on a 100Mb/s wire.

As for resending data whenever a new node appears on the multicast
channel, I haven't looked at the code (it's far too late at night to do
that now), but I hope that wasn't implemented the way you describe...
Think of a cluster with a few thousand nodes on one channel (probably
not the best idea) - a new node shows up and everyone starts coughing
their data onto the wire. Add a really low dmax value and a node
rebooting every few minutes (or a bad wire/switch/NIC) and you have a
lovely little mess.

Alex

Jason A. Smith wrote:
> On Thu, 2006-03-23 at 15:47 -0800, Chuck Simmons wrote:
>
>> Alex --
>>
>> Thanks for the details. Telnetting to a gmond XML port to dump
>> internal state is a nice debugging technique.
>>
>> One of my problems is that I'm running a secondary daemon using the
>> gmetric subroutine libraries, and it took me a while to realize that
>> that daemon is in some ways equivalent to 'gmond'. In particular, I
>> have to restart it in addition to 'gmond'. The problem was
>> immediately obvious once I used the telnet trick you mentioned.
>
> Metrics also have a dmax attribute that should force their removal
> from memory once expired, but I don't remember if this is actually
> implemented or not.
>
>> So for the missing cpu data issue... Let me write down what's
>> happening, real slowly, to make sure I understand. I'm running a
>> multicast gmond on each cluster to aggregate data, implying that
>> each node of the cluster eventually aggregates data about all other
>> nodes of the same cluster. I'm using a centralized gmetad to pull
>> data from a node of each cluster. Presumably 'gmetad' doesn't really
>> remember a whole lot about the outlying nodes.
>
> I am not really sure what you mean here, but gmetad basically keeps
> info about all nodes in each cluster in memory, similar to how gmond
> keeps info about all nodes in its cluster in memory. Just like gmond,
> gmetad also respects the dmax attributes. If you don't have dmax set,
> or don't want to wait that long, then you will have to restart gmetad
> as well.
>
>> I go out to the cluster and kill gmond on each node. Then I go
>> through the nodes and start gmond back up on each node. As each node
>> starts, it broadcasts its number of cpus throughout the cluster.
>> Thus, when I'm done restarting, one of the nodes (the first to
>> restart) knows how many cpus each node has, but nodes that were
>> restarted last don't have complete state information.
>
> Not exactly true, see below.
>
>> When I then restart 'gmetad' at the central location, it connects
>> to one of the nodes in the cluster, and if that node doesn't have
>> full state information, gmetad incorrectly reports the number of
>> cpus in the cluster. [Since I am using a background process that
>> gathers metrics separately from 'gmond' relatively frequently, this
>> background process is probably causing all nodes in the cluster to
>> know about all of the hosts in the cluster, if not all of the
>> metrics of all of the hosts in the cluster.]
>> This will eventually correct itself since all metrics are
>> periodically rebroadcast.
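> (Side note, since that background daemon of yours speaks the gmetric
> protocol: you can attach a dmax to each metric you inject, so it
> expires on its own instead of lingering forever. The command-line
> equivalent would be something like this - metric name and values are
> made up for illustration:
>
>     gmetric --name=my_metric --value=42 --type=uint32 \
>             --units=widgets --dmax=120
>
> With that, any gmond holding the metric should drop it 120 seconds
> after your daemon stops resending it - assuming the dmax expiry is
> actually implemented, per my caveat above.)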
>> Possible alternate fixes may include:
>>
>> (1) When a node receives a broadcast from another node that it
>> hasn't seen before, it may want to send its data back to the first
>> node. If I start node A and it broadcasts to an empty cluster, then
>> I start node B and it broadcasts to A, then it might be nice if node
>> A sends data back to B, because it can reasonably infer that B
>> doesn't have A's state and that B should have A's state.
>
> I haven't checked the gmond sources lately, but this is exactly what
> it was designed to do. Anytime gmond sees data from a new node that
> it hasn't seen before, it assumes that node doesn't know anything
> about it either, and sends a complete set of its own metrics out on
> the multicast address. This can actually cause part of the problem,
> especially if you restart gmond on a lot of nodes all at the same
> time, basically because multicast is udp based and therefore does not
> have guaranteed packet delivery. I think during this burst of udp
> metrics from many nodes, some get lost and you will just have to wait
> till they are resent later.
>
>> (2) Maybe daemons that gather metrics should not directly broadcast
>> them throughout a cluster. Instead, the metrics should be
>> accumulated within a central daemon and then be broadcast. (In other
>> words, treat 'gmond' as having two separate components: a metrics
>> gathering component and a metric/cluster aggregation component. Then
>> both the metrics component of 'gmond' and the metrics that I am
>> gathering should be handed to the aggregation component.) [This is
>> probably not useful without also implementing (1) above.]
>>
>> (3) Alex implies that there may be alternate ways to configure a
>> cluster without using multicasting which may handle some or all
>> aspects of this problem.
>
> You can configure gmond to use unicast if you don't need or care
> about the HA feature that multicast gives you (a config sketch for
> this follows below).
>
>> [We can treat each node as maintaining a list of metrics and their
>> current values, and broadcasting deltas to that list on a periodic
>> basis. In the current system, it is possible to receive a delta
>> without having the background data to which the delta applies.
>> Multiple daemons each spitting out deltas to their own metrics is
>> compatible with the current model. However, we may want to have all
>> the background data in a single list; we may also want each node to
>> know which metric-gathering daemons exist so that we can better
>> report when one of the metric-gathering daemons dies.]
>>
>> Moving on to the issue of correcting configuration problems. While
>> we can say that having a timeout is the way to correct configuration
>> issues, this is not necessarily the best implementation. Part of my
>> problem is that I have multiple daemons that gather and broadcast
>> metrics. If we address parts of that as discussed above, then it
>> becomes easier to fix the broadcast address by just resetting a
>> single daemon.
>
> There was a plan to provide a plugin architecture for writing custom
> metrics in ganglia, I am not sure what happened to that though.
>
>> So, at the current time, we can configure the system in a couple of
>> ways. We can configure the system so that a host is considered
>> removed from a cluster when the host has been down sufficiently
>> long, or we can manually remove the host from the cluster by
>> restarting all gmond daemons in the cluster.
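> (The unicast sketch promised above: the send side is just a
> udp_send_channel block in gmond.conf pointing at a collector host
> instead of a multicast group. Roughly - hostname and port made up:
>
>     udp_send_channel {
>       host = collector.example.com
>       port = 8649
>     }
>
> with a matching udp_recv_channel and tcp_accept_channel on the
> collector, and no multicast anywhere. Adding a second
> udp_send_channel block pointing at a second collector buys back some
> of the redundancy you lose by dropping multicast.)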
>> Possible alternate approaches might include providing a command
>> that could be sent to a 'gmond' daemon in a cluster to remove a host
>> from the cluster. It may be that there already exist mechanisms to
>> restart all gmond daemons in a cluster, but this mechanism is not
>> integrated into ganglia.
>>
>> So, thanks, I think I now understand what's going on.
>>
>> Cheers, Chuck
>>
>>
>> Alex Balk wrote:
>>
>>> Hi Chuck,
>>>
>>> See below...
>>>
>>> Chuck Simmons wrote:
>>>
>>>> The number of cpus does get sorted out, but I don't believe that
>>>> restarting 'gmond' is a solution. The problem occurs after
>>>> restarting a number of 'gmond' processes, and the problem is
>>>> caused because 'gmond' is not reporting the information. Does
>>>> 'gmond' maintain a timestamp on disk as to when it last reported
>>>> the number of cpus and insist on waiting sufficiently long to
>>>> report again? Does the collective distributed memory of the
>>>> system remember when the number of cpus was last reported but not
>>>> remember what the last reported value was? Is there any chance
>>>> that anyone can give me hints to how the code works without me
>>>> having to read the code and reverse engineer the intent?
>>>
>>> The reporting interval for the number of CPUs is defined within
>>> /etc/gmond.conf. For example:
>>>
>>> collection_group {
>>>   collect_once = yes
>>>   time_threshold = 1800
>>>   metric {
>>>     name = "cpu_num"
>>>   }
>>> }
>>>
>>> The above defines that the number of CPUs is collected once at the
>>> startup of gmond and reported every 1800 seconds.
>>>
>>> Your problem occurs because gmond doesn't save any data on disk,
>>> but rather in memory. This means that if you're using a single
>>> gmond aggregator (in unicast mode) and that aggregator gets
>>> restarted, it will not receive another report of the number of
>>> CPUs until 1800 seconds have elapsed since the previous report.
>>>
>>> The case of multicast is a more interesting one, since every node
>>> holds data for all nodes on the multicast channel. The question
>>> here is whether an update with a newer timestamp overrides all
>>> previous XML data for the host. I don't think that's the case; it
>>> seems more likely that only existing data is overwritten... but
>>> then, I don't use multicast, so you may qualify this answer as
>>> throwing useless, obvious crap your way.
>>>
>>> Generally speaking, there are 2 cases when a host reports a metric
>>> via its send_channel:
>>>
>>> 1. When a time_threshold expires.
>>> 2. When a value_threshold is exceeded.
>>>
>>> You're welcome to read the code for more insight, but a simple
>>> telnet to a predefined TCP channel would probably be quicker. You
>>> could just look at the XML data and compare pre-update and
>>> post-update values (yes, you'll need to take note of the
>>> timestamps - again, in the XML).
>>>
>>>> I understand that I can group nodes via /etc/gmond.conf. The
>>>> question is, once I have screwed up the configuration, how do I
>>>> recover from that screw up? I have restarted various gmetad's and
>>>> various gmond's. The grouping is still incorrect. Exactly which
>>>> gmetad's and gmond's do I have to shut down, and when? And,
>>>> again, my real question is about understanding how the code
>>>> works -- how the distributed memory works.
>>>
>>> As far as I know, you cannot recover from a configuration error
>>> unless you've made sure host_dmax was set to a fairly small,
>>> non-zero value.
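>>> For example, something like this in the globals section of
>>> gmond.conf (the value here is made up; pick whatever suits your
>>> churn rate):
>>>
>>> globals {
>>>   host_dmax = 120   /* seconds; forget a silent host after 2 min */
>>> }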
>>>
>>> From the docs:
>>>
>>> The host_dmax value is an integer with units in seconds. When set
>>> to zero (0), gmond will never delete a host from its list even
>>> when a remote host has stopped responding. If host_dmax is set to
>>> a positive number then gmond will flush a host after it has not
>>> heard from it for host_dmax seconds. By the way, dmax means
>>> ``delete max''.
>>>
>>> This way, once a host's configuration is modified to point at a
>>> different send channel, the aggregator(s) on its previous channel
>>> will forget about its existence once host_dmax expires.
>>>
>>> Personally, I don't use multicast for various reasons, the main
>>> one actually being its main advantage - every node keeps data on
>>> the entire cluster. While this provides for maximal high
>>> availability, it also has a bigger memory footprint, especially
>>> when you have a few thousand nodes.
>>>
>>>> I'd much rather be ignored than have people try to pawn off
>>>> facile answers on me.
>>>
>>> I'd provide you with more information on a possible setup which
>>> balances high availability with performance, but I wouldn't want
>>> to overflow you with useless data any more than I've done so far.
>>> Let me know if you'd like more information.
>>>
>>> Cheers,
>>> Alex
>>>
>>>> Cheers, Chuck
>>>>
>>>>
>>>> Bernard Li wrote:
>>>>
>>>>> Hi Chuck:
>>>>>
>>>>> For the first issue - give it time, it should sort itself out.
>>>>> Alternatively, you can find out which node is reporting
>>>>> incorrect information, and restart gmond on it.
>>>>>
>>>>> For the second issue, you can group nodes in different
>>>>> data_source entries via the multicast port in /etc/gmond.conf.
>>>>> Use the same port # for nodes you want belonging to the same
>>>>> group.
>>>>>
>>>>> You'll need to restart gmetad and gmond for the new groupings to
>>>>> take effect.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Bernard
>>>>>
>>>>> ------------------------------------------------------------
>>>>> *From:* gan...@li... on behalf of Chuck Simmons
>>>>> *Sent:* Wed 22/03/2006 17:54
>>>>> *To:* gan...@li...
>>>>> *Subject:* [Ganglia-developers] reorganizing clusters
>>>>>
>>>>> I need help understanding two things.
>>>>>
>>>>> I currently have a grid. One of the clusters in the grid is
>>>>> named "staiu", and the "grid" level web page reports that this
>>>>> has 8 hosts containing 4 cpus. In actuality, it has 8 hosts each
>>>>> containing 4 cpus, but apparently the hosts are not reporting
>>>>> the current number of cpus to the front end. Why not? I recently
>>>>> restarted gmond on each of the 8 hosts.
>>>>>
>>>>> Another cluster is named "staqp05-08", and the "grid" level web
>>>>> page reports that this has 12 hosts. In actual fact, it only has
>>>>> 4 hosts. The extra 8 hosts are the 8 hosts of the 'staiu'
>>>>> cluster. On the cluster level page for staqp05-08, the "choose a
>>>>> node" pull-down menu lists the 8 staiu hosts, the "hosts up"
>>>>> number includes the staiu hosts, and there are undrawn graphs
>>>>> for each of the staiu hosts in the "load one" section. What do I
>>>>> have to do so that the web pages or gmond daemons or whatever
>>>>> won't think that the staqp cluster contains the staiu hosts?
>>>>> What is the specific mechanism that causes this association to
>>>>> persist despite having shut down all staqp gmond daemons, and
>>>>> both the gmond and gmetad daemons on the web server,
>>>>> simultaneously, and then starting up that collection of daemons?
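>>>>>
>>>>> [In case it matters: as far as I know, all of these gmond.conf
>>>>> files still carry the stock multicast channel, i.e. something
>>>>> like:
>>>>>
>>>>> udp_send_channel {
>>>>>   mcast_join = 239.2.11.71   /* stock defaults, as I recall */
>>>>>   port = 8649
>>>>> }
>>>>>
>>>>> so every host, in either cluster, would be talking on the same
>>>>> address/port pair.]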
>>>>>
>>>>> Thanks, Chuck
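>>>>>
>>>>> P.S. In case it helps, the gmetad side of this is just two
>>>>> data_source lines, roughly like the following (node names and
>>>>> polling interval approximated from memory):
>>>>>
>>>>> # gmetad.conf - hostnames and interval are approximations
>>>>> data_source "staiu" 15 staiu01:8649
>>>>> data_source "staqp05-08" 15 staqp05:8649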