From: Alex B. <dv...@ac...> - 2006-03-24 00:53:44
A setup which "solves" the update issue while maintaining a level of HA
is to have 2 (or more) unicast send channels from each node to a pair
(or more) of gmond aggregators, and to have a multicast channel set up
between the aggregators themselves. The cost is more network traffic,
but it's pretty insignificant anyway, even on a 100Mb/s wire.

As for resending data whenever a new node appears on the multicast
channel, I haven't looked at the code (it's far too late at night to do
that now), but I hope that wasn't implemented the way you describe...
Think of a cluster with a few thousand nodes on one channel (probably
not the best idea) - a new node shows up and everyone starts coughing
their data onto the wire. Add a really low dmax value and a node
rebooting every few minutes (or a bad wire/switch/NIC) and you have a
lovely little mess.

Alex

Jason A. Smith wrote:
> On Thu, 2006-03-23 at 15:47 -0800, Chuck Simmons wrote:
>
>> Alex --
>>
>> Thanks for the details. Telnetting to a gmond XML port to dump
>> internal state is a nice debugging technique.
>>
>> One of my problems is that I'm running a secondary daemon using the
>> gmetric subroutine libraries, and it took me a while to realize that
>> that daemon is in some ways equivalent to 'gmond'. In particular, I
>> have to restart it in addition to 'gmond'. The problem was
>> immediately obvious once I used the telnet trick you mentioned.
>
> Metrics also have a dmax attribute that should force their removal
> from memory once expired, but I don't remember if this is actually
> implemented or not.
>
>> So for the missing cpu data issue... Let me write down what's
>> happening, real slowly, to make sure I understand. I'm running a
>> multicast gmond on each cluster to aggregate data, implying that
>> each node of the cluster eventually aggregates data about all other
>> nodes of the same cluster. I'm using a centralized gmetad to pull
>> data from a node of each cluster. Presumably 'gmetad' doesn't really
>> remember a whole lot about the outlying nodes.
>
> I am not really sure what you mean here, but gmetad basically keeps
> info about all nodes in each cluster in memory, similar to how gmond
> keeps info about all nodes in its cluster in memory. Just like gmond,
> gmetad also respects the dmax attributes. If you don't have dmax set,
> or don't want to wait that long, then you will have to restart gmetad
> as well.
>
>> I go out to the cluster and kill gmond on each node. Then I go
>> through the nodes and start gmond back up on each node. As each node
>> starts, it broadcasts its number of cpus throughout the cluster.
>> Thus, when I'm done restarting, one of the nodes (the first to
>> restart) knows how many cpus each node has, but nodes that were
>> restarted last don't have complete state information.
>
> Not exactly true, see below.
>
>> When I then restart 'gmetad' at the central location, it connects
>> to one of the nodes in the cluster, and if that node doesn't have
>> full state information, gmetad incorrectly reports the number of
>> cpus in the cluster. [Since I am using a background process that
>> gathers metrics separately from 'gmond' relatively frequently, this
>> background process is probably causing all nodes in the cluster to
>> know about all of the hosts in the cluster, if not all of the
>> metrics of all of the hosts in the cluster.]
>> This will eventually correct itself since all metrics are
>> periodically rebroadcast.
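> (Side note, since that background daemon of yours speaks the gmetric
> protocol: you can attach a dmax to each metric you inject, so it
> expires on its own instead of lingering forever. The command-line
> equivalent would be something like this - metric name and values are
> made up for illustration:
>
>     gmetric --name=my_metric --value=42 --type=uint32 \
>             --units=widgets --dmax=120
>
> With that, any gmond holding the metric should drop it 120 seconds
> after your daemon stops resending it - assuming the dmax expiry is
> actually implemented, per my caveat above.)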
>> Possible alternate fixes may include:
>>
>> (1) When a node receives a broadcast from another node that it
>> hasn't seen before, it may want to send its data back to the first
>> node. If I start node A and it broadcasts to an empty cluster, then
>> I start node B and it broadcasts to A, then it might be nice if node
>> A sends data back to B, because it can reasonably infer that B
>> doesn't have A's state and that B should have A's state.
>
> I haven't checked the gmond sources lately, but this is exactly what
> it was designed to do. Anytime gmond sees data from a new node that
> it hasn't seen before, it assumes that node doesn't know anything
> about it either, and sends a complete set of its own metrics out on
> the multicast address. This can actually cause part of the problem,
> especially if you restart gmond on a lot of nodes all at the same
> time, basically because multicast is udp based and therefore does not
> have guaranteed packet delivery. I think during this burst of udp
> metrics from many nodes, some get lost and you will just have to wait
> till they are resent later.
>
>> (2) Maybe daemons that gather metrics should not directly broadcast
>> them throughout a cluster. Instead, the metrics should be
>> accumulated within a central daemon and then be broadcast. (In other
>> words, treat 'gmond' as having two separate components: a metrics
>> gathering component and a metric/cluster aggregation component. Then
>> both the metrics component of 'gmond' and the metrics that I am
>> gathering should be handed to the aggregation component.) [This is
>> probably not useful without also implementing (1) above.]
>>
>> (3) Alex implies that there may be alternate ways to configure a
>> cluster without using multicasting which may handle some or all
>> aspects of this problem.
>
> You can configure gmond to use unicast if you don't need or care
> about the HA feature that multicast gives you (a config sketch for
> this follows below).
>
>> [We can treat each node as maintaining a list of metrics and their
>> current values, and broadcasting deltas to that list on a periodic
>> basis. In the current system, it is possible to receive a delta
>> without having the background data to which the delta applies.
>> Multiple daemons each spitting out deltas to their own metrics is
>> compatible with the current model. However, we may want to have all
>> the background data in a single list; we may also want each node to
>> know which metric-gathering daemons exist so that we can better
>> report when one of the metric-gathering daemons dies.]
>>
>> Moving on to the issue of correcting configuration problems. While
>> we can say that having a timeout is the way to correct configuration
>> issues, this is not necessarily the best implementation. Part of my
>> problem is that I have multiple daemons that gather and broadcast
>> metrics. If we address parts of that as discussed above, then it
>> becomes easier to fix the broadcast address by just resetting a
>> single daemon.
>
> There was a plan to provide a plugin architecture for writing custom
> metrics in ganglia, I am not sure what happened to that though.
>
>> So, at the current time, we can configure the system in a couple of
>> ways. We can configure the system so that a host is considered
>> removed from a cluster when the host has been down sufficiently
>> long, or we can manually remove the host from the cluster by
>> restarting all gmond daemons in the cluster.
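> (The unicast sketch promised above: the send side is just a
> udp_send_channel block in gmond.conf pointing at a collector host
> instead of a multicast group. Roughly - hostname and port made up:
>
>     udp_send_channel {
>       host = collector.example.com
>       port = 8649
>     }
>
> with a matching udp_recv_channel and tcp_accept_channel on the
> collector, and no multicast anywhere. Adding a second
> udp_send_channel block pointing at a second collector buys back some
> of the redundancy you lose by dropping multicast.)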
>> Possible alternate approaches might include providing a command
>> that could be sent to a 'gmond' daemon in a cluster to remove a host
>> from the cluster. It may be that there already exist mechanisms to
>> restart all gmond daemons in a cluster, but this mechanism is not
>> integrated into ganglia.
>>
>> So, thanks, I think I now understand what's going on.
>>
>> Cheers, Chuck
>>
>>
>> Alex Balk wrote:
>>
>>> Hi Chuck,
>>>
>>> See below...
>>>
>>> Chuck Simmons wrote:
>>>
>>>> The number of cpus does get sorted out, but I don't believe that
>>>> restarting 'gmond' is a solution. The problem occurs after
>>>> restarting a number of 'gmond' processes, and the problem is
>>>> caused because 'gmond' is not reporting the information. Does
>>>> 'gmond' maintain a timestamp on disk as to when it last reported
>>>> the number of cpus and insist on waiting sufficiently long to
>>>> report again? Does the collective distributed memory of the
>>>> system remember when the number of cpus was last reported but not
>>>> remember what the last reported value was? Is there any chance
>>>> that anyone can give me hints to how the code works without me
>>>> having to read the code and reverse engineer the intent?
>>>
>>> The reporting interval for the number of CPUs is defined within
>>> /etc/gmond.conf. For example:
>>>
>>> collection_group {
>>>   collect_once = yes
>>>   time_threshold = 1800
>>>   metric {
>>>     name = "cpu_num"
>>>   }
>>> }
>>>
>>> The above defines that the number of CPUs is collected once at the
>>> startup of gmond and reported every 1800 seconds.
>>>
>>> Your problem occurs because gmond doesn't save any data on disk,
>>> but rather in memory. This means that if you're using a single
>>> gmond aggregator (in unicast mode) and that aggregator gets
>>> restarted, it will not receive another report of the number of
>>> CPUs until 1800 seconds have elapsed since the previous report.
>>>
>>> The case of multicast is a more interesting one, since every node
>>> holds data for all nodes on the multicast channel. The question
>>> here is whether an update with a newer timestamp overrides all
>>> previous XML data for the host. I don't think that's the case; it
>>> seems more likely that only existing data is overwritten... but
>>> then, I don't use multicast, so you may qualify this answer as
>>> throwing useless, obvious crap your way.
>>>
>>> Generally speaking, there are 2 cases when a host reports a metric
>>> via its send_channel:
>>>
>>> 1. When a time_threshold expires.
>>> 2. When a value_threshold is exceeded.
>>>
>>> You're welcome to read the code for more insight, but a simple
>>> telnet to a predefined TCP channel would probably be quicker. You
>>> could just look at the XML data and compare pre-update and
>>> post-update values (yes, you'll need to take note of the
>>> timestamps - again, in the XML).
>>>
>>>> I understand that I can group nodes via /etc/gmond.conf. The
>>>> question is, once I have screwed up the configuration, how do I
>>>> recover from that screw up? I have restarted various gmetad's and
>>>> various gmond's. The grouping is still incorrect. Exactly which
>>>> gmetad's and gmond's do I have to shut down, and when? And,
>>>> again, my real question is about understanding how the code
>>>> works -- how the distributed memory works.
>>>
>>> As far as I know, you cannot recover from a configuration error
>>> unless you've made sure host_dmax was set to a fairly small,
>>> non-zero value.
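>>> For example, something like this in the globals section of
>>> gmond.conf (the value here is made up; pick whatever suits your
>>> churn rate):
>>>
>>> globals {
>>>   host_dmax = 120   /* seconds; forget a silent host after 2 min */
>>> }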
>>>
>>> From the docs:
>>>
>>> The host_dmax value is an integer with units in seconds. When set
>>> to zero (0), gmond will never delete a host from its list even
>>> when a remote host has stopped responding. If host_dmax is set to
>>> a positive number then gmond will flush a host after it has not
>>> heard from it for host_dmax seconds. By the way, dmax means
>>> ``delete max''.
>>>
>>> This way, once a host's configuration is modified to point at a
>>> different send channel, the aggregator(s) on its previous channel
>>> will forget about its existence once host_dmax expires.
>>>
>>> Personally, I don't use multicast for various reasons, the main
>>> one actually being its main advantage - every node keeps data on
>>> the entire cluster. While this provides for maximal high
>>> availability, it also has a bigger memory footprint, especially
>>> when you have a few thousand nodes.
>>>
>>>> I'd much rather be ignored than have people try to pawn off
>>>> facile answers on me.
>>>
>>> I'd provide you with more information on a possible setup which
>>> balances high availability with performance, but I wouldn't want
>>> to overflow you with useless data any more than I've done so far.
>>> Let me know if you'd like more information.
>>>
>>> Cheers,
>>> Alex
>>>
>>>> Cheers, Chuck
>>>>
>>>>
>>>> Bernard Li wrote:
>>>>
>>>>> Hi Chuck:
>>>>>
>>>>> For the first issue - give it time, it should sort itself out.
>>>>> Alternatively, you can find out which node is reporting
>>>>> incorrect information, and restart gmond on it.
>>>>>
>>>>> For the second issue, you can group nodes in different
>>>>> data_source entries via the multicast port in /etc/gmond.conf.
>>>>> Use the same port # for nodes you want belonging to the same
>>>>> group.
>>>>>
>>>>> You'll need to restart gmetad and gmond for the new groupings to
>>>>> take effect.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Bernard
>>>>>
>>>>> ------------------------------------------------------------
>>>>> *From:* gan...@li... on behalf of Chuck Simmons
>>>>> *Sent:* Wed 22/03/2006 17:54
>>>>> *To:* gan...@li...
>>>>> *Subject:* [Ganglia-developers] reorganizing clusters
>>>>>
>>>>> I need help understanding two things.
>>>>>
>>>>> I currently have a grid. One of the clusters in the grid is
>>>>> named "staiu", and the "grid" level web page reports that this
>>>>> has 8 hosts containing 4 cpus. In actuality, it has 8 hosts each
>>>>> containing 4 cpus, but apparently the hosts are not reporting
>>>>> the current number of cpus to the front end. Why not? I recently
>>>>> restarted gmond on each of the 8 hosts.
>>>>>
>>>>> Another cluster is named "staqp05-08", and the "grid" level web
>>>>> page reports that this has 12 hosts. In actual fact, it only has
>>>>> 4 hosts. The extra 8 hosts are the 8 hosts of the 'staiu'
>>>>> cluster. On the cluster level page for staqp05-08, the "choose a
>>>>> node" pull-down menu lists the 8 staiu hosts, the "hosts up"
>>>>> number includes the staiu hosts, and there are undrawn graphs
>>>>> for each of the staiu hosts in the "load one" section. What do I
>>>>> have to do so that the web pages or gmond daemons or whatever
>>>>> won't think that the staqp cluster contains the staiu hosts?
>>>>> What is the specific mechanism that causes this association to
>>>>> persist despite having shut down all staqp gmond daemons, and
>>>>> both the gmond and gmetad daemons on the web server,
>>>>> simultaneously, and then starting up that collection of daemons?
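>>>>>
>>>>> [In case it matters: as far as I know, all of these gmond.conf
>>>>> files still carry the stock multicast channel, i.e. something
>>>>> like:
>>>>>
>>>>> udp_send_channel {
>>>>>   mcast_join = 239.2.11.71   /* stock defaults, as I recall */
>>>>>   port = 8649
>>>>> }
>>>>>
>>>>> so every host, in either cluster, would be talking on the same
>>>>> address/port pair.]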
>>>>>
>>>>> Thanks, Chuck
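>>>>>
>>>>> P.S. In case it helps, the gmetad side of this is just two
>>>>> data_source lines, roughly like the following (node names and
>>>>> polling interval approximated from memory):
>>>>>
>>>>> # gmetad.conf - hostnames and interval are approximations
>>>>> data_source "staiu" 15 staiu01:8649
>>>>> data_source "staqp05-08" 15 staqp05:8649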