From: Sebastien P. <seb...@cg...> - 2008-05-30 09:02:45
Hi all,

First of all, my apologies if this list is not the appropriate place for this mail, and thanks to all ganglia developers for the great work.

I encountered the following problem. We need to monitor thousands of hosts. Each rack of 32 machines reports to one gmond in the rack, and a single gmetad then gathers the data from one source per rack.

Because of the high number of monitored hosts, and because other monitoring tools built on top of ganglia run on the gmetad host, we needed to decrease the load on that system. We did so by putting the rrdtool files (/var/lib/ganglia/rrds) on a tmpfs which is backed up regularly. As we also need to keep data for a long time but are limited in space (tied to the tmpfs size, a few GB), we increased the polling interval for each data source to 120 seconds.

The problem is that the host TMAX is hardcoded to 20 seconds. This value is multiplied by 4 and compared to the time elapsed since the last poll (TN), so the resulting limit of 80 seconds is too low for a 120-second polling interval: 80 seconds after the last poll of the data source, most of the nodes (or all of them) are marked as being down.

I wrote a patch (attached to this mail), based on version 3.0.7, that makes the TMAX value configurable in the gmond.conf file, the same way the DMAX value is.

The question is: is there a reason why this value has been kept non-configurable? If not, would my patch be acceptable as it is? If it is not acceptable, I am open to any suggestions for changes that would make it acceptable.

Thanks,
Sebastien.
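
For illustration, the check described above can be modelled in a few lines of Python. This is only a sketch of the arithmetic in the mail, not ganglia's source code: the function name `host_is_down`, the constant names, the strict `>` comparison and the example value of 60 seconds for a configured TMAX are all assumptions made for the example.

```python
# Simplified model of the "host down" check described in the mail.
# All names are hypothetical; this is not taken from the ganglia sources.

HARDCODED_TMAX = 20  # seconds, the host TMAX value mentioned in the mail
DOWN_FACTOR = 4      # TMAX is multiplied by 4 before the comparison


def host_is_down(tn, tmax=HARDCODED_TMAX):
    """Return True when the time since the last poll (TN) exceeds 4 * TMAX.

    Whether the real comparison is strict (>) or not (>=) is an assumption.
    """
    return tn > DOWN_FACTOR * tmax


if __name__ == "__main__":
    # With a 120 s polling interval, TN can approach 120 between two polls.
    for tn in (30, 80, 81, 119):
        default = host_is_down(tn)               # limit is 4 * 20 = 80 s
        configured = host_is_down(tn, tmax=60)   # limit is 4 * 60 = 240 s
        print(f"TN={tn:3d}s  down with TMAX=20: {default}  "
              f"down with TMAX=60: {configured}")
```

With the default of 20 seconds, any host whose TN passes 80 seconds is reported as down even though the next poll is not due until 120 seconds. A larger, configurable TMAX moves that limit beyond the polling interval.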