[Ganglia-developers] Configurable Host TMAX


Hi all,

First of all, my apologies if this list is not appropriate for this  
mail, and thanks to all ganglia developers for the great work.

I encountered the following problem:

We need to monitor thousands of hosts. Each rack of 32 machines
reports to one gmond in that rack. A single gmetad then gathers the
data, using one data source per rack.

Because of the high number of monitored hosts, and because other
monitoring tools built on top of ganglia run on the gmetad host, we
needed to decrease the load on the system. We did so by putting the
rrdtool files (/var/lib/ganglia/rrds) on a tmpfs which is backed up
regularly.
As we also need to keep data for a long time but are limited in space
(tied to the tmpfs size, a few GB), we increased the polling interval
for each data source to 120 seconds.

The problem is that the host TMAX is hardcoded to 20 seconds. This
value is multiplied by 4 and compared to the time elapsed since the
last poll (TN), so with a 120-second polling interval the resulting
80-second threshold is too low: 80 seconds after the last poll of the
data source, most of the nodes (or all of them) are marked as being
down.
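
To make the arithmetic concrete, here is a rough sketch of the check as
I described it above. It is not the actual ganglia source: the function
names are made up for illustration, and whether the real comparison is
strict (>) or not (>=) is an assumption on my part.

```python
# Sketch of the host liveness rule described above (not ganglia's code).
HOST_TMAX = 20    # seconds; the hardcoded host TMAX
DOWN_FACTOR = 4   # TMAX is multiplied by 4 before the comparison


def host_is_down(tn, tmax=HOST_TMAX):
    """Return True when the time since the last poll (TN) exceeds 4 * TMAX.

    The strict '>' is an assumption; the real code may use '>='.
    """
    return tn > DOWN_FACTOR * tmax


def min_tmax_for_interval(poll_interval):
    """Smallest TMAX for which 4 * TMAX covers a whole polling interval."""
    return poll_interval / DOWN_FACTOR


if __name__ == "__main__":
    poll_interval = 120  # seconds, our per-data-source polling interval
    for tn in (60, 80, 81, poll_interval):
        print("TN=%ds down=%s" % (tn, host_is_down(tn)))
    print("minimum TMAX:", min_tmax_for_interval(poll_interval))
```

With a 120-second interval, this sketch suggests TMAX would have to be
at least 30 seconds for hosts to stay up between two polls.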

I wrote a patch (attached to this mail), based on version 3.0.7, that
makes the TMAX value configurable in the gmond.conf file, the same way
the DMAX value already is.
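
For illustration only, such a setting might sit next to the existing
host DMAX entry in the globals section of gmond.conf. The name
host_tmax and the value 40 are assumptions made for this sketch; refer
to the attached patch for the actual name.

```
globals {
  host_dmax = 0    /* secs; existing setting */
  host_tmax = 40   /* secs; hypothetical name, example value */
}
```

Under the rule described above, a value of 40 would give a threshold of
160 seconds, which leaves some margin over a 120-second polling
interval.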

My question is: is there a reason why this value has been kept
non-configurable? If not, would my patch be acceptable as it is?
If it is not acceptable, I am open to any suggestions for changes
that would make it acceptable.

Thanks,

Sebastien.
