Current Release Notes
Ganglia Release 3.1.2
The current release of Ganglia is 3.1.2 and can be downloaded from here
This release fixes some of the known issues including a buffer overflow issue in gmetad (CVE-2009-0241). It also includes support for metric spoofing, similar to gmetric, from a metric module.
The following is the list of bug fixes and enhancements to the current release:
- gmond/gmetad: Sync up the default values for the cluster section of gmond with the default gmond.conf so that a cluster name will always be present. The gmetad code cannot handle a host with no associated cluster, so the gmond code must always include a cluster XML tag. Bug #200
- gmond: Add an 'enabled' directive to the module section so that a module can easily be enabled or disabled through the configuration file
- gmond: reformat memory metrics to match pre 3.1 style (REGRESSION)
- gmond: -r support for transforming 2.5 configurations (REGRESSION)
- gmond: add boolean option to 'allow_extra_data' generation (BUG199)
- gmond: include localhost in translated (-r) trusted_hosts from 2.5
- gmetad: skip unresponsive sources (BUG92)
- gmetad: CVE-2009-0241: buffer overflow in interactive port (BUG223)
- libganglia: mcast_if support in gmond (BUG140)
- web: add boolean option for using hostname without domainname for graphs
- web: add host attributes into metric list (BUG30)
- web: metric group enhancements for host view (BUG203)
- web: add option for configurable number of columns in cluster view (BUG194)
- web: make number of metric columns in host view configurable (BUG194)
- Allow both a C and python module to create a metric that will spoof a specific host. This provides the same spoofing functionality as gmetric but through a metric module. It is done by adding SPOOF_HOST and SPOOF_NAME as extra metadata to the metric description
- gmond: mod_python support for versions older than 2.3 or newer than 2.4
- mod_python: Change the way that the python module path is added to better support the Solaris platform. It is also a cleaner way to add the python path programmatically rather than altering the PYTHONPATH environment variable.
- gmetric: Support the short commandline parameter format when spoofing a heartbeat metric. (Regression fix from 3.0.x)
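The module-level spoofing item in the list above can be sketched for a Python metric module as follows. This is a minimal sketch, not code taken from the Ganglia sources. It assumes that extra key/value pairs in a metric descriptor are forwarded as extra metadata, that the keys `spoof_host` and `spoof_name` correspond to SPOOF_HOST and SPOOF_NAME, and that the host value uses the colon-separated `ip:hostname` form that gmetric accepts. The metric name, the `SpoofHost`/`SpoofName` parameter names, and the host values are hypothetical.

```python
import random

# Hypothetical defaults; a real module would normally read these from the
# params dict that gmond builds from the module's .pyconf file.
DEFAULT_SPOOF_HOST = "10.0.0.99:storage-array-1"  # assumed "ip:hostname" form
DEFAULT_SPOOF_NAME = "storage-array-1"


def random_numbers(name):
    """Metric callback: return the current value for the metric `name`."""
    return random.randint(0, 100)


def metric_init(params):
    """Return the list of metric descriptors offered by this module."""
    spoof_host = params.get("SpoofHost", DEFAULT_SPOOF_HOST)
    spoof_name = params.get("SpoofName", DEFAULT_SPOOF_NAME)
    descriptor = {
        "name": "spf_random_numbers",
        "call_back": random_numbers,
        "time_max": 90,
        "value_type": "uint",
        "units": "N",
        "slope": "both",
        "format": "%u",
        "description": "Example spoofed metric (random numbers)",
        "groups": "example",
        # Assumption: extra keys travel with the metric as extra metadata,
        # which is how the spoofed host identity would reach gmond.
        "spoof_host": spoof_host,
        "spoof_name": spoof_name,
    }
    return [descriptor]


def metric_cleanup():
    """Nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Exercise the module outside gmond for a quick sanity check.
    for d in metric_init({}):
        print(d["name"], d["spoof_host"], d["call_back"](d["name"]))
```

Running the file directly only checks that the descriptor is well formed; whether the metric is attributed to the spoofed host has to be verified against a running gmond.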
Dealing with .conf file changes when installing from an RPM
If you build RPMs from the tarball and try to upgrade, RPM will create /etc/ganglia/{gmetad,gmond}.conf.rpmsave. This is because there were modifications made to the configuration files from the previous version.
For gmetad, the modifications are negligible, so you could keep your old configuration. For gmond, the modifications are a little more intrusive, mostly because the moduledir is auto-detected so the full path to the DSOs is no longer necessary. Your existing 3.1.0 configuration file will continue to work with 3.1.1 if you decide to keep it. However, if you would like to use the new base configuration file, do the following:
- Before you upgrade, run `gmond -t > ~/gmond-3.1.0.conf`
- `diff -ru gmond-3.1.0.conf /etc/ganglia/gmond.conf > ~/gmond.conf.diff`
- `cp -a /etc/ganglia/gmond.conf /etc/ganglia/gmond.conf.bak`
- Upgrade by running `rpm -Fvh *.rpm`
- `mv /etc/ganglia/gmond.conf.rpmsave /etc/ganglia/gmond.conf`
- `cd /etc/ganglia/ && patch -p0 --dry-run < ~/gmond.conf.diff`
- If everything checks out, run patch without the '--dry-run' option
- You're done
Previous Release Notes
Ganglia Release 3.1.1
The current release of Ganglia is 3.1.1 and can be downloaded from here
Unofficial and experimental install packages for Debian and Windows can be downloaded from:
Debian
Windows
This release fixes some of the known issues reported below, including the gmetad segfault that was preventing 3.1.0 from being used in a hierarchical configuration and the instabilities in the tcpconn python metric module for gmond.
The following is the list of bug fixes and enhancements to the current release:
- Fix segfault when aggregating 3.1 gmetad
- Fix failures and instability for tcpconn.py
- Module directory configurable at build and run time
- Autodetect libdir/moduledir for bi-arch Linux architectures
- Include contrib directory with user provided goodies
- Support for building C++ DSO
- Support for building with Sun Studio 12 in OpenSolaris
- In some platforms (BSD) where /var/lib/ganglia doesn't exist, the RRDs should now be stored and accessed from /var/db/ganglia
- In node view show correctly the downtime relative to cluster time
- In meta view show grid summary always on top regardless of sorting
- Smoother Web frontend navigation by removing interstitial pages, remembering the selected metrics and clarifying messages
- Bug fixes and Enhancements
Ganglia Release 3.1.0
This release of Ganglia can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280&release_id=616721.
Note: There is a known bug where a 3.1.0 gmetad will segfault when trying to aggregate XML data from another 3.1.0 gmetad. If your environment requires this feature, please wait for the upcoming 3.1.1 release. A patch has been developed and is available here if you can't wait.
Upgrading Instructions
All releases with version number 3.1.x are meant to be compatible even if they might require small configuration updates or relocating files in some cases.
Upgrading from 3.0
The 3.0 and 3.1 versions of ganglia are only compatible at the XML layer, and so you can't mix them in the same cluster (as defined by the multicast address or unicast collector used).
A 3.1 gmetad can collect data from either version (gmond or gmetad), so it is best to plan your migration on a cluster-by-cluster basis, starting from the top of your gmetad hierarchy (if you have one) and working down recursively, as described by the following steps:
- upgrade your head gmetad/frontend (or your only gmetad if not using a hierarchy)
- upgrade each one of the sources one cluster at a time (will include upgrading a leaf gmetad if using a hierarchy)
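The top-down ordering in the steps above can be sketched as a traversal of the gmetad hierarchy. This is only an illustration: the hierarchy, the node names, and the `upgrade_order` helper are hypothetical, and breadth-first order is just one of several orders that upgrade every collector before the sources it polls.

```python
from collections import deque

# Hypothetical hierarchy: each gmetad maps to the sources it polls.
# Names ending in "-cluster" stand for gmond clusters (leaves).
HIERARCHY = {
    "head-gmetad": ["leaf-gmetad-a", "web-cluster"],
    "leaf-gmetad-a": ["db-cluster", "batch-cluster"],
}


def upgrade_order(root, hierarchy):
    """Return nodes in top-down (breadth-first) order starting at root."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Sources are queued after their collector, so a collector is
        # always upgraded before anything it aggregates.
        queue.extend(hierarchy.get(node, []))
    return order


if __name__ == "__main__":
    for step, node in enumerate(upgrade_order("head-gmetad", HIERARCHY), 1):
        print(step, node)
```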
Upgrading from 2.5
The 2.5 and 3.0 versions of ganglia are expected to be compatible at the XML layer, and so the instructions for 3.0 should work as well here.
Beware that the configuration format changed between 2.5.7 and 3.0, so you will need to use the conversion utility (gmond -r) to get an equivalent configuration for your new setup. The converted configuration will need additional entries to work, as noted in the known issues.
The file locations have also changed, as noted in the important notes.
Important Notes:
- 3.1 disk metrics use a different unit in AIX so additional steps might be required when upgrading, check README.AIX for details
- 3.1 memory metrics use floating point values instead of 32bit static width integers to avoid artificial limitations to the memory reported, therefore, if using a hierarchical gmetad configuration you will need to ensure that summarization is done by a 3.1 gmetad or it will be incorrect.
- 3.1 no longer treats user generated metrics specially, so all metrics will be shown together with the core metrics in a 3.1 frontend and there will no longer be a gmetrics link in the host view. If using a 3.0 frontend, all user metrics from a 3.1 gmond will be mixed with the core metrics, while metrics from an older gmond will still show in the gmetrics page.
- The 3.1 configuration file format has changed and the file has moved to a different directory. If you want to convert your current 3.0 configuration to a 3.1 equivalent, refer to the gmond 3.1 configuration page and remember to move it to the new directory (done automatically if using RPM).
- 3.1 collectors will request a gmond to resend its metric description information if needed when using multicast. With unicast there is no way to do that yet, so if you restart your collector it will be left with partial or no data from the cluster being collected through it until all gmond in that cluster are restarted. To work around this problem when using unicast, set send_metadata_interval to a reasonable value so that all gmond resend their metadata periodically to the collector in case it gets lost.
- 3.1 manages the metadata for the metrics independently of the metric data itself. This leads to cases where a metric has been defined but no value for it has been collected yet, which is especially problematic for metrics that are updated infrequently, like cpu_count. With a default configuration using multicast, getting all gmond in a cluster to agree on the number of CPUs available could take up to 20 minutes, so bear that in mind when restarting gmond on your clusters, and start with the collector (as defined by the gmetad configuration).
- 3.1.2 may fail to build in some 64bit system configurations if 32bit libraries/headers for expat and apr are also installed. To work around this problem, either remove those packages or rebootstrap the package by running autoreconf.
Known Issues:
- no support for C++ to create DSO modules (Objective C should work though) (fixed in 3.1.1)
- instability in the tcpconn python metric module (race condition affects gmond -m; collection thread crashes as shown in BUG196) (fixed in 3.1.1)
- Linux 64bit platforms that have biarch support through /usr/lib64 should use --libdir at configure time (done automatically if using RPM) (fixed in 3.1.1)
- the following platforms won't be able to build or have a working gmond:
  - Darwin (AKA MacOS/X)
  - HPUX
  - Tru64 (AKA OSF/1)
  - Irix
- the following platforms won't be able to build DSO metric modules:
  - Cygwin (AKA Windows)
  - AIX
- testing for library dependencies is flaky at best and relies on gcc intrinsic support, so a build that gets through configure might still fail at link time. Packages for libconfuse 2.5 have been known to be problematic on several platforms and might require extra parameters to be added through the use of LDFLAGS or LIBS as shown in BUG197.
- 3.1 modular metrics don't support spoofing, so you'll have to use gmetric if spoofing is needed (fixed in 3.1.2)
- C99 support from the compiler used is assumed and tested for, but at least in Solaris 10 it could fail to get enabled correctly and result in a failed build as shown in BUG215. Adding CFLAGS="-std=c99" at configure time might be needed in those cases.
- the additional metric modules are Linux specific and there is no support for making them architecturally agnostic yet.
- if an additional metric module is configured but fails to load (like trying to start a Linux specific metric in a different platform) gmond will fail to start silently and will require the additional metric to be removed to recover.
- all metric collection routines run as root even if the process has since lowered privileges, be careful with your python module scripts
- python module support requires python 2.3 or newer; python 2.4 is recommended as it has been tested the most (partially fixed in 3.1.2)
- if using python modules gmond will segfault while trying to start if there is no gmond.conf
- in python metrics an unhandled exception will result in the collection thread getting killed and the metric not being collected anymore until gmond is restarted.
- errors in the configuration could result in gmond/gmetad aborting silently or, in some cases, in segfaults as explained above. Try to start the process in the foreground with debugging enabled (-d9) to get a better idea of what the configuration problem might be in those cases.
- the conversion utility doesn't generate a working 3.1 configuration because it has no modules section and won't report any metrics, if upgrading from 2.5 it will be better to generate a default configuration (gmond -t) and update it from there (fixed in 3.1.2)
- mcast_if is not honoured, so if you have multiple interfaces and are using multicast you will need to add a static route to force multicast packets through the interface you want to be used (fixed in 3.1.2)
- if a module is loaded twice gmond will leak memory and crash quickly (or crash the machine where it is running) so be careful with your module listing in the configuration.
- gmetad will crash if it finds a gmond that is not part of a cluster, so be sure that all your gmond include a "cluster" section in their configuration (fixed in 3.1.2)
TCPConn Python Metric Module
The tcpconn.py metric module has a known issue that can interfere with the module list that gmond produces when using the '-m' parameter. The module uses netstat to gather TCP connection information. Because it uses the popen2 functions to exec the netstat utility, the exec'd process does not always terminate gracefully when gmond is invoked with '-m'. This can cause the module list output to terminate abnormally. The following is a more detailed description of the issue and workarounds:
The tcpconn.py module gathers connection data by taking advantage of python threads. It spins up its own gathering thread that periodically execs netstat and updates an internal array of metrics. When the gmond main thread requests the metrics, all it does is read the internal array and return the last gathered value. This allows gmond to execute normally without having to worry about delays. At worst, the tcpconn gathering thread might occasionally be delayed, which has no effect on anything else. It was written this way on purpose so that gmond would never be at the mercy of the python exec code, netstat execution delays or OS delays.
The delay only shows up for gmond when the tcpconn metric_cleanup() function is called and the main gmond process has to wait for the tcpconn gathering thread to shut down. That is why the delay appears with the -m parameter and also when shutting down gmond. The gmond -m option causes metric_init(), which starts the gathering thread, and metric_cleanup(), which shuts it down, to run one immediately after the other, so gmond has to wait for the thread cleanup.
tcpconn.py also takes a RefreshRate parameter that can be set in the tcpconn.pyconf configuration file. This parameter determines how often the gathering thread attempts to exec netstat to get a new value for the internal structure. The gathering of the netstat value and the gathering of the gmond metric can run on two different cycles because latency cannot be predetermined.
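The gathering-thread design described above can be sketched as follows. This is not the tcpconn.py source: the `_sample` function is a hypothetical stand-in for running netstat, the metric name is made up, and the interruptible wait in the loop is a choice of this sketch that keeps cleanup short rather than a description of how tcpconn.py behaves.

```python
import threading
import time

_lock = threading.Lock()
_values = {"tcp_established": 0}   # last gathered values, read by callbacks
_stop = threading.Event()
_worker = None


def _sample():
    """Hypothetical stand-in for running netstat and counting connections."""
    return {"tcp_established": int(time.time()) % 100}


def _gather_loop(refresh_rate):
    """Refresh the cache every refresh_rate seconds until asked to stop."""
    while not _stop.is_set():
        fresh = _sample()
        with _lock:
            _values.update(fresh)
        # Event.wait doubles as an interruptible sleep, so cleanup does not
        # have to wait for a full refresh interval to elapse.
        _stop.wait(refresh_rate)


def get_value(name):
    """Metric callback: return the cached value without running anything."""
    with _lock:
        return _values[name]


def metric_init(params):
    """Start the gathering thread and describe the metric."""
    global _worker
    refresh_rate = float(params.get("RefreshRate", 5))
    _stop.clear()
    _worker = threading.Thread(target=_gather_loop, args=(refresh_rate,))
    _worker.daemon = True
    _worker.start()
    return [{
        "name": "tcp_established",
        "call_back": get_value,
        "time_max": 20,
        "value_type": "uint",
        "units": "connections",
        "slope": "both",
        "format": "%u",
        "description": "Cached connection count (sketch)",
        "groups": "network",
    }]


def metric_cleanup():
    """Signal the gathering thread and wait for it to exit."""
    _stop.set()
    if _worker is not None:
        _worker.join()
```

Because the callback only reads the cache, the collection cycle and the refresh cycle stay independent, which is the property the description above relies on.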
There are a couple of possible solutions to the issue. The first is to move back to using the subprocess python library for exec'ing the netstat utility. The python library allows the netstat data to be read from the command line and the process to terminate without additional wait time. The original version of the tcpconn.py module used the subprocess library calls rather than the popen2 functions. The original version can be found here:
The disadvantage of this solution, and the reason it was changed to use the popen2 functions, is its incompatibility with pre-2.4 versions of Python. A second solution is to move the spawning of the metric gathering thread out of the metric_init() module callback function. This prevents the tcpconn.py module from attempting to spawn the data gathering thread during the '-m' module list processing and therefore avoids the early termination of the list. This second solution is likely the one that will be applied in the next minor release of Ganglia 3.1.
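Both workarounds can be sketched together: a command runner based on the subprocess library, and a gathering thread that starts on the first metric callback instead of in metric_init(). This is an illustration under stated assumptions, not the fix that shipped. The command is a hypothetical Python one-liner standing in for netstat so the sketch runs anywhere, and the metric name and helper names are made up.

```python
import subprocess
import sys
import threading

_lock = threading.Lock()
_stop = threading.Event()
_worker = None
_refresh_rate = 5.0
_cache = {"lines": 0}

# Hypothetical command; tcpconn.py runs netstat. A Python one-liner keeps
# the sketch runnable on systems without netstat.
COMMAND = [sys.executable, "-c", "print('a'); print('b')"]


def run_command(argv):
    """Run argv, wait for it to exit, and return its stdout as text."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    out, _ = proc.communicate()   # reads all output and reaps the child
    return out.decode()


def _gather_loop():
    """Count output lines of COMMAND until asked to stop."""
    while not _stop.is_set():
        count = len(run_command(COMMAND).splitlines())
        with _lock:
            _cache["lines"] = count
        _stop.wait(_refresh_rate)


def _ensure_started():
    """Start the gathering thread on first use, not in metric_init()."""
    global _worker
    with _lock:
        if _worker is None:
            _worker = threading.Thread(target=_gather_loop)
            _worker.daemon = True
            _worker.start()


def get_value(name):
    """Metric callback: start the thread if needed, then read the cache."""
    _ensure_started()
    with _lock:
        return _cache["lines"]


def metric_init(params):
    """Describe the metric; no thread is spawned here."""
    global _refresh_rate
    _refresh_rate = float(params.get("RefreshRate", 5))
    return [{"name": "cmd_lines", "call_back": get_value, "time_max": 20,
             "value_type": "uint", "units": "lines", "slope": "both",
             "format": "%u", "description": "Lazy-start sketch",
             "groups": "example"}]


def metric_cleanup():
    """Stop the gathering thread if it was ever started."""
    _stop.set()
    if _worker is not None:
        _worker.join()
```

With this layout, an init immediately followed by a cleanup never creates the thread, so there is nothing to wait for. The trade-off is that the first callback returns the initial cached value until the first refresh completes.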
Python Module Setup
The 'make install' target does not install and set up the python modules by default. If you are installing Ganglia using 'make install' rather than an RPM and are interested in the python metric modules, you will need to install them manually. The source tarball includes a README file that describes how to install and configure a python metric module as well as how to develop one. You can find the README file under gmond/modules/python.
Basically, a python_modules directory needs to be created. This directory will contain all of the .py files for each python metric module. The full path to this location needs to match the 'path' directive in the 'modules' section of the modpython.conf configuration file. Modpython will use this path directive to locate all of the python metric modules. It will attempt to load all .py files that it finds in this location. In addition, the .pyconf configuration file for each python module needs to be copied to the /etc/ganglia/conf.d directory. Some of the .pyconf files may need additional configuration. For example, the multidisk.py python module by default does not specify the actual metric name, because the actual metric name is not determined until the module is loaded for the first time. To discover the metric names for each of the multidisk metrics, invoke gmond with the -m parameter after copying the multidisk.py module to the python module location. The '-m' parameter will instruct gmond to load the multidisk module along with all other modules, and produce a list of all of the valid metrics. From the list, you can extract the actual metric name that corresponds to each disk metric.
Be aware that the python modules (except for the example one) were designed to run on Linux and have only been tested on Red Hat/SuSE. If you are using something else and a python module you loaded is not able to initialize, it will abort gmond at startup and will need to be uninstalled manually to allow gmond to start again.
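The load-and-list behaviour described above can be approximated offline with a small helper. This is a sketch, not how modpython is implemented: `list_metric_names` is a hypothetical function, it assumes each module exposes metric_init(params) returning descriptors with a 'name' key, and it only works for modules that can be imported by the Python interpreter running the helper. It does not replace `gmond -m`, which remains the authoritative list.

```python
import importlib.util
import os


def list_metric_names(module_dir, params=None):
    """Import each .py file in module_dir and collect its metric names."""
    names = {}
    for filename in sorted(os.listdir(module_dir)):
        # Only .py files are modules; .pyconf files and others are skipped.
        if not filename.endswith(".py"):
            continue
        mod_name = filename[:-3]
        path = os.path.join(module_dir, filename)
        spec = importlib.util.spec_from_file_location(mod_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        try:
            descriptors = module.metric_init(params or {})
            names[mod_name] = [d["name"] for d in descriptors]
        finally:
            # Mirror the init/cleanup pairing so helper threads are stopped.
            cleanup = getattr(module, "metric_cleanup", None)
            if cleanup is not None:
                cleanup()
    return names
```

Calling `list_metric_names` with the path to your python_modules directory returns a dictionary that maps each module name to the metric names it declares, which can then be copied into the matching .pyconf file.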
Red Hat Enterprise Linux (and derivatives like CentOS or ScientificLinux)
RPM dependencies on Red Hat Enterprise Linux 4
If you would like to install the Ganglia 3.1 RPMs on Red Hat Enterprise Linux 4, you will need the following extra RPM dependencies that are not available from your distribution repository:
- apr-1
- libconfuse
If you also have the distribution provided version of apr (0.9.4) installed, don't worry as apr-1 and apr-0.9.4 can co-exist. The dependencies in RPM format can be downloaded from:
http://www.ganglia.info/releases/3.1-deps/el4/
Simply download the RPMs, and install them using `rpm -ivh`.
You can also put the RPMs in a yum repository and use yum to download and install the dependencies automatically. However, if you already have apr-0.9.4 installed, you will need to put this in your /etc/yum.conf to allow multiple versions of apr to be installed at the same time (which is what you want):
installonlypkgs=apr
Note: These are unofficial RPMs provided so that Ganglia 3.1.x can be installed easily on RHEL4. When updates for apr become available via official channels (e.g. via yum), these packages will not be updated unless the official version is greater than the one provided (which is 1.2.8-6).
RPM dependencies on Red Hat Enterprise Linux 5
apr-1 should be provided by the distribution. libconfuse can be installed via EPEL.