I've been able to partially track down why my nodes are only reported as "up" for ~6 minutes before going "down" on the web interface: it looks like gmond stops receiving multicast and unicast messages. Restarting gmond on the “gatherer node” re-establishes listening for about 6 minutes, after which gmond again receives nothing from the other nodes.

A cron job restarts gmond every hour, but all that does is produce a sawtooth uptime chart. Restarting gmetad isn't needed, as it seems to be communicating with gmond OK; this is reflected in the web charts (I tried it anyway just in case, to no avail). gmetad reports that only the localhost's gmond is up. The rest are "down" until gmond is restarted.

Is there anything else I can check? Any ideas what I can do to make the systems be reported as "up"?
Recompiled with apr-1.4.5 (originally 1.2.7), no effect (another posting had this as a problem)
OS = CentOS 5.5 and 5.6
ganglia 3.2.0
confuse 2.7
pcre 8.13
rrdtool 1.4.4
No IPv6
network: all on the same switch
 
Per this posting by Avani Sharma there is a problem with older versions of apr:
http://sourceforge.net/mailarchive/message.php?msg_id=27794074
[root@lando ganglia]# ldd /usr/local/sbin/gmond | grep apr
libapr-1.so.0 => /usr/local/apr/lib/libapr-1.so.0 (0x00002b7fa3feb000)
[root@lando ganglia]# /usr/local/apr/bin/apr-1-config --version
1.4.5
 
 
 
/etc/gmond.conf (from system lando)
http://pastebin.com/d1XcBs18
Only the cluster name and port were changed, and the other hosts were added for unicast. The other nodes have this node set for unicast.

******Here is tcpdump when the systems are marked as down, from the “gathering gmond node”: **********
[root@lando ziggy]# /usr/sbin/tcpdump -i any ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
09:54:13.273830 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273851 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273870 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273890 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273909 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:13.273920 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:13.273935 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:33.274233 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274255 IP lando.33023 > 239.2.11.71.8641: UDP, length 44
09:54:33.274277 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274295 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274314 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274334 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:33.274365 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:34.274377 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
09:54:35.274753 IP lando.33023 > 239.2.11.71.8641: UDP, length 48
[snip]
 
********** Then I restart gmond on the gathering gmond node, which makes things all better **********
[root@lando ziggy]# /sbin/service gmond restart
Shutting down GANGLIA gmond:                               [  OK  ]
Starting GANGLIA gmond:                                    [  OK  ]
[root@lando ziggy]#

******Here is tcpdump when the systems are marked as up: **********
[root@lando ziggy]# /usr/sbin/tcpdump -i any ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
09:56:47.346753 IP yoda.46045 > 239.2.11.71.8641: UDP, length 176
09:56:47.346801 IP yoda.46045 > 239.2.11.71.8641: UDP, length 48
09:56:47.346807 IP yoda.46045 > 239.2.11.71.8641: UDP, length 204
09:56:47.346857 IP yoda.46045 > 239.2.11.71.8641: UDP, length 48
09:56:47.347078 IP lando.58895 > 239.2.11.71.8641: UDP, length 28
09:56:47.347298 IP lando.58895 > 239.2.11.71.8641: UDP, length 32
09:56:50.289075 IP han.51957 > 239.2.11.71.8641: UDP, length 52
09:56:50.289084 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289084 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289096 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289099 IP han.51957 > 239.2.11.71.8641: UDP, length 56
09:56:50.289200 IP yoda.46045 > 239.2.11.71.8641: UDP, length 28
[snip]
 
Approximately 6 minutes later gmond stops hearing the other nodes and only receives its own packets again. Restarting gmond on the other nodes has no effect on what the gathering gmond receives. It's always ~6 minutes, never 5 or 10.
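
To pin down exactly when each sender drops out, the tcpdump output above could be summarized per source host. The following is only a rough sketch: the function name `last_seen_per_sender` is made up for illustration, and it assumes the default tcpdump line format shown in the captures above, with lines in chronological order within a single day.

```python
import re
from datetime import datetime

# Matches tcpdump lines such as:
#   09:56:47.346753 IP yoda.46045 > 239.2.11.71.8641: UDP, length 176
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP (?P<src>\S+)\.(?P<sport>\d+) > "
    r"(?P<dst>\S+)\.(?P<dport>\d+): UDP, length (?P<len>\d+)"
)


def last_seen_per_sender(lines):
    """Return {sender: (last_timestamp, packet_count)}.

    Assumes the lines are in chronological order, so the last matching
    line for a sender carries its most recent timestamp. Timestamps have
    no date, so a capture that crosses midnight is not handled.
    """
    seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip tcpdump banners, "[snip]", shell prompts
        ts = datetime.strptime(match.group("ts"), "%H:%M:%S.%f").time()
        src = match.group("src")
        _, count = seen.get(src, (None, 0))
        seen[src] = (ts, count + 1)
    return seen
```

Feeding it a saved capture (any iterable of lines works, for example an open text file) gives the last time each host was heard, which can then be compared against the time gmond was restarted to see whether the cutoff really is the same ~6 minutes for every sender.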
 
[root@lando ganglia]# telnet localhost 8641 | grep 192.168
<HOST NAME="luke" IP="192.168.1.1" TAGS="" REPORTED="1315501243" TN="8" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499617">
<HOST NAME="yoda" IP="192.168.1.2" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315500568">
<HOST NAME="han" IP="192.168.1.7" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315499611">
<HOST NAME="lando" IP="192.168.1.8" TAGS="" REPORTED="1315501250" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1315501003">
Connection closed by foreign host.
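
For repeated checks, the same XML dump can be read and compared programmatically instead of via telnet and grep. This is a minimal sketch, not anything shipped with ganglia: the function names are hypothetical, port 8641 is simply the one used above, and the `factor=4` staleness threshold (a host counted as stale once TN exceeds four times TMAX) is my assumption about how the web frontend decides "down", so adjust it if your version behaves differently.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8641, timeout=5.0):
    """Read the full XML dump that gmond writes on its TCP port."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def host_freshness(xml_bytes, factor=4):
    """Return (name, ip, tn, tmax, stale) for every HOST element.

    TN is the number of seconds since the host was last heard from and
    TMAX the expected maximum interval between reports. The factor is
    an assumed threshold, not taken from the ganglia sources.
    """
    root = ET.fromstring(xml_bytes)
    rows = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        stale = tmax > 0 and tn > factor * tmax
        rows.append((host.get("NAME"), host.get("IP"), tn, tmax, stale))
    return rows
```

Calling `host_freshness(read_gmond_xml())` on the gathering node every minute or so and logging the result would show TN climbing for the remote hosts once their packets stop arriving, and the timestamp at which each one crosses the threshold.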