Thread: [Ganglia-general] monitoring a HA cluster


Second post today, separate topic...

I've got a few machines set up as active/passive clusters running
heartbeat/drbd.  I am currently monitoring them with ganglia, but I
think the information I'm getting gives a misleading picture.

Since both machines are monitored, it looks like I have 8 processors
in the cluster (4 each in 2 boxes).  But in reality, only one of these
machines is ever active at a time.  I keep a mental note that any time
these clusters show more than 50% utilization, they're really over
100% utilized, since the CPUs, RAM, etc. from the passive node
shouldn't count in the totals.  Always having to drill down to the
individual machine to see what's going on is kind of a pain.
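
To make the arithmetic concrete, here is a rough Python sketch of the
two ways of totalling.  It is not ganglia code: the node names, the
dictionary fields, and the function cluster_utilization are all
hypothetical, and the per-node figures are invented for illustration.
It assumes that all load has to fit on the active node's CPUs.

```python
def cluster_utilization(nodes, active_only=False):
    """Return busy CPUs as a fraction of the CPUs counted as capacity.

    nodes: list of dicts with hypothetical keys 'cpus', 'cpu_busy'
    (fraction of that node's CPUs in use) and 'active'.
    active_only: if True, only the active node's CPUs count as capacity.
    """
    # Total work being done, expressed in "busy CPUs" across all nodes.
    busy = sum(n["cpus"] * n["cpu_busy"] for n in nodes)
    # Capacity is either every monitored CPU or just the active node's.
    counted = [n for n in nodes if n["active"] or not active_only]
    capacity = sum(n["cpus"] for n in counted)
    return busy / capacity


# Illustrative two-node active/passive pair, 4 CPUs each (made-up load).
nodes = [
    {"name": "node-a", "cpus": 4, "cpu_busy": 0.75, "active": True},
    {"name": "node-b", "cpus": 4, "cpu_busy": 0.25, "active": False},
]

naive = cluster_utilization(nodes)                        # all 8 CPUs
effective = cluster_utilization(nodes, active_only=True)  # 4 CPUs only
print(f"naive: {naive:.0%}  effective: {effective:.0%}")
```

With these made-up numbers the summary view reads 50% while the
active node's capacity is already fully used, which is the situation
described above.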

The only solution I've thought of is to keep gmond turned off on the
passive node and start it during a resource migration.  This would
be easy enough, but it would have 2 drawbacks:
1. My stats would say 50% of my cluster is 'down' although it's
functioning correctly.
2. It is sometimes useful to monitor stuff on the passive node, and I
don't really want to lose that ability.

Any better ways to do this?  Maybe extend the PHP frontend to be
configurable for monitoring active/passive?  (Would anyone else have a
use for that besides me?)

thanks,
alex
