From: Lonni J F. <net...@gm...> - 2011-08-19 20:47:05
Greetings,

I'm running a server with munin-1.4.5 on a Fedora14-x86_64 system, which is monitoring nearly 300 systems. Most of the time everything works fine. However, seemingly at random, but usually about once a week, munin suddenly claims that it can't connect to dozens of nodes at the same time. In the munin-update log, I see errors such as:

    [ERROR] Munin::Master::UpdateWorker<cuda-linux-bench5;cuda-linux-bench5> failed to connect to node

or something like:

    [WARNING] Call to accept timed out

For the example system above, I see gaps such as the following in the rrd file as a result of munin not being able to connect:

    $ rrdtool fetch cuda-linux-bench5-cpu-system-d.rrd AVERAGE -s -75m
                                 42

    1313782200: 3.9515947113e-01
    1313782500: 2.9118542674e-01
    1313782800: 2.1081812152e-01
    1313783100: -nan
    1313783400: -nan
    1313783700: -nan
    1313784000: -nan
    1313784300: -nan
    1313784600: -nan
    1313784900: -nan
    1313785200: -nan
    1313785500: 4.2049231082e-01
    1313785800: 4.5451342387e-01
    1313786100: 4.4806339035e-01
    1313786400: 4.1462144143e-01

The part that makes even less sense is that while this is happening, I can successfully telnet to port 4949 from the munin server to any of the nodes that munin-update claims it cannot connect to. Then, after anywhere from 15 to 45 minutes, the problem goes away on its own.

I'm at a loss as to how to debug this, since whatever is wrong doesn't reproduce with a trivial telnet attempt. Is there a known issue like this? How can I debug this better?

thanks!
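To line the outages up against the munin-update log timestamps, it may help to pull the exact gap boundaries out of the rrdtool output. The following is only a rough sketch in Python, not part of munin or rrdtool: the function name `find_gaps` is made up for illustration, and it assumes the text layout of `rrdtool fetch` shown above (one `timestamp: value` row per sample, with `-nan` for missing data).

```python
import math

# Sample rows copied from the rrdtool fetch output quoted in this post.
SAMPLE = """\
                             42

1313782200: 3.9515947113e-01
1313782500: 2.9118542674e-01
1313782800: 2.1081812152e-01
1313783100: -nan
1313783400: -nan
1313783700: -nan
1313784000: -nan
1313784300: -nan
1313784600: -nan
1313784900: -nan
1313785200: -nan
1313785500: 4.2049231082e-01
1313785800: 4.5451342387e-01
1313786100: 4.4806339035e-01
1313786400: 4.1462144143e-01
"""


def find_gaps(fetch_output):
    """Return (first_ts, last_ts, sample_count) for each run of NaN rows.

    Hypothetical helper: only the first data column of each row is checked.
    """
    gaps = []
    run = []  # timestamps of the NaN run currently being collected
    for line in fetch_output.splitlines():
        ts, sep, rest = line.partition(":")
        ts = ts.strip()
        if not sep or not ts.isdigit():
            continue  # skip the data-source header and blank lines
        fields = rest.split()
        if not fields:
            continue
        if math.isnan(float(fields[0])):  # float() accepts "-nan"
            run.append(int(ts))
        elif run:
            gaps.append((run[0], run[-1], len(run)))
            run = []
    if run:  # output ended while still inside a gap
        gaps.append((run[0], run[-1], len(run)))
    return gaps


if __name__ == "__main__":
    for first, last, count in find_gaps(SAMPLE):
        print(f"gap: {first} .. {last} ({count} missing samples)")
```

On the data above this reports a single gap of 8 missing samples from 1313783100 to 1313785200. With the 300-second step visible in the timestamps, that is 40 minutes, which matches the 15 to 45 minute outages described.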