[munin-users] munin node randomly fails to connect to many nodes simultaneously

Greetings,
I'm running a server with munin-1.4.5 on a Fedora14-x86_64 system,
which is monitoring nearly 300 systems.  Most of the time everything
works fine.  However, seemingly at random (usually about once a
week), munin suddenly claims that it can't connect to dozens of nodes
at the same time.  In the munin-update log, I see errors such as:
[ERROR] Munin::Master::UpdateWorker<cuda-linux-bench5;cuda-linux-bench5> failed to connect to node

or something like:
[WARNING] Call to accept timed out

For the example system above, I see gaps such as the following in the
rrd file as a result of munin not being able to connect:
$ rrdtool fetch cuda-linux-bench5-cpu-system-d.rrd AVERAGE -s -75m
                             42

1313782200: 3.9515947113e-01
1313782500: 2.9118542674e-01
1313782800: 2.1081812152e-01
1313783100: -nan
1313783400: -nan
1313783700: -nan
1313784000: -nan
1313784300: -nan
1313784600: -nan
1313784900: -nan
1313785200: -nan
1313785500: 4.2049231082e-01
1313785800: 4.5451342387e-01
1313786100: 4.4806339035e-01
1313786400: 4.1462144143e-01
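
To line gaps like this up against the munin-update log, something like
the following could pull the start and end of each run of nan rows out
of the fetch output.  This is only a minimal sketch: find_gaps is a
hypothetical helper name, and it assumes the single-data-source text
output shown above.

```python
def find_gaps(fetch_output):
    """Return (first_ts, last_ts, samples) for each run of NaN rows.

    Assumes `rrdtool fetch` text output with a single data source, as in
    the listing above; header and blank lines are skipped.
    """
    gaps = []
    run = []  # timestamps of the NaN run currently being collected
    for line in fetch_output.splitlines():
        ts, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not ts.strip().isdigit() or not fields:
            continue  # not a "timestamp: value" data row
        if "nan" in fields[0].lower():
            run.append(int(ts))
        elif run:
            gaps.append((run[0], run[-1], len(run)))
            run = []
    if run:  # output ended while still inside a gap
        gaps.append((run[0], run[-1], len(run)))
    return gaps
```

Fed the listing above, this would report a single gap of 8 consecutive
five-minute samples, from 1313783100 to 1313785200.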

What makes even less sense is that while this is happening, I can
successfully telnet to port 4949 from the munin server to any of the
nodes that munin-update claims it cannot connect to.  Then, after
anywhere from 15 to 45 minutes, the problem goes away on its own.
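
Since a single manual telnet may not look much like hundreds of
connections opened at nearly the same moment, one thing that might be
worth trying during an outage is probing many nodes in parallel and
timing each attempt.  The sketch below is an assumption-laden
illustration, not how munin-update itself works: probe and probe_all
are hypothetical names, the worker count and timeout are arbitrary, and
it assumes each node greets a new connection with a line starting with
"#" (such as "# munin node at <hostname>").

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def probe(host, port=4949, timeout=10.0):
    """Connect once and read the greeting.

    Returns (host, ok, seconds, detail), where detail is the greeting
    line on success or the error on failure.
    """
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(256).decode("ascii", "replace").strip()
        # Assumption: a healthy node greets with a "#"-prefixed line.
        ok = banner.startswith("#")
        return host, ok, time.monotonic() - started, banner
    except OSError as exc:  # refused, unreachable, timed out, ...
        return host, False, time.monotonic() - started, repr(exc)


def probe_all(hosts, port=4949, timeout=10.0, workers=50):
    """Probe every host concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: probe(h, port, timeout), hosts))
```

Calling probe_all with the list of failing node names while the
problem is in progress, and printing each tuple, would show whether
connections are slow or failing under concurrency even though a
single telnet succeeds.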

I'm at a loss as to how to debug this, since whatever is wrong doesn't
reproduce with a trivial telnet attempt.  Is there a known issue like
this?  How can I debug this better?

Thanks!