Re: [Nagios-devel] Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes)
From: Mahesh K. <mk...@gm...> - 2006-12-21 16:47:43
Hi Ton!

> > Here is what we did to resolve.
> >
> > 1. Edit the include/nagios.h.in
> > change
> > #define COMMAND_BUFFER_SLOTS 1024
> > to
> > #define COMMAND_BUFFER_SLOTS 60000
> >
> > And change
> > #define SERVICE_BUFFER_SLOTS 1024
> > to
> > #define SERVICE_BUFFER_SLOTS 60000
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the nagios
> server is under load. They appear to be blocking on writing to the command
> pipe. Switching NSCA to single-daemon mode mitigates the problem (slaves
> will time out their passive results), but we wanted to know where any
> slowdowns could be.

We had the NSCA-related performance issues too. On the slaves, we started
writing the results to be forwarded to the master into a file, then sending
that file over to the master once every 10 or 15 seconds.

On 12/21/06, Ton Voon <ton...@al...> wrote:
> Hi Mahesh,
>
> On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
>
> > Here is what we did to resolve.
> >
> > 1. Edit the include/nagios.h.in
> > change
> > #define COMMAND_BUFFER_SLOTS 1024
> > to
> > #define COMMAND_BUFFER_SLOTS 60000
> >
> > And change
> > #define SERVICE_BUFFER_SLOTS 1024
> > to
> > #define SERVICE_BUFFER_SLOTS 60000
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the nagios
> server is under load. They appear to be blocking on writing to the command
> pipe. Switching NSCA to single-daemon mode mitigates the problem (slaves
> will time out their passive results), but we wanted to know where any
> slowdowns could be.
>
> From your findings, we've created a performance statistics patch, attached.
> This collects the maximum and current values for the command and service
> buffer slots, which are then written to status.dat (by default every 10
> seconds). What I found with a fake slave sending 128 results every 5
> seconds was that the maximum values were fairly low (under 100), but when I
> put the server under load, maximum_command_buffer_items shot up to 1969 and
> maximum_service_buffer_items shot up to 2156 (I had changed the defaults to
> your 60000).
>
> This could show whether the buffer is filled at various points, or whether
> there is not enough data ready for Nagios to process further down the chain.
>
> I'd be interested in figures from other systems.
>
> Warning: the patch is not thread safe, so there is no guarantee that the
> statistics data will not be corrupted (but this should not affect usual
> Nagios operation). Applies onto Nagios 2.5. Tested on Debian with a 2.6
> kernel.
>
> Ton
>
> http://www.altinity.com
> T: +44 (0)870 787 9243
> F: +44 (0)845 280 1725
> Skype: tonvoon
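[Editor's note] The header edit from step 1 of the thread can be scripted before building Nagios. This is only a sketch, assuming GNU sed, an unpacked Nagios 2.x source tree, and the stock 1024 defaults quoted above; the helper name is made up for illustration, and a reconfigure/rebuild is still needed afterwards.

```shell
# Sketch only: raise the command/service buffer slot defines in
# include/nagios.h.in before building Nagios. The 60000 value and the
# 1024 defaults come from the mail; pass the path to your source tree.
raise_buffer_slots() {
    # $1 = path to nagios.h.in
    sed -i \
        -e 's/#define COMMAND_BUFFER_SLOTS[[:space:]]\{1,\}1024/#define COMMAND_BUFFER_SLOTS 60000/' \
        -e 's/#define SERVICE_BUFFER_SLOTS[[:space:]]\{1,\}1024/#define SERVICE_BUFFER_SLOTS 60000/' \
        "$1"
}
```

Usage would be, e.g., `raise_buffer_slots nagios-2.5/include/nagios.h.in` followed by the usual `./configure && make`.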
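[Editor's note] Mahesh's slave-side batching (spool passive results to a file, then forward the whole file to the master every 10-15 seconds) could look roughly like the sketch below. Assumptions: send_nsca is installed on the slave, and the spool path, master hostname, and helper names are hypothetical examples, not anything from the thread. The tab-separated host/service/return-code/output line is send_nsca's standard passive-result input format.

```shell
#!/bin/sh
# Sketch of the slave-side batching described in the mail. Assumptions:
# send_nsca is on PATH; spool path and master hostname are examples.
SPOOL=${SPOOL:-/var/spool/nagios/passive-results}
NSCA_MASTER=${NSCA_MASTER:-master.example.com}

# Each check appends one send_nsca-format line per result:
#   host<TAB>service<TAB>return_code<TAB>plugin output
append_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$SPOOL"
}

# Flush the whole batch over a single send_nsca connection instead of
# one connection per result; run from cron or a loop every 10-15 s.
flush_spool() {
    [ -s "$SPOOL" ] || return 0
    if command -v send_nsca >/dev/null 2>&1; then
        send_nsca -H "$NSCA_MASTER" -c /etc/send_nsca.cfg < "$SPOOL" \
            && : > "$SPOOL"    # truncate only after a successful send
    fi
}
```

Batching this way trades a little result latency for far fewer NSCA connections, which is the point of the workaround: it stops each passive result from holding its own writer on the master's command pipe.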