From: <lin...@de...> - 2005-11-28 14:45:38
|
Hi everybody,

Unfortunately nobody answered Alex from viveconsulting.co.nz, who described a problem with "Nagios spawning rogue ..." in great detail on the nagios mailing list some months ago. Right now we are very likely hitting the same problem. I have also tried a lot of different things (from configuration changes to tuning) to track down the real cause, and my guess is that the real bottleneck is the pipe used for communication between Nagios processes. However, I found very few reports (e.g. emails) about this problem in the web and mail archives.

So why am I writing to the list? Maybe someone can give me a hint on how to solve or work around that problem. We have 677 services configured and use 350 RRDs. Our Nagios CMS is a PIII 866 MHz with SCSI RAID 5. The system load is a little more than 1.00. As long as we stay below 1.00 there is no problem, but otherwise ... (detailed problem description in Alex's mail.)

This is just our start with Nagios. We want to configure thousands of services and more than a hundred hosts. We would also invest in faster hardware (dual CPU, 2 GB memory and faster SCSI HDDs), but is faster hardware really an option? Looking at this issue with a focus on the implementation: if the pipe is the bottleneck, it will stay a bottleneck on faster hardware too. But maybe faster hardware would allow us to configure 3000 services, which would be enough for this Nagios instance. And then we deploy another Nagios instance ...

Any comment would be greatly appreciated. What were your experiences with Nagios in such an environment, and how do you use it today? Thanks a lot.

Kind regards,
Tobias Mucke

MAN Nutzfahrzeuge AG
Informationssysteme und Organisation
DV-Technologie und RZ-Betrieb
Linux-System-Technik
lin...@de... |
From: Andreas E. <ae...@op...> - 2005-11-28 16:10:27
|
lin...@de... wrote:
> Hi everybody,
>
> unfortunately nobody answered Alex from viveconsulting.co.nz who had a
> problem with "Nagios spawning rogue ..." and mailed to the nagios mailing
> list some months ago.

A link to the mail archives would be helpful.

> Right now, we very likely have the same problem he described in a very
> detailed way. I also tried a lot of different things (from configuration
> changes to tuning issues) to find out the real problem and I guess the
> real bottleneck is the pipe used for communication between Nagios
> processes.

Most likely. It's the only real bottleneck in nagios today, so...

> But I found not many reports e.g. emails about this problem in the web
> and mail archives.
>
> So why am I writing to the list? Maybe someone can give me a hint, how to
> solve or work around that problem? We have 677 services configured and use
> 350 RRDs. Our Nagios CMS is a PIII 866 MHz with SCSI RAID 5. The system
> load is a little more than 1.00. As long as we stay below 1.00 no problem,
> but otherwise ... (Detailed problem description in Alex's mail)

CMS? Content Management System? Anyways, 677 services shouldn't be a problem.

> This is just our start with Nagios. We want to configure thousands of
> services and more than a hundred hosts. We would also invest in faster
> hardware, dual CPU, 2GB memory and faster SCSI HDDs, but is faster
> hardware an option?

It helps, but not very much I'm afraid. The bottleneck requires a kernel recompile to be solved on most systems, and that's a very bad thing to do just to fix this particular problem.

> Looking at this issue with the focus on implementation: If the pipe is the
> bottleneck it will stay a bottleneck on faster hardware too. But maybe
> faster hardware will allow us to configure 3000 services, which would be
> enough for the Nagios instance. And then, we deploy another Nagios
> instance ...

This is definitely a solution. Otherwise you could keep your eyes open in the somewhat near future for a mail with

[PATCH] checks: Multiplex running checks.

in the topic. I'm working on it right now, but perhaps Ethan won't let it in for the 2.x branch since it's a fairly massive change.

--
Andreas Ericsson                   and...@op...
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231 |
From: <lin...@de...> - 2005-11-29 10:12:03
|
Hi Andi,

thanks for your answer. Here is the link to Alex's mail:

https://sourceforge.net/mailarchive/forum.php?thread_id=8135931&forum_id=1872

I thought that in Nagios terms CMS means Central Monitoring System?

A kernel recompile is not a problem for me. But I didn't find any setting called "pipe size" or even "pipe". Maybe you can give me a hint which setting I have to change.

Hopefully Ethan lets your change into the 2.x release. That would be great. I could also test it extensively and help with debugging, if you are interested.

Thanks a lot, again.

Kind regards,
Tobias Mucke

MAN Nutzfahrzeuge AG
Informationssysteme und Organisation
DV-Technologie und RZ-Betrieb
Linux-System-Technik
lin...@de... |
From: Andreas E. <ae...@op...> - 2005-11-29 10:24:39
|
lin...@de... wrote:
> Hi Andi,
>
> thanks for your answer.
>
> Here is the link to Alex's mail.
>
> https://sourceforge.net/mailarchive/forum.php?thread_id=8135931&forum_id=1872
>

Thanks. Yes, this has to do with the pipe size which, unfortunately, just isn't big enough. A solution would be to have a wrapper program listen on the pipe (for the CGIs and such), parse it to a numerical value and then pass the command on to Nagios through a local UDP socket, which can have dynamic receive buffers with a roof somewhere around 128 pages (128 * 4096 = 512KB), iirc.

> I thought that in Nagios terms CMS means Central Monitoring System?
>

That would be NMS (Network Monitoring System), although I see why you made the mistake from the original mail.

> A kernel recompile is not a problem for me. But I didn't find any setting
> called "pipe size" or even "pipe". Maybe you can give me a hint which
> setting I have to change.
>

It's not a setting. It's a macro in the kernel sources.

grep -r "FIFO.*4096" /usr/src/linux

The latest sources from git show multiple entries of DEFAULT_FIFO_LEN. You may need to change all of them and expect the machine to crash every now and then until you find the right one (which is why this shouldn't really be fixed by a kernel re-compile).

> Hopefully Ethan lets your change into the 2.x release. Would be great. I
> could also test it extensively and help with debugging, if you are
> interested.
>

I will be when I've got something to test. Thanks.

--
Andreas Ericsson                   and...@op...
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231 |
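An aside on the fixed FIFO capacity discussed above: its effect is easy to measure without touching the kernel. The following is a minimal sketch (not from this thread, assuming a Linux-like system) that fills a pipe with non-blocking one-byte writes and reports how many bytes fit before the kernel-side buffer is full; depending on kernel version this typically stops at 4096 or 65536 bytes. Anything beyond that capacity makes writers to the command pipe block, which is exactly the back-pressure the thread is about.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd[2];
        long total = 0;
        char byte = 'x';

        if (pipe(fd) < 0) {
                perror("pipe");
                return 1;
        }
        /* make writes fail with EAGAIN instead of blocking once the buffer is full */
        fcntl(fd[1], F_SETFL, O_NONBLOCK);

        while (write(fd[1], &byte, 1) == 1)
                total++;

        if (errno != EAGAIN)
                perror("write");
        printf("pipe buffer capacity: %ld bytes\n", total);
        return 0;
}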
From: Mahesh K. <mk...@gm...> - 2006-12-19 00:34:33
|
---------- Forwarded message ----------
From: Mahesh Kunjal <mk...@gm...>
Date: Dec 18, 2006 2:58 PM
Subject: Re: [Nagios-devel] Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)
To: nag...@li..., mk...@gm...

Hi Andreas

We had a similar issue. We have a distributed environment with one master and 4 slaves. The total number of hosts monitored is 1900+, with 20000+ services spread across the 4 slaves. At times we saw 14K or more results being sent in a second from the slaves. This resulted in 100+ nagios processes being created. We changed the reaper frequency to 2 seconds and played with all tunables. Nothing seemed to help.

Looking at the nagios source, this is what I found out was happening...

Nagios has a command file worker thread. When it gets woken up, it looks whether there is data in the pipe (nagios.cmd) and, if there is, forks a child process. This runs in a loop, checking the pipe for data.

Now what does the forked nagios child process do? It reads all the data from the pipe, one message at a time, and puts it in the command buffer. If it is able to write to the buffer, it just exits.

The problem here was that the command buffer had a limited size of 1024. This is the default setting in include/nagios.h.in, in the line #define COMMAND_BUFFER_SLOTS 1024. This was not enough, and the child process started to wait for memory to be freed so that the data retrieved from the pipe could be put in the buffer. While this child process waited for memory to be freed, the command worker thread got woken up, realized that there is data in the pipe and forked another child. This got repeated and eventually the server went out of memory.

Here is what we did to resolve it.

1. Edit include/nagios.h.in and change
   #define COMMAND_BUFFER_SLOTS 1024
   to
   #define COMMAND_BUFFER_SLOTS 60000
   and change
   #define SERVICE_BUFFER_SLOTS 1024
   to
   #define SERVICE_BUFFER_SLOTS 60000
2. Run ./configure (make sure you don't have nanosecond sleep enabled; also disable the embedded Perl interpreter)
3. make all; make install

- Mahesh Kunjal (maheshk)

-----------------------
This thread is located in the archive at this URL:
http://www.nagiosexchange.org/nagios-devel.33.0.html?&tx_maillisttofaq_pi1[showUid]=13177 |
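For readers following along, here is a simplified, self-contained sketch (not the actual Nagios source; the names and layout are only illustrative) of the bounded circular-buffer behaviour Mahesh describes: once every slot is occupied, the writer blocks until the consumer frees a slot, and with only 1024 slots and thousands of results per second that wait is what lets the forked children pile up.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_SLOTS 1024   /* cf. COMMAND_BUFFER_SLOTS in include/nagios.h.in */

struct circular_buffer {
        char *slots[BUFFER_SLOTS];
        int head, tail, items;
        pthread_mutex_t lock;
        pthread_cond_t not_full, not_empty;
};

/* producer side: roughly what the process draining nagios.cmd does */
void buffer_push(struct circular_buffer *buf, const char *msg)
{
        pthread_mutex_lock(&buf->lock);
        while (buf->items == BUFFER_SLOTS)              /* buffer full: block here */
                pthread_cond_wait(&buf->not_full, &buf->lock);
        buf->slots[buf->head] = strdup(msg);
        buf->head = (buf->head + 1) % BUFFER_SLOTS;
        buf->items++;
        pthread_cond_signal(&buf->not_empty);
        pthread_mutex_unlock(&buf->lock);
}

/* consumer side: roughly what the main process does when it reaps results */
char *buffer_pop(struct circular_buffer *buf)
{
        char *msg;

        pthread_mutex_lock(&buf->lock);
        while (buf->items == 0)
                pthread_cond_wait(&buf->not_empty, &buf->lock);
        msg = buf->slots[buf->tail];
        buf->tail = (buf->tail + 1) % BUFFER_SLOTS;
        buf->items--;
        pthread_cond_signal(&buf->not_full);
        pthread_mutex_unlock(&buf->lock);
        return msg;
}

(The mutex and condition variables are assumed to be initialised with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER. Raising the slot count, as in the fix above, simply makes the blocking case far less likely.)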
From: Ton V. <ton...@al...> - 2006-12-21 11:54:48
|
Hi Mahesh,

On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
> Here is what we did to resolve it.
>
> 1. Edit include/nagios.h.in
> change
> #define COMMAND_BUFFER_SLOTS 1024
> to
> #define COMMAND_BUFFER_SLOTS 60000
>
> And change
> #define SERVICE_BUFFER_SLOTS 1024
> to
> #define SERVICE_BUFFER_SLOTS 60000
>

I was intrigued by this as we have a performance issue, but not with the same symptoms. Our problem is that NSCA processes increase when the nagios server is under load. They appear to be blocking on writing to the command pipe. Switching NSCA to single-daemon mode mitigates the problem (slaves will time out their passive results), but we wanted to know where any slowdowns could be.

From your findings, we've created a performance statistics patch, attached. This collects the maximum and current values for the command and service buffer slots, which are then written to status.dat (by default every 10 seconds). What I found with a fake slave sending 128 results every 5 seconds was that the maximum values were fairly low (under 100), but when I put the server under load, the maximum_command_buffer_items shot up to 1969 and the maximum_service_buffer_items shot up to 2156 (I had changed the defaults to your 60000).

This could show whether the buffer is filled at various points or whether there is not enough data ready for Nagios to process further down the chain.

I'd be interested in figures from other systems.

Warning: the patch is not thread safe, so there are no guarantees that the statistics data will not be corrupted (but it should not affect usual Nagios operation). Applies onto Nagios 2.5. Tested on Debian with a 2.6 kernel.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon |
From: Mahesh K. <mk...@gm...> - 2006-12-21 16:47:43
|
Hi Ton!

> > Here is what we did to resolve it.
> >
> > 1. Edit include/nagios.h.in
> > change
> > #define COMMAND_BUFFER_SLOTS 1024
> > to
> > #define COMMAND_BUFFER_SLOTS 60000
> >
> > And change
> > #define SERVICE_BUFFER_SLOTS 1024
> > to
> > #define SERVICE_BUFFER_SLOTS 60000
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the nagios
> server is under load. They appear to be blocking on writing to the command
> pipe. Switching NSCA to single daemon mitigates the problem (slaves will
> timeout their passive results), but we wanted to know where any slow downs
> could be.

We had the NSCA-related performance issues too. On the slaves, we started writing the results that are to be forwarded to the master into a file. Then, once every 10 or 15 seconds, we send that file over to the master.

On 12/21/06, Ton Voon <ton...@al...> wrote:
[snip] |
From: Ethan G. <na...@na...> - 2006-12-21 16:56:52
|
Good work on nailing down the problem to the command buffer slots! Sounds like this problem might affect a number of users, so I think we need to patch Nagios. There are two possible solutions: 1. Bump up the default buffer slots to something larger. Since Nagios only immediately allocates memory for pointers, the additional memory overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of slots). 2. Moving the slots definitions out to command file variables. This is a better solution than having to edit the code and recompile. Thoughts? Ton Voon wrote: > Hi Mahesh, > > On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote: > >> Here is what we did to resolve. >> >> 1. Edit the include/nagios.h.in >> change >> #define COMMAND_BUFFER_SLOTS 1024 >> to >> #define COMMAND_BUFFER_SLOTS 60000 >> >> And change >> #define SERVICE_BUFFER_SLOTS 1024 >> to >> #define SERVICE_BUFFER_SLOTS 60000 >> > > I was intrigued by this as we have a performance issue, but not with the > same symptoms. Our problem is that NSCA processes increase when the > nagios server is under load. They appear to be blocking on writing to > the command pipe. Switching NSCA to single daemon mitigates the problem > (slaves will timeout their passive results), but we wanted to know where > any slow downs could be. > > From your findings, we've created a performance static patch, attached. > This collects the maximum and current values for the command and service > buffer slots and is then written to status.dat (by default every 10 > seconds). What I found with a fake slave sending 128 results every 5 > seconds was that the maximum values were fairly low (under 100), but > when I put the server under load, the maximum_command_buffer_items shot > up to 1969 and the maximum_service_buffer_items shot up to 2156 (had > changed from defaults to your 60000). > > This could show if the buffer is filled at various points or if there is > not enough data ready for Nagios to process further down the chain. > > I'd be interested in figures from other systems. > > Warning: the patch is not thread safe, so there is no guarantees that > the statistic data will not be corrupted (but should not affect usual > Nagios operation). Applies onto Nagios 2.5. Tested on Debian with 2.6 > kernel. > > Ton > > http://www.altinity.com > T: +44 (0)870 787 9243 > F: +44 (0)845 280 1725 > Skype: tonvoon > Ethan Galstad, Nagios Developer --- Email: na...@na... Website: http://www.nagios.org |
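To put rough numbers on option 1 (a back-of-the-envelope sketch, not from the thread): only the pointer array is paid for up front, while message storage is only allocated for slots actually in use.

#include <stdio.h>

int main(void)
{
        unsigned long slots = 60000;                 /* the value Mahesh used */
        unsigned long up_front = slots * sizeof(char **);
        unsigned long worst_case = slots * 1024;     /* if every slot held a MAX_INPUT_BUFFER-sized message */

        /* ~234 KB of pointers on a 32-bit box (~469 KB on 64-bit);
         * ~59 MB only if every single slot fills up at once */
        printf("pointer overhead: %lu KB\n", up_front / 1024);
        printf("worst case with every slot in use: %lu MB\n", worst_case / (1024 * 1024));
        return 0;
}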
From: Joerg L. <pit...@ed...> - 2006-12-21 17:07:37
|
On Thursday, 21 December 2006 at 17:56, Ethan Galstad wrote:
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I think we
> need to patch Nagios. There are two possible solutions:
>
> 1. Bump up the default buffer slots to something larger. Since Nagios
> only immediately allocates memory for pointers, the additional memory
> overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of
> slots).
>
> 2. Moving the slots definitions out to command file variables. This is
> a better solution than having to edit the code and recompile.

Yes, make variables for the buffer slots.

Can you please adapt nagiostats to provide the current and max buffer slots?

Jörg |
From: Ethan G. <na...@na...> - 2006-12-21 17:16:18
|
Joerg Linge wrote:
[snip]
>>
>> 2. Moving the slots definitions out to command file variables. This is
>> a better solution than having to edit the code and recompile.
>
> Yes, make variables for the buffer slots.
>
> Can you please adapt nagiostats to provide the current and max buffer slots?
>
> Jörg

Awesome idea! I'll definitely add that.

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org |
From: Mahesh K. <mk...@gm...> - 2006-12-21 17:14:13
|
On 12/21/06, Ethan Galstad <na...@na...> wrote:
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I think we
> need to patch Nagios. There are two possible solutions:
>
> 1. Bump up the default buffer slots to something larger. Since Nagios
> only immediately allocates memory for pointers, the additional memory
> overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of
> slots).
>

Since nagios.h is generated by configure, this number could also be derived by the configure script based on the RAM available.

> 2. Moving the slots definitions out to command file variables. This is
> a better solution than having to edit the code and recompile.

Yes, this would be the better solution. Also, could nagiostats display additional information, such as the buffers (command & service) in use (how many messages they currently hold), the number of messages in the pipe (nagios.cmd) and the number of messages in the message queue?

> Thoughts?
>
[snip] |
From: Thomas Guyot-S. <Th...@za...> - 2006-12-21 17:22:07
Attachments:
smime.p7s
|
> From: nag...@li...
> [mailto:nag...@li...] On Behalf
> Of Ethan Galstad
> Sent: December 21, 2006 11:57
> To: Nagios-Devel
> Subject: Re: [Nagios-devel] Problems with many hanging Nagios
> processes (Nagios spawning rogue nagios processes eventually
> crashing Nagios server)
>
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I
> think we need to patch Nagios. There are two possible solutions:
>

Ethan,

Do you think this bug could be the cause of my issue where I lose passive check results under load / when many passive checks come in at the same time?

Thanks,

Thomas |
From: Ethan G. <na...@na...> - 2007-01-03 03:25:45
|
Thomas Guyot-Sionnest wrote:
[snip]
>
> Ethan,
>
> Do you think this bug could be the cause of my issue where I lose passive
> check results under load / when many passive checks come in at the same
> time?
>
> Thanks,
>
> Thomas
>

Hi Thomas -

I think I must have missed your original problem description in the mass of emails that I've been behind on. :-)

I guess you'd have to test the new code to see if that fixes your problem for sure. If I'm correct, you shouldn't lose passive checks under heavy load conditions unless you exhaust physical memory and something like the infamous OOM killer starts killing processes. Under normal situations, passive checks would just be delayed under heavy load.

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org |
From: Andreas E. <ae...@op...> - 2007-01-03 10:04:07
|
Ethan Galstad wrote: > Good work on nailing down the problem to the command buffer slots! > Sounds like this problem might affect a number of users, so I think we > need to patch Nagios. There are two possible solutions: > > 1. Bump up the default buffer slots to something larger. Since Nagios > only immediately allocates memory for pointers, the additional memory > overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of > slots). > > 2. Moving the slots definitions out to command file variables. This is > a better solution than having to edit the code and recompile. > > Thoughts? > 3. Make the number of slots dynamic and allocate memory as needed. It should never release any allocated memory, but just increase the number of buffer slots as needed. One probably wants to allocate the buffer slots in chunks of sysconf(_SC_PAGESIZE) / (sizeof(char *)) to keep it to one page at a time, which will prevent expensive memory copying on realloc(). -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 |
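A minimal sketch of what option 3 could look like (illustrative only, not actual Nagios code): grow the slot array on demand, one page worth of pointers per step, and never shrink it.

#include <stdlib.h>
#include <unistd.h>

static char **slots;
static size_t slots_allocated;   /* current capacity, in slots */
static size_t slots_used;

/* make sure at least one free slot exists; returns 0 on success, -1 on OOM */
int ensure_free_slot(void)
{
        size_t chunk, new_size;
        char **tmp;

        if (slots_used < slots_allocated)
                return 0;

        /* grow by one page of pointers at a time, as suggested above */
        chunk = (size_t)sysconf(_SC_PAGESIZE) / sizeof(char *);
        new_size = slots_allocated + chunk;

        tmp = realloc(slots, new_size * sizeof(char *));
        if (tmp == NULL)
                return -1;

        slots = tmp;
        slots_allocated = new_size;
        return 0;
}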
From: <an...@pr...> - 2006-12-22 12:21:55
|
Hi all,

as mentioned in Ethan's thread about testing the current branch version, I am afraid the problems are not only in the buffers.

I have talked to a colleague of mine while looking at the sources, especially event.c around line 1079:

####
if(run_event==TRUE){

        /* remove the first event from the timing loop */
        temp_event=event_list_low;
        event_list_low=event_list_low->next;

        /* handle the event */
        handle_timed_event(temp_event);   /* <-- this is line 1079 */

        /* reschedule the event if necessary */
        if(temp_event->recurring==TRUE)
                reschedule_event(temp_event,&event_list_low);

        /* else free memory associated with the event */
        else
                free(temp_event);
}
####

The handle_timed_event() function itself starts around line 1154. If I am right, this is the worker part that does everything for nagios: starts checks, gets check results (the reaper), runs freshness checks and everything else.

Does this part work serialized (one event after another) or is it threaded beforehand? If it is serialized, would it be possible to parallelize it?

Does anyone know how long the processing of handle_timed_event() takes? (Just a question up front; I will test it after this mail by compiling with debug3.)

Just my 2 cents.

Best wishes
Hendrik

Ton Voon wrote:
[snip] |
From: Ethan G. <na...@na...> - 2007-01-03 03:31:26
|
Hendrik Bäcker wrote:
> Hi all,
>
> as mentioned in Ethan's thread about testing the current branch version,
> I am afraid the problems are not only in the buffers.
>
> Does this part work serialized (one event after another) or is it
> threaded beforehand? If it is serialized, would it be possible to
> parallelize it?
>
[snip]

Most things in Nagios are performed in a serial fashion. They include event handlers, starting service checks, running the OCSP command, updating the status log, etc. All these actions are kicked off by the handle_timed_event() function, which runs each thing serially.

Although the process of starting service checks is handled serially, the actual execution runs in parallel, and the service check reaper (which collects service check results) runs as its own thread. The processing of service check results is handled in a serial fashion, although this is not a time-intensive process like the actual execution of a check.

The execution of host checks is the huge holdup in Nagios 2.x, and host checks are (for the most part) parallelized in Nagios 3.x, so that will help in the future. Things like event handlers, notifications, etc. are hard to run in parallel while ensuring that certain things happen in a particular, repeatable order.

Hope that helps. It can get more confusing the more you look into things. :-)

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org |
From: Andreas E. <ae...@op...> - 2006-12-19 09:08:52
|
Mahesh Kunjal wrote: > > > We had similar issue. We have a distributed environment with one master and 4 slaves. Total number of hosts monitored are 1900+ and > 20000+ services spread across 4 slaves. > > At times we saw 14K or more results being sent in a second from slaves. This resulted in 100+ nagios processes being created. > > Changed reaper frequency to 2 seconds and played with all tunables. > Nothing seemed to help. > > Looking at the nagios source, > This is what I found out was happening... > > Nagios has a commands file worker thread and when it gets woken up, looks if there is data in pipe(nagios.cmd), if exists, forks a child process. This will be in a loop and checks the pipe for data. > > Now what does the forked nagios child process do? > It reads all the data from the pipe one message a time and puts it in commands buffer. If if is able to write to buffer, just exits. > > The problem here was command buffer had a limited size of 1024. This is the default setting in include/nagios.h.in and is in the line #define COMMAND_BUFFER_SLOTS 1024. This is the number of buffers that will be available for writing into, not the number of total bytes available. Each command buffer slot holds MAX_INPUT_BUFFER bytes. > > This was not enough and the child process started to wait for memory to be freed so that the pipe data retrieved can be put in buffer. > > While this child process waited for memory to be freed, the command worker thread got woken up and realized that there is data in pipe and forked another child. This got repeated and eventually server went out of memory. > A very concise and correct description of what's going on. Thanks. > Here is what we did to resolve. > > 1. Edit the include/nagios.h.in > change > #define COMMAND_BUFFER_SLOTS 1024 > to > #define COMMAND_BUFFER_SLOTS 60000 > > And change > #define SERVICE_BUFFER_SLOTS 1024 > to > #define SERVICE_BUFFER_SLOTS 60000 > This would indeed solve the problem, although you could have gotten away with the same amount of SERVICE_BUFFER_SLOTS as there are services configured on the system, and the same amount of COMMAND_BUFFER_SLOTS as there are hosts and services. Provided the slaves also send passive hostchecks, ofc, otherwise you can set it to the amount of services instead. It should also be noted that these settings shouldn't be modified unless needed, as it will make Nagios use quite a bit more memory per default. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 |
From: Mahesh K. <mk...@gm...> - 2006-12-19 15:36:08
|
On 12/19/06, Andreas Ericsson <ae...@op...> wrote:
> > The problem here was that the command buffer had a limited size of 1024.
> > This is the default setting in include/nagios.h.in, in the line
> > #define COMMAND_BUFFER_SLOTS 1024.
>
> This is the number of buffers that will be available for writing into,
> not the number of total bytes available. Each command buffer slot holds
> MAX_INPUT_BUFFER bytes.

Yes, each message is of MAX_INPUT_BUFFER size, which defaults to 1024. What I was getting at was the number of messages (results) you can write.

> > While this child process waited for memory to be freed, the command
> > worker thread got woken up and realized that there is data in the pipe
> > and forked another child. This got repeated and eventually the server
> > went out of memory.
>
> A very concise and correct description of what's going on. Thanks.

:)

> > Here is what we did to resolve it.
> >
> > 1. Edit include/nagios.h.in
> > change
> > #define COMMAND_BUFFER_SLOTS 1024
> > to
> > #define COMMAND_BUFFER_SLOTS 60000
> >
> > And change
> > #define SERVICE_BUFFER_SLOTS 1024
> > to
> > #define SERVICE_BUFFER_SLOTS 60000
>
> This would indeed solve the problem, although you could have gotten away
> with the same amount of SERVICE_BUFFER_SLOTS as there are services
> configured on the system, and the same amount of COMMAND_BUFFER_SLOTS as
> there are hosts and services. Provided the slaves also send passive
> hostchecks, ofc, otherwise you can set it to the amount of services
> instead.

The customer was planning on adding more services. Right now, at peak we got 14K results in a second, and a reaper frequency of 2 seconds could fill 28K slots in the command buffer. We came up with 60000 just in case nagios is not digesting the buffer fast enough.

> It should also be noted that these settings shouldn't be modified unless
> needed, as it will make Nagios use quite a bit more memory by default.

What I remember from looking at the code is that the child process allocates memory per message read, and only while the number of messages in the buffer is less than COMMAND_BUFFER_SLOTS. My understanding is that Nagios won't allocate all COMMAND_BUFFER_SLOTS slots up front. They are only used if results are coming in at short intervals and/or are not being processed fast enough.

----
Mahesh Kunjal
mk...@gm... |
From: Gaspar, C. <Car...@gs...> - 2006-12-20 00:58:34
|
Mahesh Kunjal wrote:
>
> This was not enough and the child process started to wait for memory to
> be freed so that the pipe data retrieved can be put in buffer.
>
> While this child process waited for memory to be freed, the command
> worker thread got woken up and realized that there is data in pipe and
> forked another child. This got repeated and eventually server went out
> of memory.

This is a bug. A second reader should not be created if a prior reader still exists. Locking is required here.

--
Carson |
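A sketch of the kind of guard Carson is asking for (hypothetical names, not a real patch): remember the PID of the pipe-reading child and refuse to fork another one while it is still alive.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t reader_pid;   /* 0 = no reader has been started yet */

void maybe_spawn_reader(void (*drain_pipe)(void))
{
        /* if a previous reader is still running, do not create a second one */
        if (reader_pid > 0) {
                if (waitpid(reader_pid, NULL, WNOHANG) == 0)
                        return;          /* still alive */
                reader_pid = 0;          /* it has exited and is now reaped */
        }

        reader_pid = fork();
        if (reader_pid == 0) {
                drain_pipe();            /* child: empty nagios.cmd into the command buffer */
                _exit(0);
        }
}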
From: Percy J. <ja...@fg...> - 2007-04-10 11:21:09
|
Gaspar, Carson wrote:
> Mahesh Kunjal wrote:
>
>> This was not enough and the child process started to wait for memory
>> to be freed so that the pipe data retrieved can be put in buffer.
>>
>> While this child process waited for memory to be freed, the command
>> worker thread got woken up and realized that there is data in pipe and
>> forked another child. This got repeated and eventually server went out
>> of memory.
>
> This is a bug. A second reader should not be created if a prior reader
> still exists. Locking is required here.
>

This is a large bug concerning all major installations of nagios. We've located this bug too and are now working on a solution. We would like to solve it by spawning a thread that does the job of the problem-causing processes. I hope a working patch will be available soon.

Best regards
Percy Jahn |
From: Daniel M. <ea...@cy...> - 2006-12-21 13:05:04
|
Hi Ton,

the patch does not work; I think there's a typo here:

+ fprintf(fp,"\tmax_command_buffer_items=%d\n", max_command_buffer_items);
+ fprintf(fp,"\tcurrent_service_buffer=%d\n", service_result_buffer.items);
+ fprintf(fi,"\tmax_service_buffer_items=%d\n", max_service_buffer_items);

Shouldn't it be fprintf(fp... in all lines?

Danny

--
Q: Gentoo is too hard to install   = http://www.cyberdelia.de
   and I feel like whining.        = ea...@cy...
A: Please see /dev/null.           = (from the gentoo installer FAQ)
                                   = \o/ |
From: Ton V. <ton...@al...> - 2006-12-21 16:23:41
|
Hi Daniel,

On 21 Dec 2006, at 13:04, Daniel Meyer wrote:
> Hi Ton,
>
> the patch does not work, I think there's a typo here:
>
> + fprintf(fp,"\tmax_command_buffer_items=%d\n", max_command_buffer_items);
> + fprintf(fp,"\tcurrent_service_buffer=%d\n", service_result_buffer.items);
> + fprintf(fi,"\tmax_service_buffer_items=%d\n", max_service_buffer_items);
>
> Shouldn't it be fprintf(fp... in all lines?

Right you are - sorry. Patch redone.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon |