From: <lin...@de...> - 2005-11-28 14:45:38
|
Hi everybody,

Unfortunately nobody answered Alex from viveconsulting.co.nz, who described a problem with "Nagios spawning rogue ..." in great detail on the nagios mailing list some months ago. Right now we are very likely hitting the same problem. I have also tried a lot of different things (from configuration changes to tuning) to track down the real cause, and my guess is that the real bottleneck is the pipe used for communication between Nagios processes. However, I found very few reports (e.g. emails) about this problem in the web and mail archives.

So why am I writing to the list? Maybe someone can give me a hint on how to solve or work around that problem. We have 677 services configured and use 350 RRDs. Our Nagios CMS is a PIII 866 MHz with SCSI RAID 5. The system load is a little more than 1.00. As long as we stay below 1.00 there is no problem, but otherwise ... (detailed problem description in Alex's mail.)

This is just our start with Nagios. We want to configure thousands of services and more than a hundred hosts. We would also invest in faster hardware (dual CPU, 2 GB memory and faster SCSI HDDs), but is faster hardware really an option? Looking at this issue with a focus on the implementation: if the pipe is the bottleneck, it will stay a bottleneck on faster hardware too. But maybe faster hardware would allow us to configure 3000 services, which would be enough for this Nagios instance. And then we deploy another Nagios instance ...

Any comment would be greatly appreciated. What were your experiences with Nagios in such an environment, and how do you use it today? Thanks a lot.

Kind regards,
Tobias Mucke

MAN Nutzfahrzeuge AG
Informationssysteme und Organisation
DV-Technologie und RZ-Betrieb
Linux-System-Technik
lin...@de... |
From: Andreas E. <ae...@op...> - 2005-11-28 16:10:27
|
lin...@de... wrote:
> Hi everybody,
>
> unfortunately nobody answered Alex from viveconsulting.co.nz who had a
> problem with "Nagios spawning rogue ..." and mailed to the nagios mailing
> list some months ago.

A link to the mail archives would be helpful.

> Right now, we very likely have the same problem he described in a very
> detailed way. I also tried a lot of different things (from configuration
> changes to tuning issues) to find out the real problem and I guess the
> real bottleneck is the pipe used for communication between Nagios
> processes.

Most likely. It's the only real bottleneck in nagios today, so...

> But I found not many reports e.g. emails about this problem in the web
> and mail archives.
>
> So why am I writing to the list? Maybe someone can give me a hint, how to
> solve or work around that problem? We have 677 services configured and use
> 350 RRDs. Our Nagios CMS is a PIII 866 MHz with SCSI RAID 5. The system
> load is a little more than 1.00. As long as we stay below 1.00 no problem,
> but otherwise ... (Detailed problem description in Alex's mail)

CMS? Content Management System? Anyways, 677 services shouldn't be a problem.

> This is just our start with Nagios. We want to configure thousands of
> services and more than a hundred hosts. We would also invest in faster
> hardware, dual CPU, 2GB memory and faster SCSI HDDs, but is faster
> hardware an option?

It helps, but not very much I'm afraid. The bottleneck requires a kernel recompile to be solved on most systems, and that's a very bad thing to do just to fix this particular problem.

> Looking at this issue with the focus on implementation: If the pipe is the
> bottleneck it will stay a bottleneck on faster hardware too. But maybe
> faster hardware will allow us to configure 3000 services, which would be
> enough for the Nagios instance. And then, we deploy another Nagios
> instance ...

This is definitely a solution. Otherwise you could keep your eyes open in the somewhat near future for a mail with

[PATCH] checks: Multiplex running checks.

in the topic. I'm working on it right now, but perhaps Ethan won't let it in for the 2.x branch since it's a fairly massive change.

--
Andreas Ericsson                   and...@op...
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231 |
From: <lin...@de...> - 2005-11-29 10:12:03
|
Hi Andi,

thanks for your answer. Here is the link to Alex's mail:

https://sourceforge.net/mailarchive/forum.php?thread_id=8135931&forum_id=1872

I thought that in Nagios terms CMS means Central Monitoring System?

A kernel recompile is not a problem for me. But I didn't find any setting called "pipe size" or even "pipe". Maybe you can give me a hint which setting I have to change.

Hopefully Ethan lets your change into the 2.x release. That would be great. I could also test it extensively and help with debugging, if you are interested.

Thanks a lot, again.

Kind regards,
Tobias Mucke

MAN Nutzfahrzeuge AG
Informationssysteme und Organisation
DV-Technologie und RZ-Betrieb
Linux-System-Technik
lin...@de... |
From: Andreas E. <ae...@op...> - 2005-11-29 10:24:39
|
lin...@de... wrote:
> Hi Andi,
>
> thanks for your answer.
>
> Here is the link to Alex's mail.
>
> https://sourceforge.net/mailarchive/forum.php?thread_id=8135931&forum_id=1872
>

Thanks. Yes, this has to do with the pipe size which, unfortunately, just isn't big enough. A solution would be to have a wrapper program listen on the pipe (for the CGIs and such), parse it to a numerical value and then pass the command on to Nagios through a local UDP socket, which can have dynamic receive buffers with a roof somewhere around 128 pages (128 * 4096 = 512KB), iirc.

> I thought that in Nagios terms CMS means Central Monitoring System?
>

That would be NMS (Network Monitoring System), although I see why you made the mistake from the original mail.

> A kernel recompile is not a problem for me. But I didn't find any setting
> called "pipe size" or even "pipe". Maybe you can give me a hint which
> setting I have to change.
>

It's not a setting. It's a macro in the kernel sources.

grep -r "FIFO.*4096" /usr/src/linux

The latest sources from git show multiple entries of DEFAULT_FIFO_LEN. You may need to change all of them and expect the machine to crash every now and then until you find the right one (which is why this shouldn't really be fixed by a kernel re-compile).

> Hopefully Ethan lets your change into the 2.x release. Would be great. I
> could also test it extensively and help with debugging, if you are
> interested.
>

I will be when I've got something to test. Thanks.

--
Andreas Ericsson                   and...@op...
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231 |
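An aside on the fixed FIFO capacity discussed above: its effect is easy to measure without touching the kernel. The following is a minimal sketch (not from this thread, assuming a Linux-like system) that fills a pipe with non-blocking one-byte writes and reports how many bytes fit before the kernel-side buffer is full; depending on kernel version this typically stops at 4096 or 65536 bytes. Anything beyond that capacity makes writers to the command pipe block, which is exactly the back-pressure the thread is about.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd[2];
        long total = 0;
        char byte = 'x';

        if (pipe(fd) < 0) {
                perror("pipe");
                return 1;
        }
        /* make writes fail with EAGAIN instead of blocking once the buffer is full */
        fcntl(fd[1], F_SETFL, O_NONBLOCK);

        while (write(fd[1], &byte, 1) == 1)
                total++;

        if (errno != EAGAIN)
                perror("write");
        printf("pipe buffer capacity: %ld bytes\n", total);
        return 0;
}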
From: Mahesh K. <mk...@gm...> - 2006-12-19 00:34:33
|
---------- Forwarded message ----------
From: Mahesh Kunjal <mk...@gm...>
Date: Dec 18, 2006 2:58 PM
Subject: Re: [Nagios-devel] Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)
To: nag...@li..., mk...@gm...

Hi Andreas

We had a similar issue. We have a distributed environment with one master and 4 slaves. The total number of hosts monitored is 1900+, with 20000+ services spread across the 4 slaves. At times we saw 14K or more results being sent in a second from the slaves. This resulted in 100+ nagios processes being created. We changed the reaper frequency to 2 seconds and played with all tunables. Nothing seemed to help.

Looking at the nagios source, this is what I found out was happening...

Nagios has a command file worker thread. When it gets woken up, it looks whether there is data in the pipe (nagios.cmd) and, if there is, forks a child process. This runs in a loop, checking the pipe for data.

Now what does the forked nagios child process do? It reads all the data from the pipe, one message at a time, and puts it in the command buffer. If it is able to write to the buffer, it just exits.

The problem here was that the command buffer had a limited size of 1024. This is the default setting in include/nagios.h.in, in the line #define COMMAND_BUFFER_SLOTS 1024. This was not enough, and the child process started to wait for memory to be freed so that the data retrieved from the pipe could be put in the buffer. While this child process waited for memory to be freed, the command worker thread got woken up, realized that there is data in the pipe and forked another child. This got repeated and eventually the server went out of memory.

Here is what we did to resolve it.

1. Edit include/nagios.h.in and change
   #define COMMAND_BUFFER_SLOTS 1024
   to
   #define COMMAND_BUFFER_SLOTS 60000
   and change
   #define SERVICE_BUFFER_SLOTS 1024
   to
   #define SERVICE_BUFFER_SLOTS 60000
2. Run ./configure (make sure you don't have nanosecond sleep enabled; also disable the embedded Perl interpreter)
3. make all; make install

- Mahesh Kunjal (maheshk)

-----------------------
This thread is located in the archive at this URL:
http://www.nagiosexchange.org/nagios-devel.33.0.html?&tx_maillisttofaq_pi1[showUid]=13177 |
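For readers following along, here is a simplified, self-contained sketch (not the actual Nagios source; the names and layout are only illustrative) of the bounded circular-buffer behaviour Mahesh describes: once every slot is occupied, the writer blocks until the consumer frees a slot, and with only 1024 slots and thousands of results per second that wait is what lets the forked children pile up.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_SLOTS 1024   /* cf. COMMAND_BUFFER_SLOTS in include/nagios.h.in */

struct circular_buffer {
        char *slots[BUFFER_SLOTS];
        int head, tail, items;
        pthread_mutex_t lock;
        pthread_cond_t not_full, not_empty;
};

/* producer side: roughly what the process draining nagios.cmd does */
void buffer_push(struct circular_buffer *buf, const char *msg)
{
        pthread_mutex_lock(&buf->lock);
        while (buf->items == BUFFER_SLOTS)              /* buffer full: block here */
                pthread_cond_wait(&buf->not_full, &buf->lock);
        buf->slots[buf->head] = strdup(msg);
        buf->head = (buf->head + 1) % BUFFER_SLOTS;
        buf->items++;
        pthread_cond_signal(&buf->not_empty);
        pthread_mutex_unlock(&buf->lock);
}

/* consumer side: roughly what the main process does when it reaps results */
char *buffer_pop(struct circular_buffer *buf)
{
        char *msg;

        pthread_mutex_lock(&buf->lock);
        while (buf->items == 0)
                pthread_cond_wait(&buf->not_empty, &buf->lock);
        msg = buf->slots[buf->tail];
        buf->tail = (buf->tail + 1) % BUFFER_SLOTS;
        buf->items--;
        pthread_cond_signal(&buf->not_full);
        pthread_mutex_unlock(&buf->lock);
        return msg;
}

(The mutex and condition variables are assumed to be initialised with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER. Raising the slot count, as in the fix above, simply makes the blocking case far less likely.)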
From: Ton V. <ton...@al...> - 2006-12-21 11:54:48
|
Hi Mahesh,

On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
> Here is what we did to resolve it.
>
> 1. Edit include/nagios.h.in
> change
> #define COMMAND_BUFFER_SLOTS 1024
> to
> #define COMMAND_BUFFER_SLOTS 60000
>
> And change
> #define SERVICE_BUFFER_SLOTS 1024
> to
> #define SERVICE_BUFFER_SLOTS 60000
>

I was intrigued by this as we have a performance issue, but not with the same symptoms. Our problem is that NSCA processes increase when the nagios server is under load. They appear to be blocking on writing to the command pipe. Switching NSCA to single-daemon mode mitigates the problem (slaves will time out their passive results), but we wanted to know where any slowdowns could be.

From your findings, we've created a performance statistics patch, attached. This collects the maximum and current values for the command and service buffer slots, which are then written to status.dat (by default every 10 seconds). What I found with a fake slave sending 128 results every 5 seconds was that the maximum values were fairly low (under 100), but when I put the server under load, the maximum_command_buffer_items shot up to 1969 and the maximum_service_buffer_items shot up to 2156 (I had changed the defaults to your 60000).

This could show whether the buffer is filled at various points or whether there is not enough data ready for Nagios to process further down the chain.

I'd be interested in figures from other systems.

Warning: the patch is not thread safe, so there are no guarantees that the statistics data will not be corrupted (but it should not affect usual Nagios operation). Applies onto Nagios 2.5. Tested on Debian with a 2.6 kernel.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon |
From: Mahesh K. <mk...@gm...> - 2006-12-21 16:47:43
|
Hi Ton!

> > Here is what we did to resolve it.
> >
> > 1. Edit include/nagios.h.in
> > change
> > #define COMMAND_BUFFER_SLOTS 1024
> > to
> > #define COMMAND_BUFFER_SLOTS 60000
> >
> > And change
> > #define SERVICE_BUFFER_SLOTS 1024
> > to
> > #define SERVICE_BUFFER_SLOTS 60000
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the nagios
> server is under load. They appear to be blocking on writing to the command
> pipe. Switching NSCA to single daemon mitigates the problem (slaves will
> timeout their passive results), but we wanted to know where any slow downs
> could be.

We had the NSCA-related performance issues too. On the slaves, we started writing the results that are to be forwarded to the master into a file. Then, once every 10 or 15 seconds, we send that file over to the master.

On 12/21/06, Ton Voon <ton...@al...> wrote:
[snip] |
From: Ethan G. <na...@na...> - 2006-12-21 16:56:52
|
Good work on nailing down the problem to the command buffer slots! Sounds like this problem might affect a number of users, so I think we need to patch Nagios. There are two possible solutions: 1. Bump up the default buffer slots to something larger. Since Nagios only immediately allocates memory for pointers, the additional memory overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of slots). 2. Moving the slots definitions out to command file variables. This is a better solution than having to edit the code and recompile. Thoughts? Ton Voon wrote: > Hi Mahesh, > > On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote: > >> Here is what we did to resolve. >> >> 1. Edit the include/nagios.h.in >> change >> #define COMMAND_BUFFER_SLOTS 1024 >> to >> #define COMMAND_BUFFER_SLOTS 60000 >> >> And change >> #define SERVICE_BUFFER_SLOTS 1024 >> to >> #define SERVICE_BUFFER_SLOTS 60000 >> > > I was intrigued by this as we have a performance issue, but not with the > same symptoms. Our problem is that NSCA processes increase when the > nagios server is under load. They appear to be blocking on writing to > the command pipe. Switching NSCA to single daemon mitigates the problem > (slaves will timeout their passive results), but we wanted to know where > any slow downs could be. > > From your findings, we've created a performance static patch, attached. > This collects the maximum and current values for the command and service > buffer slots and is then written to status.dat (by default every 10 > seconds). What I found with a fake slave sending 128 results every 5 > seconds was that the maximum values were fairly low (under 100), but > when I put the server under load, the maximum_command_buffer_items shot > up to 1969 and the maximum_service_buffer_items shot up to 2156 (had > changed from defaults to your 60000). > > This could show if the buffer is filled at various points or if there is > not enough data ready for Nagios to process further down the chain. > > I'd be interested in figures from other systems. > > Warning: the patch is not thread safe, so there is no guarantees that > the statistic data will not be corrupted (but should not affect usual > Nagios operation). Applies onto Nagios 2.5. Tested on Debian with 2.6 > kernel. > > Ton > > http://www.altinity.com > T: +44 (0)870 787 9243 > F: +44 (0)845 280 1725 > Skype: tonvoon > Ethan Galstad, Nagios Developer --- Email: na...@na... Website: http://www.nagios.org |
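To put rough numbers on option 1 (a back-of-the-envelope sketch, not from the thread): only the pointer array is paid for up front, while message storage is only allocated for slots actually in use.

#include <stdio.h>

int main(void)
{
        unsigned long slots = 60000;                 /* the value Mahesh used */
        unsigned long up_front = slots * sizeof(char **);
        unsigned long worst_case = slots * 1024;     /* if every slot held a MAX_INPUT_BUFFER-sized message */

        /* ~234 KB of pointers on a 32-bit box (~469 KB on 64-bit);
         * ~59 MB only if every single slot fills up at once */
        printf("pointer overhead: %lu KB\n", up_front / 1024);
        printf("worst case with every slot in use: %lu MB\n", worst_case / (1024 * 1024));
        return 0;
}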
From: Joerg L. <pit...@ed...> - 2006-12-21 17:07:37
|
On Thursday, 21 December 2006 at 17:56, Ethan Galstad wrote:
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I think we
> need to patch Nagios. There are two possible solutions:
>
> 1. Bump up the default buffer slots to something larger. Since Nagios
> only immediately allocates memory for pointers, the additional memory
> overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of
> slots).
>
> 2. Moving the slots definitions out to command file variables. This is
> a better solution than having to edit the code and recompile.

Yes, make variables for the buffer slots.

Can you please adapt nagiostats to provide the current and max buffer slots?

Jörg |
From: Ethan G. <na...@na...> - 2006-12-21 17:16:18
|
Joerg Linge wrote:
[snip]
>>
>> 2. Moving the slots definitions out to command file variables. This is
>> a better solution than having to edit the code and recompile.
>
> Yes, make variables for the buffer slots.
>
> Can you please adapt nagiostats to provide the current and max buffer slots?
>
> Jörg

Awesome idea! I'll definitely add that.

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org |
From: Mahesh K. <mk...@gm...> - 2006-12-21 17:14:13
|
On 12/21/06, Ethan Galstad <na...@na...> wrote:
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I think we
> need to patch Nagios. There are two possible solutions:
>
> 1. Bump up the default buffer slots to something larger. Since Nagios
> only immediately allocates memory for pointers, the additional memory
> overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of
> slots).
>

Since nagios.h is generated by configure, this number could also be derived by the configure script based on the RAM available.

> 2. Moving the slots definitions out to command file variables. This is
> a better solution than having to edit the code and recompile.

Yes, this would be the better solution. Also, could nagiostats display additional information, such as the buffers (command & service) in use (how many messages they currently hold), the number of messages in the pipe (nagios.cmd) and the number of messages in the message queue?

> Thoughts?
>
[snip] |
From: Thomas Guyot-S. <Th...@za...> - 2006-12-21 17:22:07
Attachments:
smime.p7s
|
> From: nag...@li...
> [mailto:nag...@li...] On Behalf
> Of Ethan Galstad
> Sent: December 21, 2006 11:57
> To: Nagios-Devel
> Subject: Re: [Nagios-devel] Problems with many hanging Nagios
> processes (Nagios spawning rogue nagios processes eventually
> crashing Nagios server)
>
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I
> think we need to patch Nagios. There are two possible solutions:
>

Ethan,

Do you think this bug could be the cause of my issue where I lose passive check results under load / when many passive checks come in at the same time?

Thanks,

Thomas |
From: Ethan G. <na...@na...> - 2007-01-03 03:25:45
|
Thomas Guyot-Sionnest wrote:
[snip]
>
> Ethan,
>
> Do you think this bug could be the cause of my issue where I lose passive
> check results under load / when many passive checks come in at the same
> time?
>
> Thanks,
>
> Thomas
>

Hi Thomas -

I think I must have missed your original problem description in the mass of emails that I've been behind on. :-)

I guess you'd have to test the new code to see if that fixes your problem for sure. If I'm correct, you shouldn't lose passive checks under heavy load conditions unless you exhaust physical memory and something like the infamous OOM killer starts killing processes. Under normal situations, passive checks would just be delayed under heavy load.

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org |
From: Andreas E. <ae...@op...> - 2007-01-03 10:04:07
|
Ethan Galstad wrote: > Good work on nailing down the problem to the command buffer slots! > Sounds like this problem might affect a number of users, so I think we > need to patch Nagios. There are two possible solutions: > > 1. Bump up the default buffer slots to something larger. Since Nagios > only immediately allocates memory for pointers, the additional memory > overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of > slots). > > 2. Moving the slots definitions out to command file variables. This is > a better solution than having to edit the code and recompile. > > Thoughts? > 3. Make the number of slots dynamic and allocate memory as needed. It should never release any allocated memory, but just increase the number of buffer slots as needed. One probably wants to allocate the buffer slots in chunks of sysconf(_SC_PAGESIZE) / (sizeof(char *)) to keep it to one page at a time, which will prevent expensive memory copying on realloc(). -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 |
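A minimal sketch of what option 3 could look like (illustrative only, not actual Nagios code): grow the slot array on demand, one page worth of pointers per step, and never shrink it.

#include <stdlib.h>
#include <unistd.h>

static char **slots;
static size_t slots_allocated;   /* current capacity, in slots */
static size_t slots_used;

/* make sure at least one free slot exists; returns 0 on success, -1 on OOM */
int ensure_free_slot(void)
{
        size_t chunk, new_size;
        char **tmp;

        if (slots_used < slots_allocated)
                return 0;

        /* grow by one page of pointers at a time, as suggested above */
        chunk = (size_t)sysconf(_SC_PAGESIZE) / sizeof(char *);
        new_size = slots_allocated + chunk;

        tmp = realloc(slots, new_size * sizeof(char *));
        if (tmp == NULL)
                return -1;

        slots = tmp;
        slots_allocated = new_size;
        return 0;
}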
From: <an...@pr...> - 2006-12-22 12:21:55
|
Hi all,

as mentioned in Ethan's thread about testing the current branch version, I am afraid the problems are not only in the buffers.

I have talked to a colleague of mine while looking at the sources, especially event.c around line 1079:

####
if(run_event==TRUE){

        /* remove the first event from the timing loop */
        temp_event=event_list_low;
        event_list_low=event_list_low->next;

        /* handle the event */
        handle_timed_event(temp_event);   /* <-- this is line 1079 */

        /* reschedule the event if necessary */
        if(temp_event->recurring==TRUE)
                reschedule_event(temp_event,&event_list_low);

        /* else free memory associated with the event */
        else
                free(temp_event);
}
####

The handle_timed_event() function itself starts around line 1154. If I am right, this is the worker part that does everything for nagios: starts checks, gets check results (the reaper), runs freshness checks and everything else.

Does this part work serialized (one event after another) or is it threaded beforehand? If it is serialized, would it be possible to parallelize it?

Does anyone know how long the processing of handle_timed_event() takes? (Just a question up front; I will test it after this mail by compiling with debug3.)

Just my 2 cents.

Best wishes
Hendrik

Ton Voon wrote:
[snip] |
From: Ethan G. <na...@na...> - 2007-01-03 03:31:26
|
Hendrik Bäcker wrote:
> Hi all,
>
> as mentioned in Ethan's thread about testing the current branch version,
> I am afraid the problems are not only in the buffers.
>
> Does this part work serialized (one event after another) or is it
> threaded beforehand? If it is serialized, would it be possible to
> parallelize it?
>
[snip]

Most things in Nagios are performed in a serial fashion. They include event handlers, starting service checks, running the OCSP command, updating the status log, etc. All these actions are kicked off by the handle_timed_event() function, which runs each thing serially.

Although the process of starting service checks is handled serially, the actual execution runs in parallel, and the service check reaper (which collects service check results) runs as its own thread. The processing of service check results is handled in a serial fashion, although this is not a time-intensive process like the actual execution of a check.

The execution of host checks is the huge holdup in Nagios 2.x, and host checks are (for the most part) parallelized in Nagios 3.x, so that will help in the future. Things like event handlers, notifications, etc. are hard to run in parallel while ensuring that certain things happen in a particular, repeatable order.

Hope that helps. It can get more confusing the more you look into things. :-)

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org |
From: Andreas E. <ae...@op...> - 2006-12-19 09:08:52
|
Mahesh Kunjal wrote: > > > We had similar issue. We have a distributed environment with one master and 4 slaves. Total number of hosts monitored are 1900+ and > 20000+ services spread across 4 slaves. > > At times we saw 14K or more results being sent in a second from slaves. This resulted in 100+ nagios processes being created. > > Changed reaper frequency to 2 seconds and played with all tunables. > Nothing seemed to help. > > Looking at the nagios source, > This is what I found out was happening... > > Nagios has a commands file worker thread and when it gets woken up, looks if there is data in pipe(nagios.cmd), if exists, forks a child process. This will be in a loop and checks the pipe for data. > > Now what does the forked nagios child process do? > It reads all the data from the pipe one message a time and puts it in commands buffer. If if is able to write to buffer, just exits. > > The problem here was command buffer had a limited size of 1024. This is the default setting in include/nagios.h.in and is in the line #define COMMAND_BUFFER_SLOTS 1024. This is the number of buffers that will be available for writing into, not the number of total bytes available. Each command buffer slot holds MAX_INPUT_BUFFER bytes. > > This was not enough and the child process started to wait for memory to be freed so that the pipe data retrieved can be put in buffer. > > While this child process waited for memory to be freed, the command worker thread got woken up and realized that there is data in pipe and forked another child. This got repeated and eventually server went out of memory. > A very concise and correct description of what's going on. Thanks. > Here is what we did to resolve. > > 1. Edit the include/nagios.h.in > change > #define COMMAND_BUFFER_SLOTS 1024 > to > #define COMMAND_BUFFER_SLOTS 60000 > > And change > #define SERVICE_BUFFER_SLOTS 1024 > to > #define SERVICE_BUFFER_SLOTS 60000 > This would indeed solve the problem, although you could have gotten away with the same amount of SERVICE_BUFFER_SLOTS as there are services configured on the system, and the same amount of COMMAND_BUFFER_SLOTS as there are hosts and services. Provided the slaves also send passive hostchecks, ofc, otherwise you can set it to the amount of services instead. It should also be noted that these settings shouldn't be modified unless needed, as it will make Nagios use quite a bit more memory per default. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 |
From: Mahesh K. <mk...@gm...> - 2006-12-19 15:36:08
|
On 12/19/06, Andreas Ericsson <ae...@op...> wrote:
> > The problem here was that the command buffer had a limited size of 1024.
> > This is the default setting in include/nagios.h.in, in the line
> > #define COMMAND_BUFFER_SLOTS 1024.
>
> This is the number of buffers that will be available for writing into,
> not the number of total bytes available. Each command buffer slot holds
> MAX_INPUT_BUFFER bytes.

Yes, each message is of MAX_INPUT_BUFFER size, which defaults to 1024. What I was getting at was the number of messages (results) you can write.

> > While this child process waited for memory to be freed, the command
> > worker thread got woken up and realized that there is data in the pipe
> > and forked another child. This got repeated and eventually the server
> > went out of memory.
>
> A very concise and correct description of what's going on. Thanks.

:)

> > Here is what we did to resolve it.
> >
> > 1. Edit include/nagios.h.in
> > change
> > #define COMMAND_BUFFER_SLOTS 1024
> > to
> > #define COMMAND_BUFFER_SLOTS 60000
> >
> > And change
> > #define SERVICE_BUFFER_SLOTS 1024
> > to
> > #define SERVICE_BUFFER_SLOTS 60000
>
> This would indeed solve the problem, although you could have gotten away
> with the same amount of SERVICE_BUFFER_SLOTS as there are services
> configured on the system, and the same amount of COMMAND_BUFFER_SLOTS as
> there are hosts and services. Provided the slaves also send passive
> hostchecks, ofc, otherwise you can set it to the amount of services
> instead.

The customer was planning on adding more services. Right now, at peak we got 14K results in a second, and a reaper frequency of 2 seconds could fill 28K slots in the command buffer. We came up with 60000 just in case nagios is not digesting the buffer fast enough.

> It should also be noted that these settings shouldn't be modified unless
> needed, as it will make Nagios use quite a bit more memory by default.

What I remember from looking at the code is that the child process allocates memory per message read, and only while the number of messages in the buffer is less than COMMAND_BUFFER_SLOTS. My understanding is that Nagios won't allocate all COMMAND_BUFFER_SLOTS slots up front. They are only used if results are coming in at short intervals and/or are not being processed fast enough.

----
Mahesh Kunjal
mk...@gm... |
From: Gaspar, C. <Car...@gs...> - 2006-12-20 00:58:34
|
Mahesh Kunjal wrote:
>
> This was not enough and the child process started to wait for memory to
> be freed so that the pipe data retrieved can be put in buffer.
>
> While this child process waited for memory to be freed, the command
> worker thread got woken up and realized that there is data in pipe and
> forked another child. This got repeated and eventually server went out
> of memory.

This is a bug. A second reader should not be created if a prior reader still exists. Locking is required here.

--
Carson |
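A sketch of the kind of guard Carson is asking for (hypothetical names, not a real patch): remember the PID of the pipe-reading child and refuse to fork another one while it is still alive.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t reader_pid;   /* 0 = no reader has been started yet */

void maybe_spawn_reader(void (*drain_pipe)(void))
{
        /* if a previous reader is still running, do not create a second one */
        if (reader_pid > 0) {
                if (waitpid(reader_pid, NULL, WNOHANG) == 0)
                        return;          /* still alive */
                reader_pid = 0;          /* it has exited and is now reaped */
        }

        reader_pid = fork();
        if (reader_pid == 0) {
                drain_pipe();            /* child: empty nagios.cmd into the command buffer */
                _exit(0);
        }
}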
From: Percy J. <ja...@fg...> - 2007-04-10 11:21:09
|
Gaspar, Carson wrote:
> Mahesh Kunjal wrote:
>
>> This was not enough and the child process started to wait for memory
>> to be freed so that the pipe data retrieved can be put in buffer.
>>
>> While this child process waited for memory to be freed, the command
>> worker thread got woken up and realized that there is data in pipe and
>> forked another child. This got repeated and eventually server went out
>> of memory.
>
> This is a bug. A second reader should not be created if a prior reader
> still exists. Locking is required here.
>

This is a large bug concerning all major installations of nagios. We've located this bug too and are now working on a solution. We would like to solve it by spawning a thread that does the job of the problem-causing processes. I hope a working patch will be available soon.

Best regards
Percy Jahn |
From: Daniel M. <ea...@cy...> - 2006-12-21 13:05:04
|
Hi Ton,

the patch does not work; I think there's a typo here:

+ fprintf(fp,"\tmax_command_buffer_items=%d\n", max_command_buffer_items);
+ fprintf(fp,"\tcurrent_service_buffer=%d\n", service_result_buffer.items);
+ fprintf(fi,"\tmax_service_buffer_items=%d\n", max_service_buffer_items);

Shouldn't it be fprintf(fp... in all lines?

Danny

--
Q: Gentoo is too hard to install   = http://www.cyberdelia.de
   and I feel like whining.        = ea...@cy...
A: Please see /dev/null.           = (from the gentoo installer FAQ)
                                   = \o/ |
From: Ton V. <ton...@al...> - 2006-12-21 16:23:41
|
Hi Daniel,

On 21 Dec 2006, at 13:04, Daniel Meyer wrote:
> Hi Ton,
>
> the patch does not work, I think there's a typo here:
>
> + fprintf(fp,"\tmax_command_buffer_items=%d\n", max_command_buffer_items);
> + fprintf(fp,"\tcurrent_service_buffer=%d\n", service_result_buffer.items);
> + fprintf(fi,"\tmax_service_buffer_items=%d\n", max_service_buffer_items);
>
> Shouldn't it be fprintf(fp... in all lines?

Right you are - sorry. Patch redone.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon |