From: Ton V. <ton...@al...> - 2006-12-22 12:30:52
On 22 Dec 2006, at 01:50, Ethan Galstad wrote:

> Based on the recent thread about hanging Nagios processes, I have
> removed the COMMAND_BUFFER_SLOTS and SERVICE_BUFFER_SLOTS definitions
> out to config file variables:
>
> external_command_buffer_slots=4096
> check_result_buffer_slots=4096
>
> I have also updated nagiostats to report the avail/used number of slots
> for graphing in MRTG. Could folks try out the latest 2.x CVS code and
> give it some testing?

Ethan,

Thanks for applying to CVS. Several comments:

- external_command_buffer_slots and check_result_buffer_slots only need to be an int, as the circular_buffer struct only uses an int for items

- in xsddefault.c, when you print out external_command_buffer.items, I think this is not thread-safe. My thread knowledge is pretty limited, so please correct me if I am wrong. The main nagios process writes the status data via xsddefault_save_status_data, which needs to read the external_command_buffer variable. However, this variable is written to by the command_file_worker_thread. So I think the xsddefault_save_status_data routine needs a thread lock on external_command_buffer before it can read the items data, otherwise there is the potential for corrupt data. Note, there is a cost to that, especially if the status data is being written with aggregate_status_updates = 0.

- your output to status.dat is different from mine. You are outputting max_external_command_buffer_slots (the value defined in nagios.cfg) and used_external_command_buffer_slots (the current number of items in the buffer). In my patch, I had a different definition: max_command_buffer_items meant the "maximum number of items that has been in the buffer".

(I would prefer used_external_command_buffer_slots be changed to current_external_command_buffer_slots because it more accurately describes "this is the number I have now".)

From now on, I'll call it high_external_command_buffer_items, as it can also be the "high water mark of the number of items in the buffer". This is a useful statistic as it tells you what max_external_command_buffer_slots should be to get no holdups.

Also, it probably makes sense to put the high water mark within the circular_buffer struct.

Please find a patch attached with these changes.

On my small test system, used_check_result_buffer_slots is usually 0. When I introduce 1 fake slave (128 results per 10 seconds), used_check_result_buffer_slots fluctuates from 0 to the 20s or 30s. Introducing a 2nd fake slave, the high mark moves up to the 100s. A 3rd slave moves the high mark to 192.

If I introduce NDO into the system, I get a large iowait time (in the 80%s), presumably from database writes. The status file is not updated as regularly (one instance of 60 seconds between writes), but when it is, the high_* values jump up to the 200-300s. This is a poorly configured database, so I'm guessing that there are delays due to the main nagios process passing data to the broker module.

At the moment, with 2 slaves sending 128 packets per 10 seconds, I'm getting high values of 983 for external commands and 1405 for check results. I think these recent changes help with seeing if there are bottlenecks at the reading of the command pipe, but I think there are possibly other slowdowns further down the chain (which Nagios 3 may aid with).

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
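
To illustrate two of the suggestions above (taking a lock before reading the item count, and keeping the high water mark inside the buffer struct), here is a minimal sketch in C. It is not the attached patch and not the actual Nagios code: sketch_buffer and all of its functions and field names are hypothetical, and the real circular_buffer struct may be laid out differently. The only assumption is that the writer thread and the status-writing code share one pthread mutex per buffer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a circular buffer; not the
 * real Nagios circular_buffer definition. */
typedef struct sketch_buffer {
    void **slots;          /* storage for queued items */
    int size;              /* configured number of slots */
    int head;              /* index of the next slot to write */
    int tail;              /* index of the next slot to read */
    int items;             /* number of items currently queued */
    int high;              /* high water mark of items */
    pthread_mutex_t lock;  /* guards every field above */
} sketch_buffer;

/* Returns 0 on success, -1 on allocation or mutex failure. */
int sketch_buffer_init(sketch_buffer *b, int size) {
    assert(size > 0);
    b->slots = calloc((size_t)size, sizeof(void *));
    if (b->slots == NULL)
        return -1;
    b->size = size;
    b->head = b->tail = b->items = b->high = 0;
    if (pthread_mutex_init(&b->lock, NULL) != 0) {
        free(b->slots);
        b->slots = NULL;
        return -1;
    }
    return 0;
}

/* Called by the writer (worker) thread. Returns -1 if the buffer is full. */
int sketch_buffer_push(sketch_buffer *b, void *item) {
    int result = -1;
    pthread_mutex_lock(&b->lock);
    if (b->items < b->size) {
        b->slots[b->head] = item;
        b->head = (b->head + 1) % b->size;
        b->items++;
        if (b->items > b->high)
            b->high = b->items;  /* track the high water mark */
        result = 0;
    }
    pthread_mutex_unlock(&b->lock);
    return result;
}

/* Called by the consumer. Returns NULL if the buffer is empty. */
void *sketch_buffer_pop(sketch_buffer *b) {
    void *item = NULL;
    pthread_mutex_lock(&b->lock);
    if (b->items > 0) {
        item = b->slots[b->tail];
        b->tail = (b->tail + 1) % b->size;
        b->items--;
    }
    pthread_mutex_unlock(&b->lock);
    return item;
}

/* Called by the status writer: copy both counters while holding the
 * lock, so they form a consistent snapshot taken between writer updates. */
void sketch_buffer_stats(sketch_buffer *b, int *current, int *high) {
    pthread_mutex_lock(&b->lock);
    *current = b->items;
    *high = b->high;
    pthread_mutex_unlock(&b->lock);
}

void sketch_buffer_destroy(sketch_buffer *b) {
    pthread_mutex_destroy(&b->lock);
    free(b->slots);
    b->slots = NULL;
}
```

The status writer holds the lock only long enough to copy two ints, which keeps the cost mentioned above small even when status data is written frequently.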