[Nagios-devel] Reproducible bug scheduling second host check following failure
Nagios network monitoring software is enterprise server monitoring
Brought to you by:
egalstad,
sawolf-nagios
From: Matthew R. <mat...@it...> - 2007-12-29 18:11:46
|
I have, for some time, been suspicions that Nagios 3 appeared a bit slow alerting on host failures. In order to diagnose this, I setup a minimal config to try to show the problem using 3.0rc1 with a default nagios.cfg. It consists of one host and one service (both checked by ping), with the retries and timers set as below (the config file is attached in full):- |define host{ | host_name hostname | check_interval 10 | retry_interval 1 | max_check_attempts 5 |} |define service{ | max_check_attempts 5 | normal_check_interval 5 | retry_check_interval 5 | host_name hostname | service_description ping |} The host was up at 16:25 with the next host check scheduled for 16:35 and the next service check for 16:30:- |[2007-12-29 16:25:42] [1198945542.102847] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:30:32 2007 |[2007-12-29 16:25:52] [1198945552.201392] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=1/5, Type=HARD, Final State=0 |[2007-12-29 16:25:52] [1198945552.201763] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:35:52 2007 The host was then unplugged and the service was checked as expected at 16:30. As the service was down, the host (again as expected) was checked. The next host check was scheduled for 16:32 as expected with the host retry_interval of 1 minute:- |[2007-12-29 16:30:32] [1198945832.178688] [016.0] [pid=3660] Attempting to run scheduled check of service 'ping' on host 'hostname': check options=0, latency=0.178000 |[2007-12-29 16:30:52] [1198945852.109282] [016.1] [pid=3660] HOST: hostname, SERVICE: ping, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CRITICAL - Host Unreachable (m18.itconsult.net) |[2007-12-29 16:30:52] [1198945852.109557] [016.1] [pid=3660] Service is in a non-OK state |[2007-12-29 16:30:52] [1198945852.109579] [016.1] [pid=3660] Host is currently UP, so we'll recheck its state to make sure... |[2007-12-29 16:30:52] [1198945852.109626] [016.0] [pid=3660] ** Running async check of host 'hostname'... |[2007-12-29 16:30:52] [1198945852.127798] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:35:32 2007 |[2007-12-29 16:31:02] [1198945862.209380] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=1/5, Type=SOFT, Final State=1 |[2007-12-29 16:31:02] [1198945862.210709] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:32:02 2007 |[2007-12-29 16:31:02] [1198945862.210735] [016.2] [pid=3660] Found another host check event for this host... |[2007-12-29 16:31:02] [1198945862.210760] [016.2] [pid=3660] Keeping original host check event (ignoring the new one). |[2007-12-29 16:31:02] [1198945862.210831] [016.1] [pid=3660] ** Async check result for host 'hostname' handled: new state=1 However, although the CGIs continued to show the next host check for 16:32, nothing happened until:- |[2007-12-29 16:35:32] [1198946132.119030] [016.0] [pid=3660] Attempting to run scheduled check of service 'ping' on host 'hostname': check options=0, latency=0.118000 |[2007-12-29 16:35:42] [1198946142.223011] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:40:32 2007 |[2007-12-29 16:35:52] [1198946152.053988] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.053000 |[2007-12-29 16:35:52] [1198946152.054042] [016.0] [pid=3660] ** Running async check of host 'hostname'... |[2007-12-29 16:36:02] [1198946162.152677] [016.2] [pid=3660] Host has no parents, so it is DOWN. |[2007-12-29 16:36:02] [1198946162.152699] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=2/5, Type=SOFT, Final State=1 |[2007-12-29 16:36:02] [1198946162.154051] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:37:02 2007 Thereafter, the remaining host checks were executed at the correct retry_interval of 1 minute:- |[2007-12-29 16:37:02] [1198946222.148471] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.148000 |[2007-12-29 16:37:12] [1198946232.008071] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=3/5, Type=SOFT, Final State=1 |[2007-12-29 16:37:12] [1198946232.009395] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:38:12 2007 |[2007-12-29 16:38:12] [1198946292.239641] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.239000 |[2007-12-29 16:38:22] [1198946302.086426] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=4/5, Type=SOFT, Final State=1 |[2007-12-29 16:38:22] [1198946302.087774] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:39:22 2007 |[2007-12-29 16:39:22] [1198946362.069056] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.068000 |[2007-12-29 16:39:32] [1198946372.181915] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=5/5, Type=HARD, Final State=1 |[2007-12-29 16:39:32] [1198946372.184009] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:49:32 2007 Having now prepared this bug report, I suspect that the problem relates to:- |[2007-12-29 16:31:02] [1198945862.210735] [016.2] [pid=3660] Found another host check event for this host... |[2007-12-29 16:31:02] [1198945862.210760] [016.2] [pid=3660] Keeping original host check event (ignoring the new one). which causes the previous host check:- |[2007-12-29 16:25:52] [1198945552.201763] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:35:52 2007 incorrectly to take precedence over:- |[2007-12-29 16:31:02] [1198945862.210709] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:32:02 2007 This is completely reproducible, although I did adjust some of the timers & retries to make it easier to demonstrate. In case it is relevant, I have been testing on Centos 4.6. Finally, just in case they are useful, the full debug logs are available from: http://user.itconsult.co.uk/temp/itconsult1229a.tar.gz Best wishes, Matthew |