From: Percy J. <ja...@fg...> - 2007-08-30 13:53:12
Hello Ethan and others,

we are using a redundant Nagios system with keepalived for IP transition. The problem now occurring is that service checks get "lost" and are never scheduled again.

I've located the problem in schedule_service_check(). In case of a keepalived transition, Nagios receives either STOP_EXECUTING_SVC_CHECKS and DISABLE_NOTIFICATIONS or, in the other direction, ENABLE_NOTIFICATIONS and START_EXECUTING_SVC_CHECKS. If Nagios has outstanding checks when it receives "disable notifications", it sets the global status accordingly. reap_service_checks() collects the results of the outstanding, properly scheduled service checks and tries to reschedule each service check via schedule_service_check(). This function exits immediately without rescheduling, because active checks are disabled globally. In the end, the service is lost and is never rescheduled.

check_for_orphaned_services() cannot solve this problem, because the check has already been marked as "not executing/running" by reap_service_checks().

My first solution is to adapt schedule_service_check() to schedule all services (including the inactive ones), but I believe this could break some other stuff. Ethan, could you please take a closer look at this?

I'm using Nagios version 2.6 and checked the Changelog, but nothing concerning my problem is mentioned.

In the meantime I have solved the problem for my case by sending SIGHUP to Nagios whenever a transition occurs.

best regards
Percy Jahn
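
For illustration, the sequence described above can be modelled in a few lines of C. This is only a toy sketch based on the description in this mail, not Nagios source code: the toy_service struct, every toy_* function and the queue_when_disabled switch are hypothetical names made up for the example, and the real schedule_service_check() has more conditions than shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a Nagios service. */
typedef struct {
    bool is_executing;  /* a check for this service is currently running */
    bool is_queued;     /* a future check event exists in the scheduler  */
} toy_service;

/* Global switch, as toggled by STOP/START_EXECUTING_SVC_CHECKS. */
bool toy_execute_service_checks = true;

/* Models the early return: nothing is queued while checks are
 * disabled globally, unless the (assumed) workaround flag is set. */
void toy_schedule_service_check(toy_service *svc, bool queue_when_disabled)
{
    if (!toy_execute_service_checks && !queue_when_disabled)
        return;  /* the service silently drops out of the schedule */
    svc->is_queued = true;
}

/* Models reaping a result: the check is marked as no longer
 * executing, then a reschedule is attempted. */
void toy_reap_service_check(toy_service *svc, bool queue_when_disabled)
{
    assert(svc->is_executing);
    svc->is_executing = false;
    toy_schedule_service_check(svc, queue_when_disabled);
}

/* Orphan detection can only find checks still marked as executing. */
bool toy_is_orphan_candidate(const toy_service *svc)
{
    return svc->is_executing;
}

/* Plays the transition sequence; returns true if the service can
 * still be recovered afterwards (queued, or visible as an orphan). */
bool toy_failover_keeps_service(bool queue_when_disabled)
{
    toy_service svc = { .is_executing = true, .is_queued = false };

    toy_execute_service_checks = false;                 /* STOP_EXECUTING_SVC_CHECKS  */
    toy_reap_service_check(&svc, queue_when_disabled);  /* outstanding result arrives */
    toy_execute_service_checks = true;                  /* START_EXECUTING_SVC_CHECKS */

    return svc.is_queued || toy_is_orphan_candidate(&svc);
}
```

In this model, toy_failover_keeps_service(false) returns false: the service is neither queued nor marked as executing, so re-enabling checks does not bring it back and orphan detection does not see it. Calling it with true corresponds to the proposed change of always rescheduling, and the service stays in the schedule.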