From: Thomas Guyot-S. <Th...@za...> - 2006-08-25 19:05:07
I've been running a fairly big Nagios setup (600+ checks) for a few years now. The only issue so far is some lost passive checks under load (I posted about it some time ago; it was dismissed as a non-issue, which I don't think it is).

Some time ago (Aug 16, to be precise) I noticed there was an --enable-nanosleep option, so I tried it to see if it would help with the passive check problem. I couldn't see any change in performance or passive check reliability, but I did run into an issue. Twice since then I have found that Nagios had stopped running active checks and processing passive checks, leaving me with a stale daemon that monitored nothing while still showing everything as fine.

The first time was not long after the nanosleep change, right after a restart, so I dismissed it as an odd startup bug. The second time was today (Nagios had been running fine for 2-3 days; the last restart was for a config change). For no apparent reason it stopped running checks. Running check_nagios manually showed that the status file wasn't being updated and that the process count was oscillating between 3 and 6.

I'm running nagios-2-x-cvs (2006-07-07 10:11:49); the last commit was for a bug I reported:

* Bug fix for segfault during startup due to extended service definition duplication

Here are the last entries in the log (edited, newest first). Service X is a custom script scheduled to run every 5 minutes on some servers and reporting through send_nsca:

[2006-08-25 13:46:47] Caught SIGHUP, restarting... <--- ME RESTARTING NAGIOS (STALE)
Informational Message[2006-08-25 13:15:20] Auto-save of retention data completed successfully.
Service Ok[2006-08-25 13:15:13] SERVICE ALERT: hostx.example.com;Service X;OK;SOFT;2;OK: Everything looks fine
Service Ok[2006-08-25 13:15:13] SERVICE ALERT: hosty.example.com;Service X;OK;SOFT;2;OK: Everything looks fine
Service Critical[2006-08-25 13:11:29] SERVICE ALERT: hosty.example.com;Service X;CRITICAL;SOFT;1;CRITICAL: Didn't recieved Service X results.
Service Critical[2006-08-25 13:11:29] SERVICE ALERT: hostx.example.com;Service X;CRITICAL;SOFT;1;CRITICAL: Didn't recieved Service X results.
Informational Message[2006-08-25 13:11:20] Warning: The results of service 'Service X' on host 'hosty.example.com' are stale by 47 seconds (threshold=330 seconds). I'm forcing an immediate check of the service.
Informational Message[2006-08-25 13:11:20] Warning: The results of service 'Service X' on host 'hostx.example.com' are stale by 48 seconds (threshold=330 seconds). I'm forcing an immediate check of the service.
Informational Message[2006-08-25 13:10:20] Auto-save of retention data completed successfully.
Informational Message[2006-08-25 13:05:20] Auto-save of retention data completed successfully.
Informational Message[2006-08-25 13:00:21] Auto-save of retention data completed successfully.

Thanks,
Thomas