|
From: Detrak <de...@ca...> - 2007-10-24 08:33:59
|
Hi,

We run Nagios 2.9 on several servers, with NDO 1.4b5 and perf2rrd (Nagios writes performance data to a pipe file and perf2rrd stores it in RRD files). This server runs RHEL4 with packages from dag.wieers.com and monitors 80 hosts and 420 services.

We see some huge gaps in our graphs. perf2rrd itself works fine. My first investigation turned up these messages in the nagios.log file:

    [1193178252] ndomod: Error writing to data sink! Some output may get lost...
    [1193178268] ndomod: Successfully reconnected to data sink! 0 items lost, 240 queued items to flush.
    [1193178269] ndomod: Successfully flushed 240 queued items to data sink.
    [1193187298] Warning: A system time change of 8729 seconds (forwards in time) has been detected. Compensating...
    [1193190553] Warning: A system time change of 3255 seconds (forwards in time) has been detected. Compensating...

We have recompiled Nagios with these debug options:

    --enable-DEBUG2   shows warning messages
    --enable-DEBUG3   shows scheduled events

We don't use DEBUG0 because it generates too much information and the log file grows too fast.

I then found the following in the debug output, around the last gap:

    *** Event Check Loop ***
    Current time: Wed Oct 24 00:29:29 2007
    Next High Priority Event Time: Wed Oct 24 00:29:30 2007
    Next Low Priority Event Time: Wed Oct 24 00:29:29 2007
    Current/Max Outstanding Service Checks: 19/65
    *** Event Details ***
    Event time: Wed Oct 24 00:29:29 2007
    Event type: 0 (service check)
    Service Description: LOAD_AVERAGE@LOADAVERAGE
    Associated Host: SGBD1
    Checking service 'LOAD_AVERAGE@LOADAVERAGE' on host 'SGBD1'...
    *** Event Check Loop ***
    Current time: Wed Oct 24 00:29:29 2007
    Next High Priority Event Time: Wed Oct 24 00:29:30 2007
    Next Low Priority Event Time: Wed Oct 24 00:29:29 2007
    Current/Max Outstanding Service Checks: 20/65
    *** Event Details ***
    Event time: Wed Oct 24 00:29:29 2007
    Event type: 0 (service check)
    Service Description: LOAD_AVERAGE@LOADAVERAGE
    Associated Host: INTEG
    Checking service 'LOAD_AVERAGE@LOADAVERAGE' on host 'INTEG'...
    Warning: A system time change of 8729 seconds (forwards in time) has been detected. Compensating...
    *** Event Check Loop ***
    Current time: Wed Oct 24 02:54:58 2007
    Next High Priority Event Time: Wed Oct 24 02:54:59 2007
    Next Low Priority Event Time: Wed Oct 24 02:54:58 2007
    Current/Max Outstanding Service Checks: 21/65
    *** Event Details ***
    Event time: Wed Oct 24 02:54:58 2007
    Event type: 0 (service check)
    Service Description: MONITOR_TELNET_SUIVI_PS
    Associated Host: PREPROD1
    Checking service 'MONITOR_TELNET_SUIVI_PS' on host 'PREPROD1'...
    Warning: A system time change of 3255 seconds (forwards in time) has been detected. Compensating...
    *** Event Check Loop ***
    Current time: Wed Oct 24 03:49:13 2007
    Next High Priority Event Time: Wed Oct 24 03:49:14 2007
    Next Low Priority Event Time: Wed Oct 24 03:49:13 2007
    Current/Max Outstanding Service Checks: 22/65
    *** Event Details ***
    Event time: Wed Oct 24 03:49:13 2007
    Event type: 0 (service check)
    Service Description: MONITOR_TELNET_SUIVI_PS
    Associated Host: BIDS15
    Checking service 'MONITOR_TELNET_SUIVI_PS' on host 'BIDS15'...

We can see the jumps from 00:29:29 to 02:54:58 and from 02:54:58 to 03:49:13, with no activity in Nagios in between! I don't understand this.

Can you give me some help in making this Nagios server more stable? I don't know how to reproduce this bug. At the time each gap occurred, the server's clock was correct. We get more than one gap per day on this server!

Best regards,
Olivier
|
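For anyone comparing the two log excerpts, here is a minimal Python sketch that checks whether the "system time change" sizes reported in nagios.log equal the gaps between consecutive "Current time" values in the debug output. The log lines and times are copied from the message. The helper names (`parse_time_changes`, `loop_gaps`) and the regular expression are my own illustration and are not part of Nagios. The sketch also assumes the default English day and month names when parsing dates.

```python
import re
from datetime import datetime

# The two time-change warnings quoted from nagios.log in the message above.
LOG_LINES = [
    "[1193187298] Warning: A system time change of 8729 seconds "
    "(forwards in time) has been detected. Compensating...",
    "[1193190553] Warning: A system time change of 3255 seconds "
    "(forwards in time) has been detected. Compensating...",
]

# "Current time" values of consecutive Event Check Loop entries
# around the gaps, copied from the debug output.
LOOP_TIMES = [
    "Wed Oct 24 00:29:29 2007",
    "Wed Oct 24 02:54:58 2007",
    "Wed Oct 24 03:49:13 2007",
]

# Hypothetical pattern written for this sketch. Only "forwards" appears
# in the excerpt; \w+ is used so that other wording would still parse.
WARNING_RE = re.compile(
    r"^\[(\d+)\] Warning: A system time change of (\d+) seconds \((\w+) in time\)"
)


def parse_time_changes(lines):
    """Return (epoch, seconds, direction) for each time-change warning."""
    changes = []
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            epoch, seconds, direction = match.groups()
            changes.append((int(epoch), int(seconds), direction))
    return changes


def loop_gaps(times):
    """Return the gaps, in seconds, between consecutive loop timestamps."""
    parsed = [datetime.strptime(t, "%a %b %d %H:%M:%S %Y") for t in times]
    return [
        int((later - earlier).total_seconds())
        for earlier, later in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    reported = [seconds for _, seconds, _ in parse_time_changes(LOG_LINES)]
    observed = loop_gaps(LOOP_TIMES)
    for rep, obs in zip(reported, observed):
        status = "match" if rep == obs else "MISMATCH"
        print(f"reported {rep} s, observed {obs} s: {status}")
```

With the values from the message, both pairs agree: 8729 s is 2 h 25 min 29 s (00:29:29 to 02:54:58) and 3255 s is 54 min 15 s (02:54:58 to 03:49:13). So each reported "time change" is exactly the interval between two consecutive passes of the event loop. This check alone does not show whether the clock moved or the process was stalled during that interval.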