Re: [Nagios-devel] NDO utils bug/explanation

Hi there,
Frassinelli, Marco wrote the following on 18.09.2009 12:25:
> Hi,
> I think this is not correct behavior, because visualization software 
> such as nagvis and others uses the table programstatus to check if 
> nagios is running.
Starting ndo2db first and then Nagios makes sure that there is current 
data within the DB. If there is outdated data within the DB, it needs to 
be removed before Nagios even sends new data. So trimming those table 
entries at the beginning is truly intentional (the so-called pre-launch 
state, where the if condition matches). If ndo2db fails for some reason, 
that data will remain within the database and then be removed during 
the next start.
>
> I saw that often this table is empty.
Depending on your startup routine, I would guess that you started Nagios 
first and then ndo2db. But even then the table shouldn't be empty, 
because ndomod as an event broker keeps data not yet written to ndo2db 
in a defined cache. Depending on your configuration this cache may be 
too small, so the oldest entry could be lost (in this case the 
programstatus of Nagios). But that's really a guess. You'll have to give 
more information on where and when this error occurs, mentioning all 
circumstances you can find in the logs (turn on the most detailed level 
and everything in debug_level if needed).
> Code calculates the difference between now() and status_update_time. 
> If the record is null this difference is far more than the 
> configurable interval, typically 180 sec.
Which code and which configuration?
The only thing I can see here is tstamp.tv_sec, which is a converted 
timestamp received from the eventbroker module. This is a kind of now(), 
but more precisely a recent now() from Nagios itself. You may check 
ndo2db.c:ndo2db_convert_standard_data_elements
The other compared value is dbinfo.latest_realtime_data_time, which is 
initialized in db.c:ndo2db_db_init and then updated if 
dbinfo.latest_program_status_time is newer (db.c:374; it is set directly 
to that value). There are several other realtime data values which may 
update this value.
So the point of this check is: if the current Nagios NDO_DATA_TIMESTAMP 
is newer than the latest realtime data received some time before, it is 
time for a cleanup at the very beginning of ndo2db (check the sequence 
in ido2db.c:main).
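
To make the check described above easier to follow, here is a minimal 
sketch in C. It is not the NDOUtils code: the function names 
note_realtime_time and should_trim_realtime_tables are made up for 
illustration, and only the two compared values (the timestamp from 
Nagios and latest_realtime_data_time) plus the pre-launch condition are 
taken from this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: remember the newest realtime timestamp seen so far.
 * Mirrors the idea that latest_realtime_data_time is only moved forward
 * when a newer value (e.g. the latest program status time) shows up. */
void note_realtime_time(time_t *latest_realtime_data_time, time_t candidate)
{
    assert(latest_realtime_data_time != NULL);
    if (candidate > *latest_realtime_data_time)
        *latest_realtime_data_time = candidate;
}

/* Hypothetical helper: decide whether the realtime tables get trimmed.
 * Both conditions must hold:
 *   1. the process event is the pre-launch state, and
 *   2. the timestamp sent by Nagios is not older than the latest realtime
 *      data already known (a difference of 0 or more). */
bool should_trim_realtime_tables(bool is_pre_launch,
                                 time_t nagios_tstamp_sec,
                                 time_t latest_realtime_data_time)
{
    return is_pre_launch && nagios_tstamp_sec >= latest_realtime_data_time;
}
```

With such a check, a timestamp comparison alone never triggers a 
cleanup; it only does so together with a pre-launch event.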
>
> The problem is that this difference suddenly varies from near 0 to 
> infinity.
The conditional statement does not only require a difference of 0 or 
more, but also that it is a process pre-launch (see above). But one 
question besides that: how did you get these values? Current NDOUtils 
code doesn't give any debug information at this stage.
>
> Perhaps this is a problem in my ndo setup, and those deletes normally 
> occurs rarely. But I saw them every 60 seconds.
> Here ndo2db log:
>  
> As you can see the ndo2db pid varies, I think that when it has no more 
> data the child exits, and a new one is forked. The new child then 
> deletes records in db.
Seeing your ndo2db die and refork explains why the pre_launch_state and 
timestamp condition matches, so the database cleanup is performed again 
after each of those periods.
It would be interesting to know why ndo2db is dying. Depending on your 
configuration this may vary: tcp or unix socket, for example? What about 
more detailed debug logs, or are there messages like "error writing to 
datasink" in the logs?

Kind regards,
Michael

-- 
DI (FH) Michael Friedrich
mic...@un...
Tel: +43 1 4277 14359
Vienna University Computer Center
Universitaetsstrasse 7
A-1010 Vienna, Austria
