RE: [Nagios-db-devel] performance query

> -----Original Message-----
> From: Ben [mailto:be...@si...]
> Sent: 28 April 2005 16:09
> To: Dan Hopkins
> Cc: 'nag...@li...'
> Subject: Re: [Nagios-db-devel] performance query

<snip>

> There are two potential gotchas. First, the postgres nagios-db uses
> snapshot materialized views, meaning they refresh periodically, not in
> realtime. I have pretty beefy hardware, so I can set mine to refresh
> every minute, but for a tool that's supposed to tell you when things
> are down, I think updating every minute is pretty overkill. Once every
> 5 minutes would probably be just fine.

This is my next step too, although I'm using MySQL, so in place of pgsql's
nice materialized views I'll either hand-roll snapshot tables :) or try
replication to a reader db. I'd intended to refresh the snapshots fairly
frequently, a couple of times a minute (we're only talking a few thousand
rows at the moment, so it's no big deal to the db), but I've not done any
testing yet, so this could be optimistic considering there are multiple
nagios instances running off the same db backend hosts. We shall see...
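
For anyone wondering what a hand-rolled snapshot table might look like,
here is a minimal sketch of the build-then-swap idea. It is not the
nagios-db schema: the table names and columns are hypothetical, and it
uses Python's bundled sqlite3 module only so that it runs anywhere. On
MySQL the swap step would more likely be a single
RENAME TABLE old TO retired, new TO old statement.

```python
import sqlite3

# Hypothetical table names, for illustration only (not the nagios-db schema).
SOURCE = "servicestatus"
SNAPSHOT = "servicestatus_snapshot"


def refresh_snapshot(conn):
    """Copy the live table into a staging table, then swap it into place.

    Readers query SNAPSHOT and never touch the busy SOURCE table.
    """
    staging = SNAPSHOT + "_new"
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {staging}")
        conn.execute(f"CREATE TABLE {staging} AS SELECT * FROM {SOURCE}")
        conn.execute(f"DROP TABLE IF EXISTS {SNAPSHOT}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {SNAPSHOT}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def demo():
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT above controls the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(f"CREATE TABLE {SOURCE} (service TEXT PRIMARY KEY, state TEXT)")
    conn.executemany(
        f"INSERT INTO {SOURCE} VALUES (?, ?)",
        [("http", "OK"), ("smtp", "OK")],
    )
    refresh_snapshot(conn)

    # The live table changes, but the snapshot stays stale until refreshed.
    conn.execute(f"UPDATE {SOURCE} SET state = 'CRITICAL' WHERE service = 'smtp'")
    stale = dict(conn.execute(f"SELECT service, state FROM {SNAPSHOT}"))

    refresh_snapshot(conn)
    fresh = dict(conn.execute(f"SELECT service, state FROM {SNAPSHOT}"))
    conn.close()
    return stale, fresh


if __name__ == "__main__":
    print(demo())
```

A cron job or a small loop would call refresh_snapshot() at whatever
interval the db can tolerate; the trade-off is that readers see data up
to one interval old.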

> The other potential gotcha is that nagios has to write the value of
> each check into the db to complete a check, and if the db is slammed,
> this will take an increasing amount of time. This was causing me very
> bad latency problems (thousands of seconds) until recently, when I
> implemented some thread pool action in the NEB and now my latency times
> tend to hover around 15 seconds.
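
The general idea behind a thread pool here, as I understand it, is that
the check path only queues a result and a few worker threads do the slow
db writes. The sketch below is not Ben's NEB code (which I haven't seen);
it is a Python illustration, and WriterPool, write_fn and the fake
slow_write are all made-up names.

```python
import queue
import threading
import time


class WriterPool:
    """Hand results to worker threads so the caller never waits on the db."""

    def __init__(self, write_fn, workers=4):
        self._queue = queue.Queue()
        self._write = write_fn  # hypothetical callable that does the db write
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def submit(self, result):
        # Returns immediately; the write happens on a worker thread.
        self._queue.put(result)

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: time to shut down
                    return
                self._write(item)
            finally:
                self._queue.task_done()

    def close(self):
        # Wait for queued writes to finish, then stop every worker.
        self._queue.join()
        for _ in self._threads:
            self._queue.put(None)
        for thread in self._threads:
            thread.join()


def demo(count=20):
    written = []
    lock = threading.Lock()

    def slow_write(result):
        time.sleep(0.01)  # stand-in for a slow db insert
        with lock:
            written.append(result)

    pool = WriterPool(slow_write, workers=4)
    for i in range(count):
        pool.submit(i)
    pool.close()
    return sorted(written)


if __name__ == "__main__":
    print(len(demo()), "results written")
```

Note that this only hides write latency from the check path; if the db
cannot keep up on average, the queue just grows.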

This is the sort of issue that's hitting us: nagios is so slow writing its
updates in aggregated mode (we're talking 10-20 seconds for fewer than 4000
services in the worst case) that some of our scripts (in-house customised
PHP replacements) just hang on queries waiting for the locks to free. And
too many users accessing the scripts cause nagios to lag obtaining locks to
dump its updates. Enter non-aggregated updates... and the massive jump in
load on the nagios host. Still, the user-facing scripts appear quicker at
least ;) But it did make me wonder how the NEBs fare with thousands of
status updates. Now I see from another thread you've got over 8k services
on the go? That's promising stuff. Are you distributing this over multiple
nagios hosts or a single centralised one?

> That thread pool-enabled version of nagios still lives in CVS only; 
> once I write some documentation, I'll be making another release.

I look forward to having a play with this.

Thanks for the info,
Dan