RE: [Nagios-db-devel] performance query

On Thu, 28 Apr 2005, Dan Hopkins wrote:

> > The other potential gotcha is that nagios has to write the value of 
> > each check into the db to complete a check, and if the db is slammed, 
> > this will take an increasing amount of time. This was causing me very 
> > bad latency problems (thousands of seconds) until recently, when I 
> > implemented some thread pool action in the NEB and now my latency times
> > tend to hover around 15 seconds.
> 
> This is the sort of issue that's hitting us: nagios is so slow writing its
> updates in aggregated mode (we're talking 10-20 seconds for fewer than 4000
> services in the worst case) that some of our scripts (in-house customised PHP
> replacements) just hang on queries waiting for the locks to free. And too
> many users accessing the scripts cause nagios to lag obtaining locks to dump
> its updates. Enter non-aggregated updates .... and the massive jump in load
> on the nagios host. Still, the user-facing scripts appear quicker at least
> ;) But it did make me wonder how the NEBs fare with thousands of status
> updates. Now I see from another thread you've got over 8k services on the go?
> That's promising stuff. Are you distributing this over multiple nagios
> hosts or a single centralised one?

I've got a single dual Xeon with 2.5GB of RAM running nagios and making
all the active checks (I only have active checks). It does aggregate writes
to a RAM disk for the nagios logs. I've got another dual Xeon with 4GB of
RAM and a fast SCSI RAID holding the db. And I've got a third wimpy box
running the PHP UI.

My problem is the way nagios' scheduler works. I don't know if you're
familiar with it, so let me just tell you how it works. Everything nagios
does gets scheduled and piped through an event queue. There are low
priority events (like kicking off checks at a specified time) and high
priority events (like reaping outstanding check results). Well, all the
high priority events get handled before any low priority ones. So what
ends up happening is that nagios kicks off low priority checks until the
results start coming back, and then it spends a lot of time (10 seconds at
least) processing those results. Nagios-DB only makes this worse. During
those 10 seconds, no more events are kicked off, meaning things get
progressively worse.
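
To make the starvation effect concrete, here is a toy model of a two-level
event queue. It is only a sketch, not nagios code: the function name
simulate, the one-check-per-tick arrival rate, and the parameters
reap_every and reap_cost are all assumptions made up for illustration. The
only thing taken from the description above is the rule that pending high
priority events always run before low priority ones.

```python
from collections import deque


def simulate(ticks, reap_every, reap_cost):
    """Toy scheduler: returns (checks started, checks still waiting).

    Hypothetical model, not nagios internals:
    - one low priority "start a check" event becomes due every tick
    - one high priority "reap results" event becomes due every reap_every ticks
    - starting a check costs 1 tick, reaping costs reap_cost ticks
    """
    high = deque()   # high priority events (result reaping)
    low = deque()    # low priority events (kicking off checks)
    started = 0
    busy_until = 0   # the scheduler is single-threaded and busy until this tick

    for now in range(ticks):
        # New events become due whether or not the scheduler is free.
        low.append(("check", now))
        if now % reap_every == 0:
            high.append(("reap", now))

        if now < busy_until:
            continue  # still processing an earlier reap

        if high:
            # High priority always wins, however long the low queue is.
            high.popleft()
            busy_until = now + reap_cost
        elif low:
            low.popleft()
            started += 1

    return started, len(low)


if __name__ == "__main__":
    for cost in (1, 5, 10):
        started, waiting = simulate(ticks=100, reap_every=10, reap_cost=cost)
        print(f"reap_cost={cost}: started={started} waiting={waiting}")
```

In this model, once reaping takes as long as the interval between reaps, no
checks get started at all and the backlog grows without bound, which is the
"progressively worse" behaviour described above. Anything that lengthens the
reap step (such as a synchronous db write per result) pushes the model toward
that point.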

> > That thread pool-enabled version of nagios still lives in CVS only; 
> > once I write some documentation, I'll be making another release.
> 
> I look forward to having a play with this.

Just to clarify, I meant that a thread pool-enabled version of the postgres
nagios-db NEB is in CVS. :)
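
For anyone unfamiliar with the general idea, the sketch below shows what
handing db writes to a thread pool looks like. It is written in Python purely
for illustration and is not the code in CVS: the names write_to_db,
on_check_result and run are hypothetical, and the sleep stands in for a slow
INSERT. The point is only that the result handler returns as soon as the
write is queued, rather than waiting for the db.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def write_to_db(result, written):
    """Stand-in for a slow db write (hypothetical; no real database here)."""
    time.sleep(0.01)
    written.append(result)  # list.append is atomic in CPython


def on_check_result(pool, result, written):
    """Called per check result: queue the write and return immediately."""
    return pool.submit(write_to_db, result, written)


def run(n_results, workers):
    """Return (seconds spent handing off, seconds until all writes finished,
    list of results written)."""
    written = []
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(n_results):
            on_check_result(pool, i, written)
        handoff = time.monotonic() - start
        # Leaving the with-block waits for all queued writes to complete.
    total = time.monotonic() - start
    return handoff, total, written


if __name__ == "__main__":
    handoff, total, written = run(n_results=200, workers=8)
    print(f"handoff {handoff:.3f}s, all writes done after {total:.3f}s, "
          f"{len(written)} rows")
```

The trade-off in a design like this is that writes can complete out of order
and pile up in the pool's queue if the db stays slow, so it moves the latency
out of the scheduler's path rather than removing the db load.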