From: Ben <be...@si...> - 2005-05-23 15:02:35
I've been ignoring most of this thread, so maybe I'm not adding anything here, but I thought I'd comment on this statement:

On May 23, 2005, at 5:14 AM, Andreas Ericsson wrote:
>
> That said, the current bottleneck in Nagios appears to be the fact
> that it runs checks in chunks rather than as standalone units which
> can be picked up as they become eligible for checking. If that
> little snag could be overcome, I'm confident that the
> aforementioned average check latency of 25 seconds could be done
> away with.
>

This is misleading. In my experience, Nagios doesn't run checks in chunks. It *does* kick off as many concurrent checks as you tell it to (assuming there are things that need to be checked), but if results come in while it's still trying to kick off more checks, it stops doing that so it can process the new results.

Because similar checks tend to be started at similar times and take similarly long to run, it *appears* as if Nagios kicks off a batch of checks, waits a while, then kicks off some more. In actuality, it's processing the results of the first batch before it does anything else, and the batch size is defined by how long it takes from when the first check is started until the first result comes in.

One possible way to speed this up is to trade in the rather simple current model of "we can't initiate checks if we've got pending results, because those results might alter what we need to check" for the much more complex (but scalable and possibly more correct) model of "we can't send more checks that depend on the results of what we currently have outstanding checks for, but if we want to check unrelated services, not a problem."

It seems to me that would help an awful lot, assuming it was bug-free, but it's also a pretty fundamental change to Nagios' scheduler.
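
To make the difference between the two models concrete, here is a rough Python sketch. It is not Nagios code: the `Check` class, both function names, and the host/service names are made up for illustration, and for simplicity it treats "pending results" and "outstanding checks" as the same set of check names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical stand-in for a scheduled check (not a Nagios structure)."""
    name: str
    depends_on: frozenset = frozenset()  # names of checks whose results this one needs


def dispatchable_simple(due, outstanding):
    """Simple model: start nothing new while any result is still pending."""
    return [] if outstanding else list(due)


def dispatchable_dependency_aware(due, outstanding):
    """Proposed model: hold back only checks that depend on an outstanding check."""
    return [c for c in due if not (c.depends_on & outstanding)]


# Three checks are due; the ping check for web01 is still outstanding.
due = [
    Check("web01/http", frozenset({"web01/ping"})),
    Check("db01/mysql", frozenset({"db01/ping"})),
    Check("mail01/smtp"),
]
outstanding = {"web01/ping"}

simple = [c.name for c in dispatchable_simple(due, outstanding)]
aware = [c.name for c in dispatchable_dependency_aware(due, outstanding)]

print("simple model starts:", simple)
print("dependency-aware model starts:", aware)
```

With one check outstanding, the simple model starts nothing, while the dependency-aware model holds back only `web01/http` and lets the unrelated checks go. The hard part in a real scheduler would be keeping the dependency information correct as results change what needs checking, which this sketch ignores entirely.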