From: Adam A. <aug...@gm...> - 2010-07-30 22:06:15
1) Why would DNX do that?

A - See my other response. It was not the right thing to do. John's change should fix that.

2) Why am I seeing checks that return -1 only when I use DNX?

A - This one is a little more complex and subtle. I don't think it is DNX per se that is doing it. You may see a different return code when running under DNX because the DNX client may be running as a different user from Nagios, because the DNX client is running from a different location in the network, or because DNX may run the check with different environment variables (DNX does not currently pass the environment variables from the Nagios server). Those reasons are just off the top of my head.

Without knowing your setup, and because this seems to be intermittent, one thing we have seen is a particular worker node that, for whatever reason, is configured slightly differently from the others (/etc/resolv.conf didn't have a search domain, ntp.conf had a typo in one of the server IPs, freetds.conf had an entry missing), so a particular plugin behaves differently there than on all the others.

On Fri, Jul 30, 2010 at 5:43 AM, Roger Torrentsgenerós <rto...@fl...> wrote:
>
> I'm with Eric. The more transparent DNX is, the better we'll understand
> what happens when a check fails.
>
> However, the current situation is that DNX "translates" an out of scope
> exit to an exit 3, and passes it to Nagios together with the original
> status message with a prepended "[EC-1]" (which I assume it means exit
> code -1). The human parsing part is useful, but in my case having an
> UNKNOWN state also means triggering an alert, sending an SMS and maybe
> waking someone (probably me) up.
>
> Two questions come into my mind:
>
> - Why would DNX do that?
>
> We also have home-made checks. If one of my boys creates a check that
> returns out of scope codes, I'd like to see it as always have seen it.
> Adding translations is also adding complexity when tracing errors back,
> so I think it's much better DNX simply acts as a messenger and returns
> to Nagios the check result "as is".
>
> - Why am I seeing checks that return -1 only when I use DNX?
>
> I'm seeing one [EC-1] result every 4 or 5 minutes when I use DNX,
> including standard checks that come with the official nagios-plugins
> releases for RHEL5, for example check_ntp. The thing is if I disable
> DNX, I never get -1 status, or UNKNOWN "out of bounds", or whatever. I
> only get false positives when I use DNX.
>
> As John said, it has to be something related to the way DNX fetches exit
> status from the plugins and tries to understand them. I don't know if
> his recent commit will fix something (will try), but I'm pretty sure
> that if DNX simply forwarded whatever it got from the plugin to Nagios,
> instead of trying to "understand" it, we'd eliminate some complexity and
> DNX wouldn't be the one to blame, but the plugin itself.
>
> Cheers.
>
> Roger
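
One way to test the environment-variable explanation from the reply above is to run the same plugin command line twice on the suspect worker node, once with the inherited environment and once with a stripped-down one, and compare the exit codes. The following is only a minimal sketch and is not part of DNX or Nagios: `run_check`, `compare_environments`, and the default minimal `PATH` value are hypothetical names and values chosen for illustration.

```python
import subprocess


def run_check(argv, env=None, timeout=30):
    """Run a plugin command line and return (exit_code, first_output_line).

    env=None means the child inherits this process's environment.
    A subprocess.TimeoutExpired exception is raised if the plugin hangs.
    """
    proc = subprocess.run(
        argv,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    lines = proc.stdout.splitlines()
    return proc.returncode, (lines[0] if lines else "")


def compare_environments(argv, minimal_env=None):
    """Run argv with the inherited environment and with a minimal one."""
    if minimal_env is None:
        # Assumption: a bare PATH stands in for "no variables passed along".
        minimal_env = {"PATH": "/usr/bin:/bin"}
    full = run_check(argv, env=None)
    bare = run_check(argv, env=minimal_env)
    return {
        "full_env": full,
        "minimal_env": bare,
        "same_exit_code": full[0] == bare[0],
    }
```

Calling `compare_environments` with the full path to a plugin such as check_ntp and its usual arguments, as the same user the DNX client runs as, gives both results side by side. If `same_exit_code` is false, the plugin depends on something in the environment; if it is true, the user, network location, or node configuration are the more likely suspects.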