From: Roger T. <rto...@fl...> - 2010-07-30 11:43:39
I'm with Eric. The more transparent DNX is, the better we'll understand what happens when a check fails.

However, the current situation is that DNX "translates" an out-of-scope exit code to an exit 3 and passes it to Nagios together with the original status message, with "[EC-1]" prepended (which I assume means exit code -1). The human-parsing part is useful, but in my case an UNKNOWN state also means triggering an alert, sending an SMS and maybe waking someone (probably me) up.

Two questions come to mind:

- Why would DNX do that? We also have home-made checks. If one of my boys creates a check that returns out-of-scope codes, I'd like to see it as I always have. Adding translations also adds complexity when tracing errors back, so I think it's much better if DNX simply acts as a messenger and returns the check result to Nagios "as is".

- Why am I seeing checks that return -1 only when I use DNX? I'm seeing one [EC-1] result every 4 or 5 minutes when I use DNX, including from standard checks that come with the official nagios-plugins releases for RHEL5, for example check_ntp. The thing is, if I disable DNX, I never get a -1 status, or UNKNOWN "out of bounds", or whatever. I only get false positives when I use DNX.

As John said, it has to be something related to the way DNX fetches the exit status from the plugins and tries to understand it. I don't know if his recent commit will fix anything (I will try it), but I'm pretty sure that if DNX simply forwarded whatever it got from the plugin to Nagios, instead of trying to "understand" it, we'd eliminate some complexity, and the one to blame would be the plugin itself rather than DNX.

Cheers.
Roger

On Thu, 2010-07-29 at 15:13 -0600, Eric Schoeller wrote:
> John,
>
> Yep, that's exactly where I was heading with this. Now that I
> understand what DNX is doing, I'm torn between the options.
> Having DNX send the exact return code would be more consistent with how
> Nagios works in general, as well as how other passive distributed
> monitoring methods work. Do you know if there was a specific reason why
> DNX was designed to behave this way? Would changing the code break
> anything else?
>
> I'm also interested if anyone else has thoughts on the matter, I've
> cc'd dnx-devel. At first glance I would vote for sending the original
> return code all the way up to Nagios. But at the very least this
> should be documented somewhere.
>
> Eric
>
> John Calcote wrote:
> > Eric,
> >
> > Would you rather have dnx clients return the exact error code that
> > was returned by the shell?
> >
> > If there are any other interested parties, please also respond here.
> > This functionality has been this way in DNX forever. :) I don't mind
> > changing it, but I want to make sure I won't break a bunch of people
> > if I do.
> >
> > John
> >
> > On 7/29/2010 2:47 PM, Eric Schoeller wrote:
> > > So you're basically saying that Roger's plugin is returning -1 and
> > > DNX translates this to 3 when it passes it to Nagios?
> > >
> > > The default behavior for Nagios is to accept an out of bounds
> > > return code, and report it as such with a "return code out of
> > > bounds" message. I think you can configure what state nagios uses
> > > for such instances.
> > >
> > > John Calcote wrote:
> > > > Hi Roger,
> > > >
> > > > The following result codes are defined in the dnxPlugin.h header file:
> > > >
> > > > #define DNX_PLUGIN_RESULT_OK       0 // DNX plugin result: success.
> > > > #define DNX_PLUGIN_RESULT_WARNING  1 // DNX plugin result: warning.
> > > > #define DNX_PLUGIN_RESULT_CRITICAL 2 // DNX plugin result: critical.
> > > > #define DNX_PLUGIN_RESULT_UNKNOWN  3 // DNX plugin result: unknown.
> > > >
> > > > There is code in the plugin handler that basically looks like this:
> > > >
> > > >     int result = do-shell-command(...)
> > > >     if result < DNX_PLUGIN_RESULT_OK OR result > DNX_PLUGIN_RESULT_UNKNOWN then
> > > >         prepend "[EC <result>]" to resulting message text
> > > >         return DNX_PLUGIN_RESULT_UNKNOWN (3)
> > > >
> > > > Thus, no matter what the plugin's shell returns to DNX client, if it's
> > > > outside the range of known results, the result returned to Nagios by DNX
> > > > is 3 (result unknown), but the real shell result code is displayed in
> > > > the text in the [EC <result>] value. Thus, whenever DNX returns a 3
> > > > (result unknown) to Nagios, it can provide the true result code for
> > > > human parsing in the message text.
> > > >
> > > > Btw, I've committed a minor change that *may* reduce the number of such
> > > > occurrences because on some systems the shell status code may be encoded
> > > > in the waitpid status value a bit differently. We were assuming a
> > > > particular format rather than using the macros designed to parse out the
> > > > status code.
> > > >
> > > > Regards,
> > > > John
> > > >
> > > > On 7/29/2010 10:48 AM, Roger Torrentsgenerós wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have noticed that sometimes, some checks status in Nagios are shown
> > > > > like this:
> > > > >
> > > > > [EC -1]NTP OK: Offset 0.0007655997179 secs
> > > > >
> > > > > This is what dnxsrv.audit.log shows (debug=1):
> > > > >
> > > > > [Thu Jul 29 18:24:00.211 2010] DISPATCH: Job 125865: Worker 195.10.10.170-aa80a8c0: /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"
> > > > > [Thu Jul 29 18:24:00.211 2010] ASSIGN: Job 125865: Worker 195.10.10.170-aa80a8c0: /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"
> > > > > [Thu Jul 29 18:24:00.312 2010] COLLECT: Job 125865: Worker 195.10.10.170-aa80a8c0: /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"
> > > > >
> > > > > This is what dnxcld.debug.log says (debug=2):
> > > > >
> > > > > [Thu Jul 29 18:24:00.220 2010] dnxPluginExecute: Executing /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"
> > > > >
> > > > > Nagios log says:
> > > > >
> > > > > [1280420649] SERVICE ALERT: streamer022.p4.bt.bcn;ntp;UNKNOWN;SOFT;1;[EC -1]NTP OK: Offset 0.0007655997179 secs
> > > > >
> > > > > What is this "[EC-1]" thing? In all cases, the check result message says
> > > > > OK, but returns an "exit 3" so Nagios treats it as an UNKNOWN state. And
> > > > > every time, the next check of the same service (rescheduled or
> > > > > automatic) always returns an OK, with exit status 0 and a correctly
> > > > > formed status message.
> > > > >
> > > > > So can someone explain what is that thing, what does it mean and what's
> > > > > for?
> > > > >
> > > > > Thanks a lot!
> > > >
> > > > ------------------------------------------------------------------------------
> > > > The Palm PDK Hot Apps Program offers developers who use the
> > > > Plug-In Development Kit to bring their C/C++ apps to Palm for a share
> > > > of $1 Million in cash or HP Products. Visit us here for more details:
> > > > http://p.sf.net/sfu/dev2dev-palm
> > > > _______________________________________________
> > > > Dnx-users mailing list
> > > > Dnx...@li...
> > > > https://lists.sourceforge.net/lists/listinfo/dnx-users
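
For anyone following the thread, here is a minimal C sketch of the two things John describes: decoding the waitpid status with the standard POSIX macros instead of assuming a bit layout, and folding out-of-range results into UNKNOWN with an "[EC <result>]" prefix. This is not the DNX source. The function names decode_wait_status, clamp_result and run_command are hypothetical, as is the use of -1 as a "did not exit normally" sentinel; only the four DNX_PLUGIN_RESULT_* values and the prefix format come from John's message. Whether such a sentinel is how DNX actually ends up reporting -1 is not established anywhere in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Values quoted from dnxPlugin.h in the thread. */
#define DNX_PLUGIN_RESULT_OK       0
#define DNX_PLUGIN_RESULT_WARNING  1
#define DNX_PLUGIN_RESULT_CRITICAL 2
#define DNX_PLUGIN_RESULT_UNKNOWN  3

/* Hypothetical helper: pull the exit code out of a waitpid() status using
 * the POSIX macros. Returns -1 (an assumed sentinel) when the child did
 * not exit normally, for example when it was killed by a signal. */
int decode_wait_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);   /* 0..255 */
    return -1;
}

/* Hypothetical helper mirroring the pseudocode in the thread: results
 * outside OK..UNKNOWN become UNKNOWN, and the original code is kept in
 * the message text as "[EC <result>]". In-range results pass through. */
int clamp_result(int result, const char *msg, char *out, size_t outlen)
{
    assert(msg != NULL && out != NULL && outlen > 0);
    if (result < DNX_PLUGIN_RESULT_OK || result > DNX_PLUGIN_RESULT_UNKNOWN) {
        snprintf(out, outlen, "[EC %d]%s", result, msg);
        return DNX_PLUGIN_RESULT_UNKNOWN;
    }
    snprintf(out, outlen, "%s", msg);
    return result;
}

/* Hypothetical helper: run a command through /bin/sh and return its
 * decoded exit code, or -1 if fork/waitpid fails. */
int run_command(const char *cmd)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                   /* exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return decode_wait_status(status);
}
```

With these helpers, clamp_result(run_command("exit 2"), "some message", buf, sizeof buf) passes the CRITICAL result through unchanged, while a result of 7 or -1 comes back as 3 with the original code preserved in the text. The pass-through behaviour Roger and Eric argue for would amount to skipping clamp_result and handing the decoded code to Nagios as is.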