From: Eric S. <esc...@us...> - 2010-07-29 21:13:18
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
John,<br>
<br>
Yep, that's exactly where I was heading with this. Now that I understand what DNX is doing, I'm torn between the options. Having DNX send the exact return code would be more consistent with how Nagios works in general, as well as how other passive distributed monitoring methods work. Do you know if there was a specific reason why DNX was designed to behave this way? Would changing the code break anything else?<br>
<br>
I'm also interested if anyone else has thoughts on the matter; I've cc'd dnx-devel. At first glance I would vote for sending the original return code all the way up to Nagios. But at the very least this should be documented somewhere.<br>
<br>
Eric<br>
<br>
John Calcote wrote:
<blockquote cite="mid:4C5...@gm..." type="cite">
Eric,<br>
<br>
Would you rather have dnx clients return the exact error code that was returned by the shell?<br>
<br>
If there are any other interested parties, please also respond here. This functionality has been this way in DNX forever. :) I don't mind changing it, but I want to make sure I won't break a bunch of people if I do.<br>
<br>
John<br>
<br>
On 7/29/2010 2:47 PM, Eric Schoeller wrote:
<blockquote cite="mid:4C5...@us..." type="cite">
So you're basically saying that Roger's plugin is returning -1 and DNX translates this to 3 when it passes it to Nagios?<br>
<br>
The default behavior for Nagios is to accept an out of bounds return code, and report it as such with a "return code out of bounds" message. I think you can configure what state nagios uses for such instances.<br>
<br>
John Calcote wrote:
<blockquote cite="mid:4C5...@gm..." type="cite">
<pre wrap="">Hi Roger,

The following result codes are defined in the dnxPlugin.h header file:

#define DNX_PLUGIN_RESULT_OK       0 // DNX plugin result: success.
#define DNX_PLUGIN_RESULT_WARNING  1 // DNX plugin result: warning.
#define DNX_PLUGIN_RESULT_CRITICAL 2 // DNX plugin result: critical.
#define DNX_PLUGIN_RESULT_UNKNOWN  3 // DNX plugin result: unknown.

There is code in the plugin handler that basically looks like this:

int result = do-shell-command(...)
if result < DNX_PLUGIN_RESULT_OK OR result > DNX_PLUGIN_RESULT_UNKNOWN then
    prepend "[EC <result>]" to resulting message text
    return DNX_PLUGIN_RESULT_UNKNOWN (3)

Thus, no matter what the plugin's shell returns to the DNX client, if it's
outside the range of known results, the result returned to Nagios by DNX
is 3 (result unknown), but the real shell result code is displayed in
the text in the [EC <result>] value. Thus, whenever DNX returns a 3
(result unknown) to Nagios, it can provide the true result code for
human parsing in the message text.

Btw, I've committed a minor change that *may* reduce the number of such
occurrences because on some systems the shell status code may be encoded
in the waitpid status value a bit differently. We were assuming a
particular format rather than using the macros designed to parse out the
status code.

Regards,
John

On 7/29/2010 10:48 AM, Roger Torrentsgenerós wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hi,

I have noticed that sometimes the status of some checks in Nagios is shown
like this:

[EC -1]NTP OK: Offset 0.0007655997179 secs

This is what dnxsrv.audit.log shows (debug=1):

[Thu Jul 29 18:24:00.211 2010] DISPATCH: Job 125865: Worker 195.10.10.170-aa80a8c0: /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"
[Thu Jul 29 18:24:00.211 2010] ASSIGN: Job 125865: Worker 195.10.10.170-aa80a8c0: /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"
[Thu Jul 29 18:24:00.312 2010] COLLECT: Job 125865: Worker 195.10.10.170-aa80a8c0: /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"

This is what dnxcld.debug.log says (debug=2):

[Thu Jul 29 18:24:00.220 2010] dnxPluginExecute: Executing /usr/libexec/nagios/plugins/check_nrpe -H 192.168.128.222 -c "check_ntp"

Nagios log says:

[1280420649] SERVICE ALERT: streamer022.p4.bt.bcn;ntp;UNKNOWN;SOFT;1;[EC -1]NTP OK: Offset 0.0007655997179 secs

What is this "[EC-1]" thing? In all cases, the check result message says
OK, but returns an "exit 3" so Nagios treats it as an UNKNOWN state. And
every time, the next check of the same service (rescheduled or
automatic) always returns an OK, with exit status 0 and a correctly
formed status message.

So can someone explain what that thing is, what it means and what it's
for?

Thanks a lot!
</pre>
</blockquote>
<pre wrap="">_______________________________________________
Dnx-users mailing list
<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:Dnx...@li...">Dnx...@li...</a>
<a moz-do-not-send="true" class="moz-txt-link-freetext" href="https://lists.sourceforge.net/lists/listinfo/dnx-users">https://lists.sourceforge.net/lists/listinfo/dnx-users</a>
</pre>
</blockquote>
</blockquote>
</blockquote>
</body>
</html> |
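For readers who want to see the remapping John describes in compilable form, here is a minimal C sketch. Only the four DNX_PLUGIN_RESULT_* values come from the message above; the function name `remap_plugin_result` and its signature are hypothetical and are not taken from the DNX source.

```c
#include <assert.h>
#include <stdio.h>

/* Result codes as quoted from dnxPlugin.h in the message above. */
#define DNX_PLUGIN_RESULT_OK       0
#define DNX_PLUGIN_RESULT_WARNING  1
#define DNX_PLUGIN_RESULT_CRITICAL 2
#define DNX_PLUGIN_RESULT_UNKNOWN  3

/*
 * Hypothetical helper (not the real DNX function): given the raw exit code
 * and the plugin's output text, write the message that would be forwarded
 * into `out` and return the result code that would be reported to Nagios.
 */
int remap_plugin_result(int raw, const char *text, char *out, size_t outlen)
{
    assert(text != NULL && out != NULL && outlen > 0);

    if (raw < DNX_PLUGIN_RESULT_OK || raw > DNX_PLUGIN_RESULT_UNKNOWN) {
        /* Out of range: keep the real code visible in the text, report UNKNOWN. */
        snprintf(out, outlen, "[EC %d]%s", raw, text);
        return DNX_PLUGIN_RESULT_UNKNOWN;
    }

    /* In range: forward code and text unchanged. */
    snprintf(out, outlen, "%s", text);
    return raw;
}
```

Called with a raw code of -1 and Roger's "NTP OK: ..." text, this sketch yields result 3 and a message starting with "[EC -1]", which matches what he saw in the Nagios log.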
From: John C. <joh...@gm...> - 2010-07-30 16:47:18
|
Roger,

On 7/30/2010 5:43 AM, Roger Torrentsgenerós wrote:
> I'm with Eric. The more transparent DNX is, the better we'll understand
> what happens when a check fails.

I've got a query in to Adam Augustine (also copied on this message), owner of the DNX project. I want his opinion before I commit any change that breaks backward compatibility.

> However, the current situation is that DNX "translates" an out of scope
> exit to an exit 3, and passes it to Nagios together with the original
> status message with a prepended "[EC-1]" (which I assume means exit
> code -1). The human parsing part is useful, but in my case having an
> UNKNOWN state also means triggering an alert, sending an SMS and maybe
> waking someone (probably me) up.
>
> Two questions come into my mind:
>
> - Why would DNX do that?

I don't really know what the original motivation was. As I stated earlier, it's been that way since the beginning, and I wasn't the original maintainer.

> We also have home-made checks. If one of my boys creates a check that
> returns out of scope codes, I'd like to see it as I have always seen it.
> Adding translations is also adding complexity when tracing errors back,
> so I think it's much better DNX simply acts as a messenger and returns
> to Nagios the check result "as is".
>
> - Why am I seeing checks that return -1 only when I use DNX?

This could have something to do with the way we were manually extracting the shell exit code from the waitpid status value. I changed this code to use the proper abstraction macros (WIFEXITED and WEXITSTATUS). I've committed this code, but to check the results you'd have to check out the latest source from subversion and build from scratch. It's not that hard to do, but you may wish to wait until we've decided what to do with the [EC...] thing. Such a build may be more valuable to you at that point.

John
|
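As an illustration of the macro-based decoding John mentions, the sketch below forks a child, waits for it, and reads the exit code with the POSIX macros from `<sys/wait.h>` (where the exit test is spelled `WIFEXITED`). The helper name `child_exit_code` is hypothetical; this is not the DNX code, and the thread does not establish exactly how the old manual decoding produced -1.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run a child that exits with `code` and return the
 * exit status as decoded by the POSIX macros. Returns -1 if the child could
 * not be created or waited for, or did not terminate through a normal exit.
 */
int child_exit_code(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                    /* fork failed */
    if (pid == 0)
        _exit(code);                  /* child: terminate immediately with `code` */

    if (waitpid(pid, &status, 0) != pid)
        return -1;                    /* wait failed */

    if (WIFEXITED(status))
        return WEXITSTATUS(status);   /* normal exit: low 8 bits of the code */

    return -1;                        /* killed by a signal or otherwise abnormal */
}
```

One consequence worth noting: because only the low 8 bits of the exit code reach the parent, a child that calls `_exit(-1)` is reported as 255 by `WEXITSTATUS`, not as -1. A real plugin runner would `exec` the plugin in the child and would probably also retry `waitpid` on `EINTR`; both are left out here to keep the sketch short.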
From: Adam A. <aug...@gm...> - 2010-07-30 20:29:45
|
I don't think making this change will hurt anything. Everyone is right: DNX should be as transparent as it reasonably can be, and this doesn't keep with that goal.

As I recall, the original motivation for doing this was to highlight plugins that were having problems, to distinguish them from DNX and Nagios problems. Nagios by default returns a CRITICAL for out-of-bounds return codes, which (at the time) we also felt was wrong. Out-of-bounds seemed like it should be UNKNOWN instead.

With the perspective of the intervening years it doesn't seem like much of an issue. Instead, it should have been something we petitioned Nagios to change. It is clearly not the proper role of DNX to interpret return codes. I vote for making the change.

Parallel to this, however, there are other cases where DNX does insert additional information into the status info field of the check results. "DNX: Plugin timeout" is one where we felt it was important to distinguish between a plugin executed by DNX and a plugin executed locally (to Nagios). It also does this where DNX itself times out or has other problems. John would be in a better position to enumerate all the cases where this occurs.

Anyway, I think these cases of modifying results make sense, but since the topic is being discussed I am interested in other opinions. Anyone care to comment?

Adam Augustine
|
From: Roger T. <rto...@fl...> - 2010-08-03 08:54:40
|
> Anyway, I think these cases of modifying results make sense, but since
> the topic is being discussed I am interested in other opinions. Anyone
> care to comment?

They make sense to me too. IMO, DNX *should* make an injection in the returned result always and only when there has been any anomaly in DNX. Being transparent doesn't mean being untraceable, it means not being intrusive. If DNX works as expected, let the plugins fail if they want. But if DNX has had any issue when executing a plugin (i.e. the timeouts stated above), it must notify the user, and an injection is the best way to me.

Of course, I also vote for making the change.

Cheers.

Roger
|
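The rule Roger proposes (annotate only when DNX itself had a problem, otherwise pass the plugin result through untouched) can be sketched in C as follows. Everything here is hypothetical except the "DNX: Plugin timeout" text, which Adam quotes; in particular, reporting a timeout as state 3 (UNKNOWN) is an assumption made for the example, not something the thread confirms.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical classification of what happened on the DNX side. */
enum dnx_anomaly {
    DNX_ANOMALY_NONE = 0,   /* plugin ran and returned on its own */
    DNX_ANOMALY_TIMEOUT     /* DNX gave up waiting for the plugin */
};

/*
 * Hypothetical helper: build the text and return the code to hand to Nagios.
 * The plugin's own code and text are forwarded unchanged unless DNX itself
 * had an anomaly.
 */
int build_check_result(enum dnx_anomaly anomaly, int plugin_code,
                       const char *plugin_text, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    if (anomaly == DNX_ANOMALY_TIMEOUT) {
        /* DNX-side problem: say so in the text. */
        snprintf(out, outlen, "DNX: Plugin timeout");
        return 3;   /* assumption for this sketch: report UNKNOWN */
    }

    /* No DNX anomaly: act as a messenger, even for out-of-range codes. */
    snprintf(out, outlen, "%s", plugin_text ? plugin_text : "");
    return plugin_code;
}
```

With this shape, an out-of-range plugin code such as -1 reaches Nagios as-is, and Nagios applies its own handling for out-of-bounds return codes.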
From: Roger T. <rto...@fl...> - 2010-07-30 11:43:39
|
I'm with Eric. The more transparent DNX is, the better we'll understand what happens when a check fails.

However, the current situation is that DNX "translates" an out of scope exit to an exit 3, and passes it to Nagios together with the original status message with a prepended "[EC-1]" (which I assume means exit code -1). The human parsing part is useful, but in my case having an UNKNOWN state also means triggering an alert, sending an SMS and maybe waking someone (probably me) up.

Two questions come into my mind:

- Why would DNX do that?

We also have home-made checks. If one of my boys creates a check that returns out of scope codes, I'd like to see it as I have always seen it. Adding translations is also adding complexity when tracing errors back, so I think it's much better DNX simply acts as a messenger and returns to Nagios the check result "as is".

- Why am I seeing checks that return -1 only when I use DNX?

I'm seeing one [EC-1] result every 4 or 5 minutes when I use DNX, including standard checks that come with the official nagios-plugins releases for RHEL5, for example check_ntp. The thing is, if I disable DNX, I never get a -1 status, or UNKNOWN "out of bounds", or whatever. I only get false positives when I use DNX.

As John said, it has to be something related to the way DNX fetches exit status from the plugins and tries to understand them. I don't know if his recent commit will fix something (will try), but I'm pretty sure that if DNX simply forwarded whatever it got from the plugin to Nagios, instead of trying to "understand" it, we'd eliminate some complexity and DNX wouldn't be the one to blame, but the plugin itself.

Cheers.

Roger
|
From: Adam A. <aug...@gm...> - 2010-07-30 22:06:15
|
1) Why would DNX do that?
A - See my other response. It was not the right thing to do. John's change should fix that.

2) Why am I seeing checks that return -1 only when I use DNX?
A - This one is a little more complex and subtle. I don't think it is DNX per se that is doing it.

You may see a different return code when running under DNX because the DNX client may be running as a different user from Nagios, because the DNX client is running from a different location in the network, or because DNX may run the check with different environment variables (DNX does not currently pass the environment variables from the Nagios server). Those reasons are just off the top of my head.

I don't know your setup, but because this seems to be intermittent, one thing we have seen is a particular worker node that, for whatever reason, is configured slightly differently from the others (/etc/resolv.conf didn't have a search domain, ntp.conf had a typo in one of the server IPs, freetds.conf had an entry missing), so a particular plugin behaves differently there than on all the others.
|
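One way to check the differences Adam lists (user and environment) is to run the same small probe once from the Nagios server and once through a DNX worker, then compare the output. The C sketch below is a hypothetical diagnostic, not part of DNX or nagios-plugins; the choice of PATH and HOME as the variables to report is only an example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a one-line summary of the execution context
 * (real and effective user id plus two example environment variables) into
 * `out`. Returns the value of snprintf, i.e. the length of the full summary.
 */
int describe_context(char *out, size_t outlen)
{
    const char *path = getenv("PATH");
    const char *home = getenv("HOME");

    assert(out != NULL && outlen > 0);

    return snprintf(out, outlen, "uid=%ld euid=%ld PATH=%s HOME=%s",
                    (long)getuid(), (long)geteuid(),
                    path ? path : "(unset)",
                    home ? home : "(unset)");
}
```

A wrapper plugin could print this line as its status text and exit 0; if the summary differs between the local run and the DNX run, that difference is a candidate explanation for a plugin that misbehaves only under DNX.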
From: John C. <joh...@gm...> - 2010-07-31 18:57:33
|
Hi all,

SVN revision 378 removes the range check. In this revision (and later) DNX clients will report the true shell error code. Please note that in making this change, the [EC <result>] text has been removed from the status message as well; it's no longer necessary because it would be redundant with the code reported to Nagios anyway.

The previous revision (377) contains the changes necessary to use the system WIFEXITED and WEXITSTATUS macros.

John
|