mon-devel Mailing List for mon (Page 4)
Brought to you by:
trockij
From: Augie S. <aug...@gm...> - 2007-11-15 19:22:00
|
On 11/15/07, Jim Trocki <tr...@ar...> wrote:
> regarding the syslog bug, it's wrapped up in an eval to handle exception
> processing from deeper levels in Sys::Syslog, and the other gunk in there (the
> map) is a workaround for a bug in an older version of Sys::Syslog (0.07). the
> better way to fix this is to have it bail out on startup if the old buggy
> version is found, and tell people to get a newer version.

I recently ran into this, and as far as I could tell the 'map' wasn't needed, and it was even causing the interpreter to fail because it's trying to alter @_, a read-only variable, which later Perl revs. fail on. My quick fix was the following:

# diff -u mon.orig /usr/sbin/mon
--- mon.orig 2007-11-07 15:16:35.000000000 -0800
+++ /usr/sbin/mon 2007-11-07 16:04:09.000000000 -0800
@@ -5385,8 +5385,9 @@
 sub syslog {
     eval {
         local $SIG{"__DIE__"}= sub { };
-        my @log = map { s/\%//mg; } @_;
-        Sys::Syslog::syslog(@log);
+#       my @log = map { s/\%//mg; } @_;
+#       Sys::Syslog::syslog(@log);
+        Sys::Syslog::syslog(@_);
     }
 }
 use warnings;

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
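Editorial note: the behaviour described above can be reproduced in isolation. The sketch below is not code from mon; the helper names `strip_percent_buggy` and `strip_percent_copy` are made up for illustration. It shows that `s///` inside `map` returns the substitution count rather than the edited string, and that it edits the caller's values through the `$_` alias, which is also the path by which a read-only argument can trigger a "Modification of a read-only value attempted" error.

```perl
use strict;
use warnings;

# Same shape as the original line: s/// inside map yields the number of
# substitutions (or an empty string), not the edited text, and it
# modifies the caller's values through the $_ alias.
sub strip_percent_buggy {
    return map { s/\%//mg; } @_;
}

# Copy-then-edit variant (hypothetical helper): the arguments are left
# untouched and the cleaned strings are returned.
sub strip_percent_copy {
    return map { (my $copy = $_) =~ s/%//g; $copy } @_;
}

my @args  = ('info', '50% done');
my @buggy = strip_percent_buggy(@args);   # side effect: @args is edited in place

my @input = ('info', '50% done');
my @clean = strip_percent_copy(@input);   # @input stays as it was

print "buggy result: [@buggy]\n";
print "clean result: [@clean]\n";
```

With the buggy form, the list handed to `Sys::Syslog::syslog()` holds counts instead of the priority and message, which would be consistent with the "doesn't log at all" report elsewhere in this thread.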
From: Jim T. <tr...@ar...> - 2007-11-15 18:28:47
|
On Tue, 9 Oct 2007, Wolfram Schlich wrote:
> 'upalertafter' is only supported for period definitions, not for
> service definitions itself. Despite that fact, process_event()
> (mon line 3365) looks for $sref->{"upalertafter"}, which obviously
> doesn't exist.

correct. this is part of the bug.

> A place where the code loops through the periods and where one
> could check it is within the do_alert() function.

correct again. the upalertafter processing is being handled in the wrong place. a while back i had cleaned up the code to make the trap processing use the same squelch logic as the other processing by putting that in process_event. this fixed some trap bugs, and i had intended to do some more cleanup related to that. so it does appear that the way to fix this is to rip out the decisions to call do_alert from process_event and stick them into do_alert.

> Unfortunately, when you place the upalertafter check in there,
> it will only be run once, because process_event() already resets

sure, just some minor details :)

david, have you had a look at this yet, and have you formulated an opinion on this? i'll move on this, but just let us know if you have some ideas.

regarding the syslog bug, it's wrapped up in an eval to handle exception processing from deeper levels in Sys::Syslog, and the other gunk in there (the map) is a workaround for a bug in an older version of Sys::Syslog (0.07). the better way to fix this is to have it bail out on startup if the old buggy version is found, and tell people to get a newer version.

fwiw, the perl that ships with sles10, fc6, rhel5 all include the newer version. sles8 and sles9 have the buggy version.

from the manual:

    Note "Sys::Syslog" version v0.07 and older passed the $message as the
    formatting string to "sprintf()" even when no formatting arguments
    were provided. If the code calling "syslog()" might execute with older
    versions of this module, make sure to call the function as
    "syslog($priority, "%s", $message)" instead of
    "syslog($priority, $message)". This protects against hostile
    formatting sequences that might show up if $message contains tainted
    data.
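Editorial note: a minimal sketch of the two ideas in this message, assuming nothing about mon's real startup code. `syslog_is_new_enough` and `safe_syslog_args` are hypothetical helper names, and the `0.08` threshold is simply "anything newer than 0.07". The first helper relies on the standard `VERSION` method dying when the loaded module is too old; the second builds the `"%s"` call form quoted from the manual, so the message is never used as a format string.

```perl
use strict;
use warnings;
use Sys::Syslog ();

# Hypothetical startup check: VERSION() dies if the loaded Sys::Syslog
# is older than the requested version, so eval turns that into 0 or 1.
sub syslog_is_new_enough {
    return eval { Sys::Syslog->VERSION('0.08'); 1 } ? 1 : 0;
}

# Hypothetical helper: build the argument list for
# Sys::Syslog::syslog() in the "%s" form recommended by the manual.
sub safe_syslog_args {
    my ($priority, @parts) = @_;
    return ($priority, '%s', join(' ', @parts));
}

my ($prio, $fmt, @rest) = safe_syslog_args('info', 'disk 100% full on %s');

# What the formatting step would produce; no syslog daemon is contacted.
my $rendered = sprintf($fmt, @rest);
print "$prio: $rendered\n";
print "Sys::Syslog new enough: ", syslog_is_new_enough(), "\n";
```

In real use the list would be passed straight on, as in `Sys::Syslog::syslog(safe_syslog_args($priority, $message))`; the `sprintf` line here only demonstrates that percent signs in the message survive untouched.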
From: Wolfram S. <li...@wo...> - 2007-11-15 14:57:43
|
* David Nolan <vit...@cm...> [2007-10-09 17:54]:
> On 10/9/07, Wolfram Schlich <li...@wo...> wrote:
> >
> > I am really surprised that such an essential feature is
> > "unknown broken" :-/
>
> Ahh, now that you've actually described the problem my first thought
> was "I thought we fixed that!"
>
> But nope, it's still there...
>
> I'll try to dig into it and write a fix sometime soon.

So, any news on this issue? This is really preventing me from productively using mon for an active/passive cluster setup monitoring... :-(

--
Regards, Wolfram Schlich <wsc...@ge...>
Gentoo Linux * http://dev.gentoo.org/~wschlich/
From: Wolfram S. <li...@wo...> - 2007-10-10 07:55:59
|
* Wolfram Schlich <li...@wo...> [2007-10-09 16:33]: > with the current mon-1.2.0, the syslog() function is broken. ...meaning it doesn't log *at all*. |
From: David N. <vit...@cm...> - 2007-10-09 15:53:35
|
On 10/9/07, Wolfram Schlich <li...@wo...> wrote:
> I am really surprised that such an essential feature is
> "unknown broken" :-/
>

Ahh, now that you've actually described the problem my first thought was "I thought we fixed that!"

But nope, it's still there...

I'll try to dig into it and write a fix sometime soon.

-David
From: Wolfram S. <li...@wo...> - 2007-10-09 14:55:21
|
* David Nolan <vit...@cm...> [2007-10-09 16:34]:
> On 10/9/07, Wolfram Schlich <li...@wo...> wrote:
> > Hi,
> >
> > with the current mon-1.2.0, the "upalertafter" functionality is simply
> > broken.
> >
> > I tried to find a quick fix by digging into the code, but I failed to
> > find a sane modification to process_event() and/or do_alert().
> >
> > When will this be fixed?
>
> Can you provide us with a better problem description than "simply broken"?
>
> Basically what we need to know is:
> - how did you have the period configured (a copy of all period config
> statements, without any local details like email addresses, etc.)
> - how did you test it? presumably by generating a failure, receiving
> an alert and then generating a success. details please on timing,
> alerts, etc. i.e. what did you expect to happen, and what actually
> happened?
>
> Once we have a detailed bug report to investigate we can try to track this down.

Sorry -- when I looked at the code it was so obvious that it's broken, so I thought you (developers) all know about it and just haven't fixed it due to whatever reason :-)

Ok, let's proceed...

'upalertafter' is only supported for period definitions, not for service definitions itself. Despite that fact, process_event() (mon line 3365) looks for $sref->{"upalertafter"}, which obviously doesn't exist.

A place where the code loops through the periods and where one could check it is within the do_alert() function. Unfortunately, when you place the upalertafter check in there, it will only be run once, because process_event() already resets the status from FAIL to OK (I actually tried to put the check there, and it also got executed, but only once, making the 'upalertafter' check senseless), thus never executing more than 1 do_alert() for an upalert (which is correct for other reasons).

A correct check for upalertafter that evaluates to true or false looks like this:

Decision to suppress the upalert:

    ($tmnow - $sref->{'_last_failure'} < $pref->{'upalertafter'})

Decision to run the upalert:

    ($tmnow - $sref->{'_last_failure'} >= $pref->{'upalertafter'})

...but clearly not like the one present in line 3366:

    ($tmnow - $sref->{"_first_failure"}) >= $sref->{"upalertafter"})

Maybe it would be best to add some period looping code to process_event() and check upalertafter there.

I am really surprised that such an essential feature is "unknown broken" :-/

Best regards,
Wolfram
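Editorial note: the two decisions proposed in this message can be written as one small predicate. This is only a sketch of the check as Wolfram states it, not code from mon. `upalert_allowed` is a hypothetical name, `$sref` and `$pref` stand in for the service and period hashes mentioned above, and treating a period without `upalertafter` as "always allowed" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of mon. Returns 1 when the upalert
# should run and 0 when it should be suppressed, using the period-level
# setting ($pref) rather than the service hash ($sref).
sub upalert_allowed {
    my ($tmnow, $sref, $pref) = @_;

    # Assumption: no upalertafter in this period means nothing to suppress.
    return 1 unless defined $pref->{'upalertafter'};

    return ($tmnow - $sref->{'_last_failure'} >= $pref->{'upalertafter'}) ? 1 : 0;
}

my $now  = 1_000_000;                          # arbitrary example timestamp
my $sref = { _last_failure => $now - 120 };    # last failure seen 120s ago

my $suppressed = upalert_allowed($now, $sref, { upalertafter => 300 });
my $sent       = upalert_allowed($now, $sref, { upalertafter => 60 });
my $unset      = upalert_allowed($now, $sref, {});

print "upalertafter=300: $suppressed\n";
print "upalertafter=60:  $sent\n";
print "not configured:   $unset\n";
```

Because the threshold comes from the period hash, a caller would need to evaluate this once per matching period, which is the period-looping step the message suggests adding.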
From: David N. <vit...@cm...> - 2007-10-09 14:33:43
|
On 10/9/07, Wolfram Schlich <li...@wo...> wrote:
> Hi,
>
> with the current mon-1.2.0, the "upalertafter" functionality is simply
> broken.
>
> I tried to find a quick fix by digging into the code, but I failed to
> find a sane modification to process_event() and/or do_alert().
>
> When will this be fixed?
>

Can you provide us with a better problem description than "simply broken"?

Basically what we need to know is:
- how did you have the period configured (a copy of all period config statements, without any local details like email addresses, etc.)
- how did you test it? presumably by generating a failure, receiving an alert and then generating a success. details please on timing, alerts, etc. i.e. what did you expect to happen, and what actually happened?

Once we have a detailed bug report to investigate we can try to track this down.

-David
From: Wolfram S. <li...@wo...> - 2007-10-09 14:32:11
|
Hi, with the current mon-1.2.0, the syslog() function is broken. Here is something I hacked together which may not be an ideal solution but just works compared to the broken original code: http://dev.gentoo.org/~wschlich/src/mon-1.2.0-syslog.patch -- Regards, Wolfram Schlich <wsc...@ge...> Gentoo Linux * http://dev.gentoo.org/~wschlich/ |
From: Wolfram S. <li...@wo...> - 2007-10-09 14:24:29
|
Hi, with the current mon-1.2.0, the "upalertafter" functionality is simply broken. I tried to find a quick fix by digging into the code, but I failed to find a sane modification to process_event() and/or do_alert(). When will this be fixed? Best regards, Wolfram Schlich |
From: Augie S. <aug...@gm...> - 2007-04-26 17:08:34
|
http://sourceforge.net/tracker/index.php?func=detail&aid=1708251&group_id=170&atid=100170 The snpp.alert does not work with "use strict" enabled; attached is a simple patch to fix this behavior. -- Augie Schwer - Augie@Schwer.us - http://schwer.us Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072 |
From: Augie S. <aug...@gm...> - 2007-04-26 17:02:56
|
The nntp.monitor does not handle the "-f" flag properly. It looks like this was changed in a previous bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=405318&group_id=170&atid=100170

but this does not work with newer Perl installs; it should go back to being $opt_f ; simple patch attached to fix the behavior.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Augie S. <aug...@gm...> - 2007-04-26 16:50:26
|
http://sourceforge.net/tracker/index.php?func=detail&aid=1708231&group_id=170&atid=100170

The MON mon.monitor (http://mon.cvs.sourceforge.net/mon/mon/mon.d/mon.monitor?revision=1.1.1.1&view=markup) has one too many "p" options: one for the port to use, and the other for the password. As a result you can't actually use the "-p" flag for a password; attached is a patch to fix this behavior.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
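Editorial note: the attached patch is not reproduced in the archive, so the following is only one way such a clash could be avoided, sketched with the core Getopt::Std module. The choice of `-w` for the password, the port number and the host name are all hypothetical example values, not taken from mon.monitor.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical option letters, not taken from the actual patch:
# -p stays the port, -w carries the password.
local @ARGV = ('-p', '2583', '-w', 'secret', 'monhost');

my %opt;
getopts('p:w:', \%opt) or die "usage: $0 [-p port] [-w password] host\n";

my $port     = $opt{p};
my $password = $opt{w};
my $host     = shift @ARGV;    # first non-option argument

print "host=$host port=$port password given: ",
      (defined $password ? 'yes' : 'no'), "\n";
```

Giving each setting its own letter in the `getopts` specification keeps the two values from overwriting each other in `%opt`.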
From: Augie S. <aug...@gm...> - 2007-04-26 00:46:21
|
http://sourceforge.net/tracker/index.php?func=detail&aid=1707773&group_id=170&atid=100170 The process.monitor that is in the CVS HEAD does not change the return value ($RETVAL) when there is an snmp error; for example if the server's snmpd is down. A simple patch is included to fix this behavior. -- Augie Schwer - Augie@Schwer.us - http://schwer.us Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072 |
From: Augie S. <aug...@gm...> - 2007-03-23 17:12:21
|
On 3/22/07, Jim Trocki <tr...@ar...> wrote:
> On Thu, 22 Mar 2007, Ed Ravin wrote:
> > On Thu, Mar 22, 2007 at 10:43:17AM -0800, Augie Schwer wrote:
> >> What is the proper protocol for submitting patches? Send patch here?
> >> Submit bug report to the Source Forge project page, and include a
> >> patch and send the patch here?
> > I like that better - put the patches in SourceForge for the record,
> > but tell the list so that anyone who cares can chime in. For example,
> > those patches to mon.cgi looked mighty familiar - I made them myself
> > and reported the problem when I upgraded to the current level of Mon.
> ...ok, but which mon.cgi? Probably not this one:
> http://moncgi.sourceforge.net/

For what it's worth, all the mon.cgis I have seen have the same bug and won't be able to display ack comments with the new mon daemon running, so the patch I submitted would apply to Ryan's code.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Jim T. <tr...@ar...> - 2007-03-22 20:01:17
|
On Thu, 22 Mar 2007, Ed Ravin wrote:
> On Thu, Mar 22, 2007 at 10:43:17AM -0800, Augie Schwer wrote:
>> What is the proper protocol for submitting patches? Send patch here?
>
> Yes, but...
>
>> Submit bug report to the Source Forge project page, and include a
>> patch and send the patch here?
>
> I like that better - put the patches in SourceForge for the record,
> but tell the list so that anyone who cares can chime in. For example,
> those patches to mon.cgi looked mighty familiar - I made them myself
> and reported the problem when I upgraded to the current level of Mon.

...ok, but which mon.cgi? Probably not this one:

http://moncgi.sourceforge.net/

Ryan Clark's done a lot of work on it. He CSSized it, gave it a config file, and it looks like the code he released in December even has fancy Javascript stuff.
From: Augie S. <aug...@gm...> - 2007-03-22 19:16:21
|
On 3/22/07, Ed Ravin <er...@pa...> wrote:
> On Thu, Mar 22, 2007 at 10:43:17AM -0800, Augie Schwer wrote:
> > What is the proper protocol for submitting patches? Send patch here?
> > Submit bug report to the Source Forge project page, and include a
> > patch and send the patch here?
> I like that better - put the patches in SourceForge for the record,
> but tell the list so that anyone who cares can chime in. For example,
> those patches to mon.cgi looked mighty familiar - I made them myself
> and reported the problem when I upgraded to the current level of Mon.
> We definitely need a bit more discipline about getting fixes into CVS.

Sounds good; I'll gather my patches and go hunting around on the SF project page before submitting them there.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Augie S. <aug...@gm...> - 2007-03-22 19:13:50
|
--- doc/mon.8 2005-07-31 10:02:38.000000000 -0700
+++ doc/mon.8 2007-03-22 12:03:21.000000000 -0700
@@ -928,6 +928,13 @@
 .B monremote.pl
 is available in the clients directory.
+.TP
+.BI "unack_summary = "[0|1|y|yes|n|no]
+
+If set to "yes" or "1", then an acknowledged alert for a service
+will be un-acknowledged any time that the summary for that service
+changes.
+
 .SS "Hostgroup Entries"
 Hostgroup entries begin with the keyword

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Ed R. <er...@pa...> - 2007-03-22 18:57:27
|
On Thu, Mar 22, 2007 at 10:43:17AM -0800, Augie Schwer wrote:
> What is the proper protocol for submitting patches? Send patch here?

Yes, but...

> Submit bug report to the Source Forge project page, and include a
> patch and send the patch here?

I like that better - put the patches in SourceForge for the record, but tell the list so that anyone who cares can chime in. For example, those patches to mon.cgi looked mighty familiar - I made them myself and reported the problem when I upgraded to the current level of Mon.

We definitely need a bit more discipline about getting fixes into CVS.
From: Augie S. <aug...@gm...> - 2007-03-22 18:47:03
|
The MAIN branch is older than the mon-1-0-0pre1 branch, but in my case the MAIN branch was the only one with the "unack_summary" feature, so are code changes not being merged back and forth? -- Augie Schwer - Augie@Schwer.us - http://schwer.us Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072 |
From: Augie S. <aug...@gm...> - 2007-03-22 18:43:24
|
What is the proper protocol for submitting patches? Send patch here? Submit bug report to the Source Forge project page, and include a patch and send the patch here? -- Augie Schwer - Augie@Schwer.us - http://schwer.us Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072 |
From: Augie S. <aug...@gm...> - 2007-03-22 17:19:07
|
In the latest mon CVS HEAD the mon backend no longer stores $op{$group}{$service}{'ack'} as just '1' or '0'; the following patch updates the mon.cgi in /mon/clients/mon.cgi of the CVS HEAD to reflect the backend change.

--- clients/mon.cgi 2005-07-31 09:59:13.000000000 -0700
+++ clients/mon.cgi.new 2007-03-22 10:15:11.086763007 -0700
@@ -1083,7 +1083,7 @@
         # Escape the HTML to avoid any potential nastiness if the
         # user requested it, otherwise, just pass it on through
         # as is.
-        if ( $op{$group}{$service}{'ack'} == 1 ) {
+        if ( $op{$group}{$service}{'ack'} != 0 ) {
             if ($untaint_ack_msgs) {
                 #
                 # We untaint
@@ -1678,7 +1678,7 @@
     $webpage->print("<tr>");
     $webpage->print("<td $ack_command_bgcolor colspan=1 width=50%>");
-    if ($op{$group}{$service}{'ack'} == 1) {
+    if ($op{$group}{$service}{'ack'} != 0) {
         # Service has already been acked, offer to re-ack
         $acknowledge_string = "<font size=+1><b>Re-acknowledge this failure:</b></font><br>(changes the acknowledgement message)<br>";
         $ackcomment_default = "Was:\"$op{$group}{$service}{'ackcomment'}\"";

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Augie S. <aug...@gm...> - 2007-03-22 16:43:17
|
On 3/7/07, David Nolan <vit...@cm...> wrote:
> You are correct that the old mon 0.99.2 code exhibits this behavior.
> The more recent code in CVS has a configurable feature that causes mon
> to remove the ack state from a service if the summary component of the
> failure message changes. In most common usage the summary is the list
> of hosts that are failing, so additional hosts failing would remove an
> ack.

After poking around in the latest HEAD release I found the unack_summary configuration option, and it works great after patching to fix a few bugs:

--- mon 2007-03-20 15:33:26.000000000 -0700
+++ mon 2007-03-21 14:15:22.000000000 -0700
@@ -1132,11 +1132,11 @@
         } elsif ($1 eq "unack_summary") {
             if (defined $2) {
                 if ($2 =~ /y(es)?/i) {
-                    $2 = 1;
+                    $UNACK_SUMMARY= 1;
                 } elsif ($2 =~ /n(o)?/i) {
-                    $2 = 0;
+                    $UNACK_SUMMARY= 0;
                 }
-                if ($2 eq "0" || $2 eq "1") {
+                elsif ($2 eq "0" || $2 eq "1") {
                     $UNACK_SUMMARY = $2;
                 } else {
                     return "cf error: invalid unack_summary value '$2' (syntax: unack_summary [0|1|y|yes|n|no])";

$2 is a read-only variable, and trying to assign to it throws errors on my system; the above patch addresses that.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
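Editorial note: one way to sidestep the read-only capture variable altogether is to copy the value into a lexical before testing it. The sketch below is not mon's parser; `parse_unack_summary` is a hypothetical helper, the sample config line is made up, and the anchored patterns are a design choice that is stricter than the unanchored `/y(es)?/i` in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of mon: normalise an unack_summary
# value to 1 or 0, or return undef for anything else. It works on a
# copy, so the read-only capture variable is never assigned to.
sub parse_unack_summary {
    my ($value) = @_;
    return unless defined $value;
    return 1 if $value =~ /^(?:1|y|yes)$/i;
    return 0 if $value =~ /^(?:0|n|no)$/i;
    return;
}

my $UNACK_SUMMARY = 0;
my $line = 'unack_summary = yes';   # made-up config line for the demo

if ($line =~ /^(\w+)\s*=\s*(\S+)$/ && $1 eq 'unack_summary') {
    my $raw  = $2;                  # copy before any further matching
    my $flag = parse_unack_summary($raw);
    if (defined $flag) {
        $UNACK_SUMMARY = $flag;
    }
    else {
        print "cf error: invalid unack_summary value '$raw'\n";
    }
}

print "UNACK_SUMMARY=$UNACK_SUMMARY\n";
```

Anchoring the patterns means a value such as "maybe" or "any" is rejected instead of being accepted because it happens to contain a "y" or an "n".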
From: Jim T. <tr...@ar...> - 2007-03-16 17:57:55
|
On Thu, 8 Mar 2007, Augie Schwer wrote: > Well it is good to hear that the project hasn't been abandoned; I Not at all. I have a list of TODO items for fixes and feature enhancements, some of them as recent as a week or two ago. > think people will naturally assume that the project is abandoned not > only because of the lack of stable release, but because of the > cosmetic public facing things: the kernel.org mon site has a few > broken links, and the SourceForge mailing list archive either isn't > keeping up, or is slow on this thread because the last message on > there is over a year old. > > All these things combined will most assuredly hurt the mon project's > popularity and its ability to gather feedback; that being said, I will > grab the latest CVS, stay on the list and hopefully have some feedback > for you. :) Thank you very much for the reply David. Yes, feedback is very important, bad or good. Here's the todo so far, which covers protocol changes to handle the per-host status: ------------------------------- -no alerts for n mins -better cookbook of examples, including some pre-fab m4 defines for templates with focus on the ability to quickly configure mon out-of-the-box for the most common setups -period "templates" > like I have to repeat my period definitions all 260 times, one for > each watch. we should have templates in the Mon config file for any > kind of object so it can be reused. so do you mean a way to define a "template" for a period so that you don't need to keep rewriting "wd {Sun-Sat}", or so that it'll use some default period if you don't specify one, or what? i can see this working a bunch of different ways. like this? define period-template xyz period wd {Sun-Sat} alert mail.alert mi...@do... alert page.alert mis...@do... 
alertevery 1h watch something service something period template(xyz) watch somethingelse service something period template(xyz) # override the 1h alertevery 2h -my recent thoughts on config management are that the parsing should be all modularized, (a keeping the config parsing code in a separate perl module to be reused by other apps), and there should be a way to turn the resulting data structure into xml and importing the same back, not so you can write your config by hand in xml, but so you can use some generic xml editing tool to mess around with the config, to get one type of gui. -the most common things should be easiest to do, regardless of a gui or text file config. that is what makes stuff "easy". however, i don't think more complicated setups lend themselves to guis as much, and in complicated setups you have to invest a lot of time to learn how the tool works, and a fancy gui in that case is less of a payoff. this is for configuration, i mean. fancy guis for reporting and stuff are good, no doubt. -global alert definitions with their own squelches (alertevery, etc.) > also, alarms need to be collated so pagers and cell phones don't get > buried with large numbers of alerts. I have a custom solution that I > wrote for this, but it's a lousy solution since it essentially implements > its own paging system. i could see how it would be good to be able to define some alert destinations *outside* of the period definitions, then refer to them in the period definitions, then you can do "collation" that way. like this: define global-alert xyz mail.alert xy...@lm... alertevery 1h watch service period globalalert xyz <---collated globally watch service period globalalert xyz <---collated globally alert mail.alert pd...@lm... <---not collated that would be quite easy to do and i think very useful. you could apply all the same squelch knobs (alertevery, etc.) to the global ones. 
----- (from mon-1.2.0) $Id: TODO,v 1.2 2004/11/15 14:45:16 vitroth Exp $ -add short a "radius howto" to the doc/ directory. -make traps authenticate via the same scheme used to obscure the password in RADIUS packets -descriptions defined in mon.cf should be 'quoted' -document command section and trap section in authfile -finish support for receiving snmp traps -output to client should be buffered and incorporated into the I/O loop. There is the danger that a sock_write to a client will block the server. -finish muxpect -make "chainable" alerts ?? i don't recall who asked for this or how it would work -make alerts nonblocking, and handle them in a similar fashion to monitors. i.e., serialize per-service (or per-period) alerts. -document "clear" client command -Document trap authentication. -Document traps. -Make monitors parallelize their tasks, similar to fping.monitor. This is an important scalability problem. -re-vamp the host disabling. 1) store them in a table with a timeout on each so that they can automatically re-enable themselves so people don't forget to re-enable them manually. 2) don't do the disabling by "commenting" them out of the host groups. We still want them to be tested for failure, but just disable alerts that have to do with the disabled hosts. When a host is commented out, accept a "reason" field that is later accessible so that you can tell why someone disabled the host. -allow checking a service at a particular time of day, maybe using inPeriod. -maybe make a command that will disable an alert for a certain amount of time -make it possible to disable just one of multiple alarms in a service -make a logging facility which forks and execs external logging daemons and writes to them via some ipc such as unix domain socket. mon should be sure that one of each type of these loggers is running at all times. configure the logging either globally or for each service. 
write both the success and failure status to the log in some "list opstatus" type format. each logger can do as it wishes with the data (e.g. stuff it into rrdtool, mysql, cat it to a file, etc.) # global setting logger = file watch stuff service http logger file -p _LOGDIR_ ... service fping # this will use the global logger setting ... service # this will override the global logger setting logger none ... common options to logger: -d dir path to logging dir -f file name of log file -g, -s group, service ----------- notes on a v2 protocol redesign from trockij - Configuring on a hostgroup scheme works very well. In the beginning, mon was never intended to get this complex(tm), it was intended to be a tool where it was easy to whip up custom monitoring scripts and alert scripts and plug them into a framework which allowed them all to connect to each other, and to have a way to easily build custom clients and report generators as well. - However, per host status is needed now. - This requires changes to both mon itself and also the monitors / alerts. Backward compatibility is important, and KISS is very important to retain the ease at which one can whip up a new monitor or alert or reporting client. - There will be a new protocol for communicating with the monitors / alerts, which will be masked by a Mon::Monitor / Mon::Alert module in Perl. Appropriate shell functions will be provided by the first one who asks. See below for the protocol. - We still want to retain the benefits of the old behaviour, but extend some alert management features, such as the ability to liberate alert definitions from the service periods so they can be used globally. - The server code might be broken up into multiple files (I/O routines, config parser, related parts, etc) - monitors can communicate better with the alerts (see below). For example, the monitor might hint (using "a_mail_list") the mail.alert about where else to send a warning that a user dir goes over quota. 
(Attention should be paid to privacy that we don't accidentially inform all users that /home/foo/how-i-will-destroy-western-civilization/ is consuming 1GB too much space ;) - Associations: these allow monitors to communicate details about failures back to the server which can be used to specify who to alert. The associations are based on key/value pairs specified in the association config file, and are expanded on the alert command line (or possibly within the alert protocol) if "@assoc-*" is in the configuration. If a host assoc. is needed, an alert spec will look like: alert mail.alert ad...@xy... @assoc-host There are two association types (possibly more in the future): host associations, and user-defined associations. Host associations use the "assoc-host" specifier, and map one or more username to an individual host. User-defined associations are just that, and begin with the "assoc-u-" specifier. Monitors return associations via the "assoc-*" key in the monitor protocol. Alerts obtain association information either via command-line arguments which were expanded by the server from "@assoc-*" in the config file, or via the "assoc-*" key in the alert protocol. - Metrics are only passed to the mon server for "monitoring" purposes, but can be marked up in such a way that they could be easily piped to a logging utility, one which is not part of the mon process itself. monitors are _encouraged_ to collect and report performance data. "Failures" are basically just a conclusion based upon performance data and it makes no sense to collect the data twice, e.g. if you have mon polling ifInOctets.0 on a system, why should mrtg have to poll on its own. It may be desireable to propose a "unified logging system" which all monitors can easily use, something which is pluggable and extensible - The hostgroup syntax is going to be extended to add per host options. 
(which will be passed to the monitors / alerts using the new protocol) ns1.teuto.net( fs="/(80%,90%)",mail_list="lm...@te..." ) would be passed as "h_fs=/(80%,90%)" and "h_mail_list="lm...@te..." FLOATING MONITORS A floating monitor is started by mon and remains running for the entire time. If it dies, it is automatically restarted. The server forks off a separate process for fping and communicates with it via some IPC, like a named pipe or a socket or something. The floating monitor sits there waiting for a message from the server that says "start checking now". The server then adds this descriptor to %fhandles and %running and treats it similar to other forked monitors. When the floting monitor is done, it spits its output back to the server and then goes dormant again, awaiting another message from the server. Floating monitors are started when mon starts, and are restarted if mon notices that they go away. This is a way to save on fork() overhead, but to also PROTOCOL The protocol will be simple and ASCII based, in the form of "key=value". Line continuation will be provided by prefixing following lines with a ">". A "\n" on a line by itself indicates the start of a new block. The order of the keys should not be important. The first block will always contain metadata further defining the following blocks. The "version" key is always present. The current protocol version is "1". (In the examples, everything after a "#" is a comment and should be cut out) KEY CONVENTIONS Keys only private to monitors will be prefixed with an "m_". In the same vain, keys private to alerts will be prefixed with a "a_", and additional host option keys specified in the mon.cf file will be prefixed with a "h_" before being passed to monitors/alerts. By convention, flags only pertaining to a specific alert will embed that name in the key name too - ie keys only pertaining to "mail.alert" will start with "a_mail_". 
The key/value pairs will be passed to all processes for a specific service.
"h_" keys are static between invocations, as they come from the mon.cf
file. "m_" keys will be preserved between multiple monitor executions.
"a_" keys will be passed from the monitor to the alert script.

MONITOR PROTOCOL (monitor -> mon)

The metadata block is followed by a block describing the overall hostgroup
status, followed by a detailed status for each host.

The following keys are defined for the blocks:

"summary"     = contains a one line short summary of the status.
"status"      = up, fail, ignore
"metric_1"    = an opaque floating point number which can be referenced
                for triggering alerts. May try to give an "operational
                percentage". More than one metric may be returned.
                (Ping rtt, packet loss, disk space etc)
"description" = longer elaborate description of the current status.
"host"        = hostgroup member to which this status applies. The overall
                hostgroup status does not include this field.
"assoc-host"  = host association
"assoc-u-*"   = user-defined association

Here is an example for a hypothetical hostgroup with 2 hosts and the ping
service.

###
version=1

summary=Still alive.
metric_1=50 # Packet loss
metric_2=20.23 # rtt times
description=1 out of 2 hosts still responding.
> Whatever else one might want to say about the status. It is difficult to
> come up with a good text here so I will just babble.
status=up

host=foo.bar.com
metric_1=100
metric_2=0 # 100% packet loss makes rtt measurements difficult ;)
summary=ICMP unreachable from 2.2.2.2
status=fail
description=PING 2.2.2.2 (2.2.2.2): 56 data bytes
>
>--- 2.2.2.2 ping statistics ---
>23 packets transmitted, 0 packets received, 100% packet loss

metric_1=0
metric_2=52.1
summary=ICMP echo reply received ok
status=up
description=64 bytes from 212.8.197.2: icmp_seq=0 ttl=60 time=110.0 ms
>64 bytes from 212.8.197.2: icmp_seq=1 ttl=60 time=32.3 ms
>64 bytes from 212.8.197.2: icmp_seq=2 ttl=60 time=32.8 ms
>64 bytes from 212.8.197.2: icmp_seq=3 ttl=60 time=33.4 ms
>
>--- ns1.teuto.net ping statistics ---
>4 packets transmitted, 4 packets received, 0% packet loss
>round-trip min/avg/max = 32.3/52.1/110.0 ms
host=baz.bar.com
######

Points still open:

- mon -> monitor communication
- mon <-> alert communication
- the new trap protocol
- muxpect
- a unified logging proposal
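The block format from the PROTOCOL section (key=value lines, ">" continuation lines, an empty line between blocks) can be sketched as a small parser. This is only an illustration, not part of mon: the name `parse_blocks` is hypothetical, and two details are assumptions the proposal does not settle, namely that a continuation line is joined to the previous value with a newline and that "#" comments (used only in the examples) are not stripped.

```python
def parse_blocks(text):
    """Parse protocol text into a list of dicts, one dict per block."""
    blocks = []
    current = {}
    last_key = None
    for line in text.split("\n"):
        if line == "":
            # An empty line ends the current block and starts a new one.
            if current:
                blocks.append(current)
            current, last_key = {}, None
        elif line.startswith(">"):
            # Continuation: append to the most recent value (assumed join: "\n").
            if last_key is None:
                raise ValueError("continuation line without a preceding key")
            current[last_key] += "\n" + line[1:]
        else:
            # Split on the first "=" only, so values may contain "=" themselves.
            key, sep, value = line.partition("=")
            if not sep:
                raise ValueError("expected key=value, got %r" % line)
            current[key] = value
            last_key = key
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    sample = (
        "version=1\n"
        "\n"
        "summary=Still alive.\n"
        "description=1 out of 2 hosts still responding.\n"
        "> More text.\n"
        "status=up\n"
        "\n"
        "host=foo.bar.com\n"
        "status=fail\n"
    )
    for block in parse_blocks(sample):
        print(block)
```

Because each block is parsed into a dict, the order of keys within a block does not matter, which matches the proposal; the second host block in the example above, with "host" as its last key, parses the same way as the first.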
From: Augie S. <aug...@gm...> - 2007-03-15 19:23:34
A few patches that fixed some problems I found when building a mon RPM.

Patch spec file to add chkconfig:

--- mon.spec    2005-04-17 00:42:25.000000000 -0700
+++ mon.spec    2007-03-15 12:18:13.901720283 -0700
@@ -133,6 +133,9 @@
 if [ -d %{_localstatedir}/log -a ! -f %{_localstatedir}/log/mon_history.log ]; then
     touch %{_localstatedir}/log/mon_history.log
 fi
+if [ $1 = 1 ]; then
+    /sbin/chkconfig --add mon
+fi
 
 ###################################################################
 %postun

Patch the init script to point to the correct mon executable:

--- S99mon      2007-03-15 12:01:33.851816189 -0700
+++ S99mon      2007-03-15 12:04:13.203365320 -0700
@@ -21,7 +21,7 @@
 case "$1" in
     start)
        echo -n "Starting mon daemon: "
-       daemon /usr/lib/mon/mon -f -l -c /etc/mon/mon.cf
+       daemon /usr/sbin/mon -f -l -c /etc/mon/mon.cf
        echo
        touch /var/lock/subsys/mon
        ;;

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Augie S. <aug...@gm...> - 2007-03-08 14:07:13
On 3/7/07, David Nolan <vit...@cm...> wrote:
> On 3/7/07, Augie Schwer <aug...@gm...> wrote:
> > It seems that mon's ack system is flawed, at least if it is used how
> > the examples show: multiple hosts in a watch group.
> > Ack'ing an alert acks the service in the group, not the host in the
> > group that is alerting, so for example if you have a host group for
> > your web servers and you watch http; if http alerts on one of the
> > hosts and you ack it, the rest of your web servers could go down and
> > you would never know about it because the other hosts' http alerts
> > would be suppressed.
> > Is this expected behavior? Am I wrong to think that this is a flaw?
>
> You are correct that the old mon 0.99.2 code exhibits this behavior.
> The more recent code in CVS has a configurable feature that causes mon
> to remove the ack state from a service if the summary component of the
> failure message changes. In most common usage the summary is the list
> of hosts that are failing, so additional hosts failing would remove an
> ack.

Hot damn; that's exactly what I was looking at implementing as a
workaround.

> > I know the mon project is pretty much not maintained anymore, so if I
> > don't get any response back I won't be surprised, but I thought I
> > would float this question out there and see if I get any responses.
>
> While that's an understandable conclusion based on the lack of a stable
> release in approximately forever, there has been a lot of work since
> the last release. The lack of a (declared) stable release has been in
> part because of a lack of feedback on the development versions. In
> fact I posted a release candidate for mon 1.2.0 back in September
> (http://www.managedandmonitored.net/mon/) but I have received almost
> no feedback on this version. In many cases I assume the mon users
> just haven't had the opportunity to replace known-working systems or
> set up parallel monitoring infrastructure.
Well, it is good to hear that the project hasn't been abandoned. I think
people will naturally assume that it has been, not only because of the lack
of a stable release, but also because of the cosmetic, public-facing
things: the kernel.org mon site has a few broken links, and the SourceForge
mailing list archive either isn't keeping up or is slow on this thread,
because the last message on there is over a year old. All these things
combined will most assuredly hurt the mon project's popularity and its
ability to gather feedback. That being said, I will grab the latest CVS,
stay on the list and hopefully have some feedback for you. :)

Thank you very much for the reply, David.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
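The ack behaviour David describes for the CVS code (drop the ack when the summary of the failure changes) can be sketched as follows. This is a hedged illustration only: it is not taken from mon's source, and the names `ServiceState`, `record_failure` and `clear_ack_on_change` are hypothetical stand-ins for whatever the real configuration option and data structures are called.

```python
class ServiceState:
    """Tracks the ack state of one service in a watch group."""

    def __init__(self):
        self.acked = False
        self.last_summary = None

    def ack(self):
        """Operator acknowledges the current failure."""
        self.acked = True

    def record_failure(self, summary, clear_ack_on_change=True):
        """Record a failure; return True if an alert should be sent.

        If the summary differs from the previous one (for example, another
        host joined the list of failing hosts), the ack is removed so the
        new failure is not silently suppressed.
        """
        if self.acked and clear_ack_on_change and summary != self.last_summary:
            self.acked = False
        self.last_summary = summary
        return not self.acked


if __name__ == "__main__":
    http = ServiceState()
    print(http.record_failure("web1"))       # first failure
    http.ack()
    print(http.record_failure("web1"))       # same summary, still acked
    print(http.record_failure("web1 web2"))  # summary changed, ack removed
```

With `clear_ack_on_change=False` the sketch falls back to the 0.99.2 behaviour from the original question, where an ack suppresses alerts for every host in the group.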