Thread: [Mon-devel] Mon's Acknowledge system flawed?
From: Augie S. <aug...@gm...> - 2007-03-07 20:03:21
It seems that mon's ack system is flawed, at least when it is used the way the examples show, with multiple hosts in a watch group.

Ack'ing an alert acks the service for the whole group, not the individual host in the group that is alerting. For example, if you have a host group for your web servers and you watch http: if http alerts on one of the hosts and you ack it, the rest of your web servers could go down and you would never know about it, because the other hosts' http alerts would be suppressed.

Is this expected behavior? Am I wrong to think that this is a flaw? It seems like the only way to use mon and not be bitten by this is to have only one host per host group.

I know the mon project is pretty much not maintained anymore, so if I don't get any response back I won't be surprised, but I thought I would float this question out there and see if I get any responses.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
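[For concreteness, the kind of setup being described looks roughly like this -- a minimal mon.cf sketch in the style of the stock examples; the hostnames, address, and interval are hypothetical:

    hostgroup webservers www1.example.com www2.example.com www3.example.com

    watch webservers
        service http
            interval 5m
            monitor http.monitor
            period wd {Sun-Sat}
                alert mail.alert ops@example.com
                alertevery 1h

With a config like this, an ack applies to the http service on the whole webservers group, which is the behavior being questioned above.]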
From: David N. <vit...@cm...> - 2007-03-08 05:03:50
On 3/7/07, Augie Schwer <aug...@gm...> wrote:

> It seems that mon's ack system is flawed, at least when it is used the
> way the examples show, with multiple hosts in a watch group.
>
> Ack'ing an alert acks the service for the whole group, not the
> individual host in the group that is alerting. For example, if you
> have a host group for your web servers and you watch http: if http
> alerts on one of the hosts and you ack it, the rest of your web
> servers could go down and you would never know about it, because the
> other hosts' http alerts would be suppressed.
>
> Is this expected behavior? Am I wrong to think that this is a flaw?

You are correct that the old mon 0.99.2 code exhibits this behavior. The more recent code in CVS has a configurable feature that causes mon to remove the ack state from a service if the summary component of the failure message changes. In the most common usage the summary is the list of hosts that are failing, so additional hosts failing would remove an ack.

There has also been some discussion in the past of adding true per-host status tracking to mon, but that proposal has never been followed through on. (IIRC, we got bogged down in discussing how we would need to add structure to the data communicated between mon and the monitor/alert scripts, and how to maintain backwards compatibility with existing scripts.)

> I know the mon project is pretty much not maintained anymore, so if I
> don't get any response back I won't be surprised, but I thought I
> would float this question out there and see if I get any responses.

While that's an understandable conclusion based on the lack of a stable release in approximately forever, there has been a lot of work since the last release. The lack of a (declared) stable release has been in part because of a lack of feedback on the development versions. In fact I posted a release candidate for mon 1.2.0 back in September (http://www.managedandmonitored.net/mon/) but I have received almost no feedback on this version. In many cases I assume the mon users just haven't had the opportunity to replace known-working systems or set up parallel monitoring infrastructure.

-David
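[The unack-on-summary-change feature described here amounts to something like the following in the server's failure path. This is a hedged sketch only: $UNACK_SUMMARY matches the option name that comes up later in this thread, but the other variable names are illustrative, not mon's actual internals:

    # if the summary (typically the list of failing hosts) differs from
    # the one that was acknowledged, drop the ack so new failures alert
    if ($UNACK_SUMMARY && $sref->{"ack"}
        && $summary ne $sref->{"acked_summary"}) {
        $sref->{"ack"} = 0;
    }
    $sref->{"acked_summary"} = $summary;
]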
From: Augie S. <aug...@gm...> - 2007-03-22 16:43:17
On 3/7/07, David Nolan <vit...@cm...> wrote:

> You are correct that the old mon 0.99.2 code exhibits this behavior.
> The more recent code in CVS has a configurable feature that causes mon
> to remove the ack state from a service if the summary component of the
> failure message changes. In the most common usage the summary is the
> list of hosts that are failing, so additional hosts failing would
> remove an ack.

After poking around in the latest CVS HEAD I found the unack_summary configuration option, and it works great after patching to fix a few bugs:

--- mon 2007-03-20 15:33:26.000000000 -0700
+++ mon 2007-03-21 14:15:22.000000000 -0700
@@ -1132,11 +1132,11 @@
 } elsif ($1 eq "unack_summary") {
     if (defined $2) {
         if ($2 =~ /y(es)?/i) {
-            $2 = 1;
+            $UNACK_SUMMARY = 1;
         } elsif ($2 =~ /n(o)?/i) {
-            $2 = 0;
+            $UNACK_SUMMARY = 0;
         }
-        if ($2 eq "0" || $2 eq "1") {
+        elsif ($2 eq "0" || $2 eq "1") {
             $UNACK_SUMMARY = $2;
         } else {
             return "cf error: invalid unack_summary value '$2' (syntax: unack_summary [0|1|y|yes|n|no])";

$2 is a read-only regex capture variable in Perl, and assigning to it throws errors on my system; the patch above addresses that.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
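[With that patch applied, enabling the feature should be a single global option in mon.cf. The option name and accepted values come from the error message in the patch above; the "key = value" form is an assumption based on mon's other global options:

    # enable removing acks when the failure summary changes
    unack_summary = yes
]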
From: Augie S. <aug...@gm...> - 2007-03-08 14:07:13
On 3/7/07, David Nolan <vit...@cm...> wrote:

> On 3/7/07, Augie Schwer <aug...@gm...> wrote:
> > It seems that mon's ack system is flawed, at least when it is used
> > the way the examples show, with multiple hosts in a watch group.
> >
> > Ack'ing an alert acks the service for the whole group, not the
> > individual host in the group that is alerting. For example, if you
> > have a host group for your web servers and you watch http: if http
> > alerts on one of the hosts and you ack it, the rest of your web
> > servers could go down and you would never know about it, because
> > the other hosts' http alerts would be suppressed.
> >
> > Is this expected behavior? Am I wrong to think that this is a flaw?
>
> You are correct that the old mon 0.99.2 code exhibits this behavior.
> The more recent code in CVS has a configurable feature that causes mon
> to remove the ack state from a service if the summary component of the
> failure message changes. In the most common usage the summary is the
> list of hosts that are failing, so additional hosts failing would
> remove an ack.

Hot damn; that's exactly what I was looking at implementing as a workaround.

> > I know the mon project is pretty much not maintained anymore, so if
> > I don't get any response back I won't be surprised, but I thought I
> > would float this question out there and see if I get any responses.
>
> While that's an understandable conclusion based on the lack of a
> stable release in approximately forever, there has been a lot of work
> since the last release. The lack of a (declared) stable release has
> been in part because of a lack of feedback on the development
> versions. In fact I posted a release candidate for mon 1.2.0 back in
> September (http://www.managedandmonitored.net/mon/) but I have
> received almost no feedback on this version. In many cases I assume
> the mon users just haven't had the opportunity to replace
> known-working systems or set up parallel monitoring infrastructure.

Well, it is good to hear that the project hasn't been abandoned. I think people will naturally assume that it has, not only because of the lack of a stable release, but because of the cosmetic, public-facing things: the kernel.org mon site has a few broken links, and the SourceForge mailing list archive either isn't keeping up or is slow on this thread, because the last message on there is over a year old.

All these things combined will most assuredly hurt the mon project's popularity and its ability to gather feedback. That being said, I will grab the latest CVS, stay on the list, and hopefully have some feedback for you. :)

Thank you very much for the reply, David.

--
Augie Schwer - Augie@Schwer.us - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
From: Jim T. <tr...@ar...> - 2007-03-16 17:57:55
On Thu, 8 Mar 2007, Augie Schwer wrote:

> Well, it is good to hear that the project hasn't been abandoned;

Not at all. I have a list of TODO items for fixes and feature enhancements, some of them as recent as a week or two ago.

> I think people will naturally assume that it has, not only because of
> the lack of a stable release, but because of the cosmetic,
> public-facing things: the kernel.org mon site has a few broken links,
> and the SourceForge mailing list archive either isn't keeping up or is
> slow on this thread, because the last message on there is over a year
> old.
>
> All these things combined will most assuredly hurt the mon project's
> popularity and its ability to gather feedback. That being said, I will
> grab the latest CVS, stay on the list, and hopefully have some
> feedback for you. :) Thank you very much for the reply, David.

Yes, feedback is very important, bad or good. Here's the TODO so far, which covers protocol changes to handle the per-host status:

-------------------------------

-no alerts for n mins

-better cookbook of examples, including some pre-fab m4 defines for templates, with a focus on the ability to quickly configure mon out-of-the-box for the most common setups

-period "templates"

> like I have to repeat my period definitions all 260 times, one for
> each watch. we should have templates in the Mon config file for any
> kind of object so it can be reused.

so do you mean a way to define a "template" for a period so that you don't need to keep rewriting "wd {Sun-Sat}", or so that it'll use some default period if you don't specify one, or what? i can see this working a bunch of different ways. like this?

    define period-template xyz
        period wd {Sun-Sat}
            alert mail.alert mi...@do...
            alert page.alert mis...@do...
            alertevery 1h

    watch something
        service something
            period template(xyz)

    watch somethingelse
        service something
            period template(xyz)
                # override the 1h
                alertevery 2h

-my recent thoughts on config management are that the parsing should be all modularized (keeping the config parsing code in a separate perl module to be reused by other apps), and there should be a way to turn the resulting data structure into xml and import the same back -- not so you can write your config by hand in xml, but so you can use some generic xml editing tool to mess around with the config, to get one type of gui.

-the most common things should be easiest to do, regardless of a gui or text file config. that is what makes stuff "easy". however, i don't think more complicated setups lend themselves to guis as much, and in complicated setups you have to invest a lot of time to learn how the tool works, so a fancy gui in that case is less of a payoff. this is for configuration, i mean. fancy guis for reporting and stuff are good, no doubt.

-global alert definitions with their own squelches (alertevery, etc.)

> also, alarms need to be collated so pagers and cell phones don't get
> buried with large numbers of alerts. I have a custom solution that I
> wrote for this, but it's a lousy solution since it essentially
> implements its own paging system.

i could see how it would be good to be able to define some alert destinations *outside* of the period definitions, then refer to them in the period definitions, and then you can do "collation" that way. like this:

    define global-alert xyz
        mail.alert xy...@lm...
        alertevery 1h

    watch
        service
            period
                globalalert xyz                  <--- collated globally

    watch
        service
            period
                globalalert xyz                  <--- collated globally
                alert mail.alert pd...@lm...     <--- not collated

that would be quite easy to do and i think very useful. you could apply all the same squelch knobs (alertevery, etc.) to the global ones.

-----

(from mon-1.2.0) $Id: TODO,v 1.2 2004/11/15 14:45:16 vitroth Exp $

-add a short "radius howto" to the doc/ directory.

-make traps authenticate via the same scheme used to obscure the password in RADIUS packets

-descriptions defined in mon.cf should be 'quoted'

-document command section and trap section in authfile

-finish support for receiving snmp traps

-output to client should be buffered and incorporated into the I/O loop. There is the danger that a sock_write to a client will block the server.

-finish muxpect

-make "chainable" alerts ?? i don't recall who asked for this or how it would work

-make alerts nonblocking, and handle them in a similar fashion to monitors, i.e., serialize per-service (or per-period) alerts.

-document "clear" client command

-Document trap authentication.

-Document traps.

-Make monitors parallelize their tasks, similar to fping.monitor. This is an important scalability problem.

-re-vamp the host disabling. 1) store them in a table with a timeout on each so that they can automatically re-enable themselves, so people don't forget to re-enable them manually. 2) don't do the disabling by "commenting" them out of the host groups. We still want them to be tested for failure, but just disable alerts that have to do with the disabled hosts. When a host is commented out, accept a "reason" field that is later accessible so that you can tell why someone disabled the host.

-allow checking a service at a particular time of day, maybe using inPeriod.

-maybe make a command that will disable an alert for a certain amount of time

-make it possible to disable just one of multiple alarms in a service

-make a logging facility which forks and execs external logging daemons and writes to them via some ipc such as a unix domain socket. mon should be sure that one of each type of these loggers is running at all times. configure the logging either globally or for each service. write both the success and failure status to the log in some "list opstatus" type format. each logger can do as it wishes with the data (e.g. stuff it into rrdtool, mysql, cat it to a file, etc.) (a sketch of such a logger follows this list)

    # global setting
    logger = file

    watch stuff
        service http
            logger file -p _LOGDIR_
            ...
        service fping
            # this will use the global logger setting
            ...
        service
            # this will override the global logger setting
            logger none
            ...

    common options to logger:
        -d dir     path to logging dir
        -f file    name of log file
        -g, -s     group, service
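[To make the logging proposal concrete, here is a minimal sketch of an external file logger along the lines described above. The -d/-f options and the one-record-per-line format are taken from the proposal; everything else, including the defaults, is an assumption, not working mon code:

    #!/usr/bin/perl
    #
    # file.logger (sketch): read "list opstatus"-style lines on stdin,
    # one status report per line, and append them to a log file.
    # mon would fork/exec this once and keep the pipe open.
    use strict;
    use warnings;
    use Getopt::Std;

    my %opt;
    getopts("d:f:", \%opt);                    # -d log dir, -f log file
    my $dir  = $opt{"d"} || "/var/log/mon";    # hypothetical defaults
    my $file = $opt{"f"} || "opstatus.log";

    open(my $log, ">>", "$dir/$file")
        or die "cannot open $dir/$file: $!";
    select($log); $| = 1; select(STDOUT);      # flush each record

    while (my $line = <STDIN>) {
        chomp($line);
        print $log scalar(localtime), " ", $line, "\n";
    }
]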
-----------

notes on a v2 protocol redesign from trockij

- Configuring on a hostgroup scheme works very well. In the beginning, mon was never intended to get this complex(tm); it was intended to be a tool where it was easy to whip up custom monitoring scripts and alert scripts and plug them into a framework which allowed them all to connect to each other, and to have a way to easily build custom clients and report generators as well.

- However, per-host status is needed now.

- This requires changes to both mon itself and to the monitors / alerts. Backward compatibility is important, and KISS is very important to retain the ease with which one can whip up a new monitor or alert or reporting client.

- There will be a new protocol for communicating with the monitors / alerts, which will be masked by a Mon::Monitor / Mon::Alert module in Perl. Appropriate shell functions will be provided by the first one who asks. See below for the protocol.

- We still want to retain the benefits of the old behaviour, but extend some alert management features, such as the ability to liberate alert definitions from the service periods so they can be used globally.

- The server code might be broken up into multiple files (I/O routines, config parser, related parts, etc.)

- Monitors can communicate better with the alerts (see below). For example, the monitor might hint the mail.alert (using "a_mail_list") about where else to send a warning that a user dir goes over quota. (Attention should be paid to privacy, so that we don't accidentally inform all users that /home/foo/how-i-will-destroy-western-civilization/ is consuming 1GB too much space ;)

- Associations: these allow monitors to communicate details about failures back to the server, which can be used to specify who to alert. The associations are based on key/value pairs specified in the association config file, and are expanded on the alert command line (or possibly within the alert protocol) if "@assoc-*" is in the configuration. If a host assoc. is needed, an alert spec will look like:

      alert mail.alert ad...@xy... @assoc-host

  There are two association types (possibly more in the future): host associations and user-defined associations. Host associations use the "assoc-host" specifier, and map one or more usernames to an individual host. User-defined associations are just that, and begin with the "assoc-u-" specifier. Monitors return associations via the "assoc-*" key in the monitor protocol. Alerts obtain association information either via command-line arguments which were expanded by the server from "@assoc-*" in the config file, or via the "assoc-*" key in the alert protocol.

- Metrics are only passed to the mon server for "monitoring" purposes, but can be marked up in such a way that they could easily be piped to a logging utility, one which is not part of the mon process itself. Monitors are _encouraged_ to collect and report performance data. "Failures" are basically just a conclusion based upon performance data, and it makes no sense to collect the data twice: e.g., if you have mon polling ifInOctets.0 on a system, why should mrtg have to poll on its own? It may be desirable to propose a "unified logging system" which all monitors can easily use, something which is pluggable and extensible.

- The hostgroup syntax is going to be extended to add per-host options (which will be passed to the monitors / alerts using the new protocol):

      ns1.teuto.net( fs="/(80%,90%)", mail_list="lm...@te..." )

  would be passed as "h_fs=/(80%,90%)" and "h_mail_list=lm...@te...".

FLOATING MONITORS

A floating monitor is started by mon and remains running the entire time. If it dies, it is automatically restarted. The server forks off a separate process for fping and communicates with it via some IPC, like a named pipe or a socket. The floating monitor sits there waiting for a message from the server that says "start checking now". The server then adds this descriptor to %fhandles and %running and treats it similarly to other forked monitors. When the floating monitor is done, it spits its output back to the server and then goes dormant again, awaiting another message from the server. Floating monitors are started when mon starts, and are restarted if mon notices that they have gone away. This is a way to save on fork() overhead, but also to avoid paying a monitor's startup cost on every polling interval.
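[A floating monitor's dormant/active loop might look something like this -- a minimal sketch of the idea above, not an implemented interface; the "start" trigger message and the result format are assumptions:

    #!/usr/bin/perl
    #
    # floating.monitor (sketch): stay resident, wait for the server to
    # say "start", run one round of checks, report, and go dormant.
    use strict;
    use warnings;

    $| = 1;    # report results to the server without buffering delay

    while (my $cmd = <STDIN>) {        # IPC via a pipe from the mon server
        chomp($cmd);
        next unless $cmd eq "start";   # hypothetical trigger message

        my @failed = run_checks();     # whatever this monitor tests

        # one summary line back to the server, like a forked monitor's
        # output; a blank line marks the end of this round
        print @failed ? "fail @failed\n" : "ok\n";
        print "\n";
    }

    sub run_checks {
        # placeholder: a real monitor would ping hosts, poll SNMP, etc.
        return ();
    }
]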
A "\n" on a line by itself indicates the start of a new block. The order of the keys should not be important. The first block will always contain metadata further defining the following blocks. The "version" key is always present. The current protocol version is "1". (In the examples, everything after a "#" is a comment and should be cut out) KEY CONVENTIONS Keys only private to monitors will be prefixed with an "m_". In the same vain, keys private to alerts will be prefixed with a "a_", and additional host option keys specified in the mon.cf file will be prefixed with a "h_" before being passed to monitors/alerts. By convention, flags only pertaining to a specific alert will embed that name in the key name too - ie keys only pertaining to "mail.alert" will start with "a_mail_". The key/values pairs will be passed to all processes for a specific service. "h_" are static between invocations as they come from the mon.cf file. "m_" keys will be preserved between multiple monitor executions. "a_" keys will be passed from the monitor to the alert script. MONITOR PROTOCOL (monitor -> mon) The metadata block is followed by a block describing the overall hostgroup status, followed by a detailled status for each host. The following keys are defined for the blocks: "summary" = contains a one line short summary of the status. "status" = up, fail, ignore "metric_1" = an opaque floating point number which can be referenced for triggering alerts. May try to give an "operational percentage". More than one metric may be returned. (Ping rtt, packet loss, disk space etc) "description" = longer elaborate description of the current status. "host" = hostgroup member to which this status applies. The overall hostgroup status does not include this field. "assoc-host" = host association "assoc-u-*" = user-defined association Here is an example for a hypothetical hostgroup with 2 hosts and the ping service. ### version=1 summary=Still alive. metric_1=50 # Packetloss metric_2=20.23 # rtt times description=1 out of 2 hosts still responding. > Whatever else one might want to say about the status. It is difficult to > come up with a good text here so I will just babble. status=up host=foo.bar.com metric_1=100 metric_2=0 # 100% packet loss make rtt measurements difficult ;) summary=ICMP unreachable from 2.2.2.2 status=fail description=PING 2.2.2.2 (2.2.2.2): 56 data bytes > >--- 2.2.2.2 ping statistics --- >23 packets transmitted, 0 packets received, 100% packet loss metric_1=0 metric_2=52.1 summary=ICMP echo reply received ok status=up description=64 bytes from 212.8.197.2: icmp_seq=0 ttl=60 time=110.0 ms >64 bytes from 212.8.197.2: icmp_seq=1 ttl=60 time=32.3 ms >64 bytes from 212.8.197.2: icmp_seq=2 ttl=60 time=32.8 ms >64 bytes from 212.8.197.2: icmp_seq=3 ttl=60 time=33.4 ms > >--- ns1.teuto.net ping statistics --- >4 packets transmitted, 4 packets received, 0% packet loss >round-trip min/avg/max = 32.3/52.1/110.0 ms host=baz.bar.com ###### Points still open: - mon -> monitor communication - mon <-> alert communication - the new trap protocol - muxpect - a unified logging proposal |