[Mon-commit] mon TODO,1.2,1.2.2.1
From: Jim T. <tr...@us...> - 2007-06-27 11:51:23
Update of /cvsroot/mon/mon
In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv19963

Modified Files:
      Tag: mon-1-2-branch
        TODO
Log Message:

Index: TODO
===================================================================
RCS file: /cvsroot/mon/mon/TODO,v
retrieving revision 1.2
retrieving revision 1.2.2.1
diff -C2 -d -r1.2 -r1.2.2.1
*** TODO 15 Nov 2004 14:45:16 -0000 1.2
--- TODO 27 Jun 2007 11:51:17 -0000 1.2.2.1
***************
*** 1,2 ****
--- 1,97 ----
+ -implement trap delivery for "redistribute" in the mon server itself as an
+  option. retain the "call script" behavior, but maybe specify internal
+  trap delivery via "redistribute -h hostname [hostname...]". also allow
+  multiple redistribute lines to build a list of scripts to call
+
+ -deliver traps with acknowledgement via tcp
+
+ -add protocol commands to dump entire status + configuration in one operation
+  to reduce latency (not so many serialized get/response operations just to
+  get status)
+
+ -no alerts for n mins
+
+ -better cookbook of examples, including some pre-fab m4 defines for templates
+  with focus on the ability to quickly configure mon out-of-the-box for
+  the most common setups
+
+ -period "templates"
+  > like I have to repeat my period definitions all 260 times, one for
+  > each watch. we should have templates in the Mon config file for any
+  > kind of object so it can be reused.
+
+  so do you mean a way to define a "template" for a period so that
+  you don't need to keep rewriting "wd {Sun-Sat}", or so that it'll use
+  some default period if you don't specify one, or what? i can see this
+  working a bunch of different ways.
+
+
+  like this?
+
+  define period-template xyz
+      period wd {Sun-Sat}
+          alert mail.alert mi...@do...
+          alert page.alert mis...@do...
+          alertevery 1h
+
+
+  watch something
+      service something
+          period template(xyz)
+
+  watch somethingelse
+      service something
+          period template(xyz)
+          # override the 1h
+          alertevery 2h
+
+
+ -my recent thoughts on config management are that the parsing should be
+  all modularized (keeping the config parsing code in a separate
+  perl module to be reused by other apps),
+  and there should be a way to turn the resulting data
+  structure into xml and import the same back, not so you can write
+  your config by hand in xml, but so you can use some generic xml editing
+  tool to mess around with the config, to get one type of gui.
+
+ -the most common things should be easiest to do, regardless of
+  a gui or text file config. that is what makes stuff "easy". however,
+  i don't think more complicated setups lend themselves to guis as much,
+  and in complicated setups you have to invest a lot of time to learn how
+  the tool works, and a fancy gui in that case is less of a payoff.
+  this is for configuration, i mean. fancy guis for reporting and stuff
+  are good, no doubt.
+
+ -global alert definitions with their own squelches (alertevery, etc.)
+  > also, alarms need to be collated so pagers and cell phones don't get
+  > buried with large numbers of alerts. I have a custom solution that I
+  > wrote for this, but it's a lousy solution since it essentially implements
+  > its own paging system.
+
+  i could see how it would be good to be able to define some alert
+  destinations *outside* of the period definitions, then refer to them
+  in the period definitions, then you can do "collation" that way. like
+  this:
+
+  define global-alert xyz mail.alert xy...@lm...
+      alertevery 1h
+
+  watch
+      service
+          period
+              globalalert xyz <---collated globally
+
+  watch
+      service
+          period
+              globalalert xyz <---collated globally
+              alert mail.alert pd...@lm... <---not collated
+
+
+  that would be quite easy to do and i think very useful.
+  you could apply all the same squelch knobs (alertevery, etc.) to the
+  global ones.
+
+ -----
+ (from mon-1.2.0)
  $Id$

***************
*** 18,21 ****
--- 113,117 ----

  -make "chainable" alerts
+  ?? i don't recall who asked for this or how it would work

  -make alerts nonblocking, and handle them in a similar fashion to
***************
*** 28,39 ****
  -Document traps.

- -fix client opstatus parsing by converting clients to use Mon::Client
-
  -Make monitors parallelize their tasks, similar to fping.monitor. This is
   an important scalability problem.

- -make changes to tkined so that it can query a mon server and
-  update the graphical map accordingly.
-
  -re-vamp the host disabling. 1) store them in a table with a timeout on
   each so that they can automatically re-enable themselves so
--- 124,130 ----
***************
*** 50,54 ****

  -maybe make a command that will disable an alert for a certain amount
!  of time (maybe implement this as an at(1) job??)

  -make it possible to disable just one of multiple alarms in a service
--- 141,145 ----

  -maybe make a command that will disable an alert for a certain amount
!  of time

  -make it possible to disable just one of multiple alarms in a service
***************
*** 84,85 ****
--- 175,369 ----
  -g, -s group, service

+ -----------
+ notes on a v2 protocol redesign from trockij
+
+ - Configuring on a hostgroup scheme works very well. In the beginning, mon was
+   never intended to get this complex(tm), it was intended to be a tool
+   where it was easy to whip up custom monitoring scripts and alert scripts
+   and plug them into a framework which allowed them all to connect to each
+   other, and to have a way to easily build custom clients and report
+   generators as well.
+
+ - However, per host status is needed now.
+
+ - This requires changes to both mon itself and also the monitors / alerts.
+
+   Backward compatibility is important, and KISS is very important to
+   retain the ease with which one can whip up a new monitor or alert or
+   reporting client.
+
+ - There will be a new protocol for communicating with the monitors / alerts,
+   which will be masked by a Mon::Monitor / Mon::Alert module in Perl.
+   Appropriate shell functions will be provided by the first one who asks.
+   See below for the protocol.
+
+ - We still want to retain the benefits of the old behaviour, but extend
+   some alert management features, such as the ability to liberate
+   alert definitions from the service periods so they can be used globally.
+
+ - The server code might be broken up into multiple files (I/O routines, config
+   parser, related parts, etc)
+
+ - monitors can communicate better with the alerts (see below). For example,
+   the monitor might hint (using "a_mail_list") the mail.alert about where else
+   to send a warning that a user dir goes over quota.
+   (Attention should be paid to privacy that we don't accidentally inform
+   all users that /home/foo/how-i-will-destroy-western-civilization/
+   is consuming 1GB too much space ;)
+
+ - Associations: these allow monitors to communicate details
+   about failures back to the server which can be used to specify who
+   to alert.
+
+   The associations are based on key/value pairs specified in the
+   association config file, and are expanded on the alert command line
+   (or possibly within the alert protocol) if "@assoc-*" is in the
+   configuration. If a host assoc. is needed, an alert spec will look like:
+
+       alert mail.alert ad...@xy... @assoc-host
+
+   There are two association types (possibly more in the future): host
+   associations, and user-defined associations. Host associations use the
+   "assoc-host" specifier, and map one or more usernames to an individual
+   host. User-defined associations are just that, and begin with the
+   "assoc-u-" specifier.
+
+   Monitors return associations via the "assoc-*" key in the monitor
+   protocol.
+
+   Alerts obtain association information either via command-line arguments
+   which were expanded by the server from "@assoc-*" in the config file,
+   or via the "assoc-*" key in the alert protocol.
+
+ - Metrics are only passed to the mon server for "monitoring" purposes, but can
+   be marked up in such a way that they could be easily piped to a logging
+   utility, one which is not part of the mon process itself.
+   monitors are _encouraged_ to collect and report performance data.
+
+   "Failures" are basically just a conclusion based upon performance data and
+   it makes no sense to collect the data twice, e.g. if you have mon polling
+   ifInOctets.0 on a system, why should mrtg have to poll on its own?
+
+   It may be desirable to propose a "unified logging system" which all
+   monitors can easily use, something which is pluggable and extensible
+
+ - The hostgroup syntax is going to be extended to add per host options. (which
+   will be passed to the monitors / alerts using the new protocol)
+       ns1.teuto.net( fs="/(80%,90%)",mail_list="lm...@te..." )
+   would be passed as "h_fs=/(80%,90%)" and "h_mail_list="lm...@te..."
+
+ FLOATING MONITORS
+
+ A floating monitor is started by mon and remains running for the entire time.
+ If it dies, it is automatically restarted.
+
+ The server forks off a separate process for fping and communicates with
+ it via some IPC, like a named pipe or a socket or something. The floating
+ monitor sits there waiting for a message from the server that says "start
+ checking now". The server then adds this descriptor to %fhandles and %running
+ and treats it similarly to other forked monitors. When the floating monitor is
+ done, it spits its output back to the server and then goes dormant again,
+ awaiting another message from the server. Floating monitors are started
+ when mon starts, and are restarted if mon notices that they go away.
+ This is a way to save on fork() overhead, but to also
+
+ PROTOCOL
+
+ The protocol will be simple and ASCII based, in the form of "key=value". Line
+ continuation will be provided by prefixing following lines with a ">". A "\n"
+ on a line by itself indicates the start of a new block.
+
+ The order of the keys should not be important.
+
+ The first block will always contain metadata further defining the following
+ blocks. The "version" key is always present.
+
+ The current protocol version is "1".
+
+ (In the examples, everything after a "#" is a comment and should be cut out)
+
+ KEY CONVENTIONS
+
+ Keys only private to monitors will be prefixed with an "m_". In the same
+ vein, keys private to alerts will be prefixed with an "a_", and additional
+ host option keys specified in the mon.cf file will be prefixed with an "h_"
+ before being passed to monitors/alerts.
+
+ By convention, flags only pertaining to a specific alert will embed that name
+ in the key name too - ie keys only pertaining to "mail.alert" will start with
+ "a_mail_".
+
+ The key/value pairs will be passed to all processes for a specific service.
+ "h_" keys are static between invocations as they come from the mon.cf file.
+ "m_" keys will be preserved between multiple monitor executions. "a_" keys
+ will be passed from the monitor to the alert script.
+
+
+ MONITOR PROTOCOL (monitor -> mon)
+
+ The metadata block is followed by a block describing the overall hostgroup
+ status, followed by a detailed status for each host.
+
+ The following keys are defined for the blocks:
+   "summary" = contains a one line short summary of the status.
+   "status" = up, fail, ignore
+   "metric_1" = an opaque floating point number which can be referenced for
+       triggering alerts. May try to give an "operational percentage".
+       More than one metric may be returned.
+       (Ping rtt, packet loss, disk space etc)
+   "description" = longer elaborate description of the current status.
+   "host" = hostgroup member to which this status applies. The overall
+       hostgroup status does not include this field.
+   "assoc-host" = host association
+   "assoc-u-*" = user-defined association
+
+ Here is an example for a hypothetical hostgroup with 2 hosts and the ping
+ service.
+
+ ###
+ version=1
+
+ summary=Still alive.
+ metric_1=50 # Packetloss
+ metric_2=20.23 # rtt times
+ description=1 out of 2 hosts still responding.
+ > Whatever else one might want to say about the status. It is difficult to
+ > come up with a good text here so I will just babble.
+ status=up
+
+ host=foo.bar.com
+ metric_1=100
+ metric_2=0 # 100% packet loss makes rtt measurements difficult ;)
+ summary=ICMP unreachable from 2.2.2.2
+ status=fail
+ description=PING 2.2.2.2 (2.2.2.2): 56 data bytes
+ >
+ >--- 2.2.2.2 ping statistics ---
+ >23 packets transmitted, 0 packets received, 100% packet loss
+
+ metric_1=0
+ metric_2=52.1
+ summary=ICMP echo reply received ok
+ status=up
+ description=64 bytes from 212.8.197.2: icmp_seq=0 ttl=60 time=110.0 ms
+ >64 bytes from 212.8.197.2: icmp_seq=1 ttl=60 time=32.3 ms
+ >64 bytes from 212.8.197.2: icmp_seq=2 ttl=60 time=32.8 ms
+ >64 bytes from 212.8.197.2: icmp_seq=3 ttl=60 time=33.4 ms
+ >
+ >--- ns1.teuto.net ping statistics ---
+ >4 packets transmitted, 4 packets received, 0% packet loss
+ >round-trip min/avg/max = 32.3/52.1/110.0 ms
+ host=baz.bar.com
+ ######
+
+
+ Points still open:
+ - mon -> monitor communication
+
+ - mon <-> alert communication
+
+ - the new trap protocol
+
+ - muxpect
+
+ - a unified logging proposal
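
The block format described under PROTOCOL and MONITOR PROTOCOL could be read with something like the following. This is only a rough sketch in Python, not part of mon: the name parse_blocks and the strip_comments switch are made up for illustration, and treating every "#" as the start of a comment follows the convention stated for the examples, which a real monitor stream may not want.

```python
def parse_blocks(text, strip_comments=True):
    """Split monitor output into a list of dicts, one per block.

    Assumptions taken from the notes: lines are "key=value", a line
    starting with ">" continues the previous value, and an empty line
    starts a new block. Key order inside a block is not significant.
    """
    blocks, current, last_key = [], {}, None
    for raw in text.splitlines():
        # In the examples everything after "#" is a comment.
        line = raw.split("#", 1)[0] if strip_comments else raw
        line = line.strip()
        if not line:
            # Empty line: close the block collected so far, if any.
            if current:
                blocks.append(current)
            current, last_key = {}, None
            continue
        if line.startswith(">"):
            # Continuation: append to the value of the previous key.
            if last_key is None:
                raise ValueError("continuation line without a preceding key")
            current[last_key] += "\n" + line[1:]
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {line!r}")
        last_key = key.strip()
        current[last_key] = value
    if current:
        blocks.append(current)
    return blocks


SAMPLE = """\
version=1

summary=Still alive.
metric_1=50     # Packetloss
description=1 out of 2 hosts still responding.
> Whatever else one might want to say.
status=up

host=foo.bar.com
metric_1=100
status=fail
"""

# First block is metadata, second the hostgroup summary, the rest per host.
meta, group, *hosts = parse_blocks(SAMPLE)
print(meta["version"], group["status"], [h["host"] for h in hosts])
```

All values stay strings here; converting "metric_*" values to floats or checking "status" against up/fail/ignore is left out on purpose, since the notes do not pin those rules down yet.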