[Mon-commit] mon TODO,1.2,1.2.2.1

Update of /cvsroot/mon/mon
In directory sc8-pr-cvs16.sourceforge.net:/tmp/cvs-serv19963

Modified Files:
      Tag: mon-1-2-branch
	TODO 
Log Message:


Index: TODO
===================================================================
RCS file: /cvsroot/mon/mon/TODO,v
retrieving revision 1.2
retrieving revision 1.2.2.1
diff -C2 -d -r1.2 -r1.2.2.1
*** TODO	15 Nov 2004 14:45:16 -0000	1.2
--- TODO	27 Jun 2007 11:51:17 -0000	1.2.2.1
***************
*** 1,2 ****
--- 1,97 ----
+ -implement trap delivery for "redistribute" in the mon server itself as an
+  option. retain the "call script" behavior, but maybe specify internal
+  trap delivery via "redistribute -h hostname [hostname...]". also allow
+  multiple redistribute lines to build a list of scripts to call
+ 
+ -deliver traps with acknowledgement via tcp
+ 
+ -add protocol commands to dump entire status + configuration in one operation
+  to reduce latency (not so many serialized get/response operations just to
+  get status)
+ 
+ -no alerts for n mins
+ 
+ -better cookbook of examples, including some pre-fab m4 defines for templates
+  with focus on the ability to quickly configure mon out-of-the-box for
+  the most common setups
+ 
+ -period "templates"
+     > like I have to repeat my period definitions all 260 times, one for
+     > each watch.  we should have templates in the Mon config file for any
+     > kind of object so it can be reused.
+ 
+     so do you mean a way to define a "template" for a period so that
+     you don't need to keep rewriting "wd {Sun-Sat}", or so that it'll use
+     some default period if you don't specify one, or what? i can see this
+     working a bunch of different ways.
+ 
+ 
+     like this?
+ 
+     define period-template xyz
+ 	period wd {Sun-Sat}
+ 		 alert mail.alert mi...@do...
+ 		 alert page.alert mis...@do...
+ 		 alertevery 1h
+ 
+ 
+     watch something
+ 	 service something
+ 	    period template(xyz)
+ 
+     watch somethingelse
+ 	 service something
+ 	    period template(xyz)
+ 		# override the 1h
+ 		alertevery 2h
+ 
+ 
+ -my recent thoughts on config management are that the parsing should be
+  all modularized (i.e. keeping the config parsing code in a separate
+  perl module to be reused by other apps),
+  and there should be a way to turn the resulting data
+  structure into xml and import the same back, not so you can write
+  your config by hand in xml, but so you can use some generic xml editing
+  tool to mess around with the config, to get one type of gui.
+ 
+ -the most common things should be easiest to do, regardless of
+  a gui or text file config. that is what makes stuff "easy". however,
+  i don't think more complicated setups lend themselves to guis as much,
+  and in complicated setups you have to invest a lot of time to learn how
+  the tool works, and a fancy gui in that case is less of a payoff.
+  this is for configuration, i mean. fancy guis for reporting and stuff
+  are good, no doubt.
+ 
+ -global alert definitions with their own squelches (alertevery, etc.)
+  > also, alarms need to be collated so pagers and cell phones don't get
+  > buried with large numbers of alerts.  I have a custom solution that I
+  > wrote for this, but it's a lousy solution since it essentially implements
+  > its own paging system.
+ 
+  i could see how it would be good to be able to define some alert
+  destinations *outside* of the period definitions, then refer to them
+  in the period definitions, then you can do "collation" that way. like
+  this:
+ 
+     define global-alert xyz mail.alert xy...@lm...
+ 	 alertevery 1h
+ 
+     watch
+        service
+ 	 period
+ 	   globalalert xyz     <---collated globally
+ 
+     watch
+        service
+ 	 period
+ 	   globalalert xyz     <---collated globally
+ 	   alert mail.alert pd...@lm...   <---not collated
+ 
+ 
+ that would be quite easy to do and i think very useful. you could
+ apply all the same squelch knobs (alertevery, etc.) to the global ones.
+ 
+ -----
+ (from mon-1.2.0)
  $Id$
  
***************
*** 18,21 ****
--- 113,117 ----
  
  -make "chainable" alerts
+  ?? i don't recall who asked for this or how it would work
  
  -make alerts nonblocking, and handle them in a similar fashion to
***************
*** 28,39 ****
  -Document traps.
  
- -fix client opstatus parsing by converting clients to use Mon::Client
- 
  -Make monitors parallelize their tasks, similar to fping.monitor. This
   is an important scalability problem.
  
- -make changes to tkined so that it can query a mon server and
-  update the graphical map accordingly.
- 
  -re-vamp the host disabling. 1) store them in a table with a timeout
   on each so that they can automatically re-enable themselves so
--- 124,130 ----
***************
*** 50,54 ****
  
  -maybe make a command that will disable an alert for a certain amount
!  of time (maybe implement this as an at(1) job??)
  
  -make it possible to disable just one of multiple alarms in a service
--- 141,145 ----
  
  -maybe make a command that will disable an alert for a certain amount
!  of time
  
  -make it possible to disable just one of multiple alarms in a service
***************
*** 84,85 ****
--- 175,369 ----
      -g, -s	group, service
  
+ -----------
+ notes on a v2 protocol redesign from trockij
+ 
+ - Configuring on a hostgroup scheme works very well. In the beginning, mon was
+   never intended to get this complex(tm), it was intended to be a tool
+   where it was easy to whip up custom monitoring scripts and alert scripts
+   and plug them into a framework which allowed them all to connect to each
+   other, and to have a way to easily build custom clients and report
+   generators as well.
+ 
+ - However, per host status is needed now.
+ 
+ - This requires changes to both mon itself and also the monitors / alerts.
+   
+   Backward compatibility is important, and KISS is very important to
+   retain the ease with which one can whip up a new monitor or alert or reporting
+   client.
+ 
+ - There will be a new protocol for communicating with the monitors / alerts,
+   which will be masked by a Mon::Monitor / Mon::Alert module in Perl.
+   Appropriate shell functions will be provided by the first one who asks.
+   See below for the protocol.
+ 
+ - We still want to retain the benefits of the old behaviour, but extend
+   some alert management features, such as the ability to liberate
+   alert definitions from the service periods so they can be used globally.
+ 
+ - The server code might be broken up into multiple files (I/O routines, config
+   parser, related parts, etc)
+ 
+ - monitors can communicate better with the alerts (see below). For example,
+   the monitor might hint (using "a_mail_list") the mail.alert about where else
+   to send a warning when a user dir goes over quota.
+   (Attention should be paid to privacy, so that we don't accidentally inform
+   all users that /home/foo/how-i-will-destroy-western-civilization/
+   is consuming 1GB too much space ;)
+ 
+ - Associations: these allow monitors to communicate details
+   about failures back to the server which can be used to specify who
+   to alert.
+ 
+   The associations are based on key/value pairs specified in the
+   association config file, and are expanded on the alert command line
+   (or possibly within the alert protocol) if "@assoc-*" is in the
+   configuration. If a host assoc. is needed, an alert spec will look like:
+ 
+     alert mail.alert ad...@xy... @assoc-host
+ 
+   There are two association types (possibly more in the future): host
+   associations, and user-defined associations.  Host associations use the
+   "assoc-host" specifier, and map one or more usernames to an individual
+   host. User-defined associations are just that, and begin with the
+   "assoc-u-" specifier.
+ 
+   Monitors return associations via the "assoc-*" key in the monitor
+   protocol.
+ 
+   Alerts obtain association information either via command-line arguments
+   which were expanded by the server from "@assoc-*" in the config file,
+   or via the "assoc-*" key in the alert protocol.
+ 
+ - Metrics are only passed to the mon server for "monitoring" purposes, but can
+   be marked up in such a way that they could be easily piped to a logging
+   utility, one which is not part of the mon process itself.
+   Monitors are _encouraged_ to collect and report performance data.
+ 
+   "Failures" are basically just a conclusion based upon performance data and
+   it makes no sense to collect the data twice, e.g. if you have mon polling
+   ifInOctets.0 on a system, why should mrtg have to poll it on its own?
+ 
+   It may be desirable to propose a "unified logging system" which all
+   monitors can easily use, something which is pluggable and extensible.
+ 
+ - The hostgroup syntax is going to be extended to add per-host options (which
+   will be passed to the monitors / alerts using the new protocol). For example,
+   ns1.teuto.net( fs="/(80%,90%)",mail_list="lm...@te..." )
+   would be passed as "h_fs=/(80%,90%)" and "h_mail_list=lm...@te...".
+   
+ FLOATING MONITORS
+ 
+ A floating monitor is started by mon and remains running for as long as mon runs.
+ If it dies, it is automatically restarted.
+ 
+ The server forks off a separate process for fping and communicates with
+ it via some IPC, like a named pipe or a socket or something. The floating
+ monitor sits there waiting for a message from the server that says "start
+ checking now". The server then adds this descriptor to %fhandles and %running
+ and treats it similarly to other forked monitors. When the floating monitor is
+ done, it spits its output back to the server and then goes dormant again,
+ awaiting another message from the server. Floating monitors are started
+ when mon starts, and are restarted if mon notices that they go away. This
+ is a way to save on fork() overhead, but to also
+ 
+ PROTOCOL
+ 
+ The protocol will be simple and ASCII based, in the form of "key=value". Line
+ continuation will be provided by prefixing following lines with a ">". A "\n"
+ on a line by itself indicates the start of a new block.
+ 
+ The order of the keys should not be important.
+ 
+ The first block will always contain metadata further defining the following
+ blocks. The "version" key is always present.
+ 
+ The current protocol version is "1".
+ 
+ (In the examples, everything after a "#" is a comment and should be cut out)
+ 
+ KEY CONVENTIONS
+ 
+ Keys private to monitors will be prefixed with an "m_". In the same
+ vein, keys private to alerts will be prefixed with an "a_", and additional
+ host option keys specified in the mon.cf file will be prefixed with an "h_"
+ before being passed to monitors/alerts.
+ 
+ By convention, flags only pertaining to a specific alert will embed that name
+ in the key name too - i.e. keys only pertaining to "mail.alert" will start with
+ "a_mail_".
+ 
+ The key/value pairs will be passed to all processes for a specific service.
+ "h_" keys are static between invocations as they come from the mon.cf file. "m_"
+ keys will be preserved between multiple monitor executions. "a_" keys will be
+ passed from the monitor to the alert script.
+ 
+ 
+ MONITOR PROTOCOL (monitor -> mon)
+ 
+ The metadata block is followed by a block describing the overall hostgroup
+ status, followed by a detailed status for each host.
+ 
+ The following keys are defined for the blocks:
+ "summary" = contains a short one-line summary of the status.
+ "status"  = up, fail, ignore
+ "metric_1"  = an opaque floating point number which can be referenced for
+             triggering alerts. May try to give an "operational percentage".
+ 	    More than one metric may be returned.
+ 	    (Ping rtt, packet loss, disk space etc)
+ "description" = longer, more elaborate description of the current status.
+ "host"        = hostgroup member to which this status applies. The overall
+                 hostgroup status does not include this field.
+ "assoc-host"  = host association
+ "assoc-u-*"   = user-defined association
+ 
+ Here is an example for a hypothetical hostgroup with 2 hosts and the ping
+ service.
+ 
+ ###
+ version=1
+ 
+ summary=Still alive.
+ metric_1=50 # Packetloss
+ metric_2=20.23 # rtt times
+ description=1 out of 2 hosts still responding.
+ > Whatever else one might want to say about the status. It is difficult to
+ > come up with a good text here so I will just babble.
+ status=up
+ 
+ host=foo.bar.com
+ metric_1=100
+ metric_2=0 # 100% packet loss makes rtt measurements difficult ;)
+ summary=ICMP unreachable from 2.2.2.2
+ status=fail
+ description=PING 2.2.2.2 (2.2.2.2): 56 data bytes
+ >
+ >--- 2.2.2.2 ping statistics ---
+ >23 packets transmitted, 0 packets received, 100% packet loss
+ 
+ metric_1=0
+ metric_2=52.1
+ summary=ICMP echo reply received ok
+ status=up
+ description=64 bytes from 212.8.197.2: icmp_seq=0 ttl=60 time=110.0 ms
+ >64 bytes from 212.8.197.2: icmp_seq=1 ttl=60 time=32.3 ms
+ >64 bytes from 212.8.197.2: icmp_seq=2 ttl=60 time=32.8 ms
+ >64 bytes from 212.8.197.2: icmp_seq=3 ttl=60 time=33.4 ms
+ >
+ >--- ns1.teuto.net ping statistics ---
+ >4 packets transmitted, 4 packets received, 0% packet loss
+ >round-trip min/avg/max = 32.3/52.1/110.0 ms
+ host=baz.bar.com
+ ######
+ 
+ 
+ Points still open:
+ - mon -> monitor communication
+ 
+ - mon <-> alert communication
+ 
+ - the new trap protocol
+ 
+ - muxpect
+ 
+ - a unified logging proposal
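
Not part of the commit: the block format described under PROTOCOL and
MONITOR PROTOCOL in the diff above ("key=value" lines, ">" continuation
lines, a blank line starting a new block, and a first metadata block that
carries "version") can be illustrated with a small parser. This is only a
sketch under stated assumptions, not mon's implementation. It is written in
Python rather than Perl, the name parse_blocks and the strip_comments flag
are hypothetical, and dropping one optional space after the ">" is a guess
based on the example output. Comment stripping is off by default because the
notes say "#" comments belong to the examples only.

```python
def parse_blocks(text, strip_comments=False):
    """Split monitor output into (metadata, blocks), each a dict of key -> value.

    Hypothetical helper: the name, the flag and the error handling are
    assumptions made for this sketch, not something the notes define.
    """
    blocks = []
    current = {}
    last_key = None
    for raw in text.splitlines():
        # The notes say "#" comments appear only in examples, so this is opt-in.
        line = raw.split("#", 1)[0].rstrip() if strip_comments else raw.rstrip()
        if line.startswith(">"):
            # Continuation line: append to the value of the previous key.
            if last_key is None:
                raise ValueError("continuation line without a preceding key")
            rest = line[1:]
            if rest.startswith(" "):
                rest = rest[1:]  # assumption: one space after ">" is cosmetic
            current[last_key] += "\n" + rest
            continue
        if not line.strip():
            # A blank line ends the current block; empty blocks are skipped.
            if current:
                blocks.append(current)
            current, last_key = {}, None
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: %r" % line)
        last_key = key.strip()
        current[last_key] = value.strip()
    if current:
        blocks.append(current)
    if not blocks or "version" not in blocks[0]:
        raise ValueError("first block must carry the 'version' key")
    return blocks[0], blocks[1:]


# Shortened version of the example from the notes.
sample = """\
###
version=1

summary=Still alive.
metric_1=50 # Packetloss
description=1 out of 2 hosts still responding.
> Whatever else one might want to say.
status=up

host=foo.bar.com
status=fail
######
"""

meta, blocks = parse_blocks(sample, strip_comments=True)
print(meta["version"], [b.get("status") for b in blocks])
```

Because key order is declared unimportant, each block is kept as a plain
dictionary; the hostgroup summary is simply the block without a "host" key.
Note that with strip_comments enabled, a line consisting only of a comment
(such as the "###" markers) is treated like a blank line and therefore ends
the current block.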