skummee Code
Brought to you by:
cunnijd
File | Date | Author | Commit |
---|---|---|---|
contrib | 2014-04-14 | cunnijd | [r4] Add php, update README. |
AUTHORS | 2010-11-29 | cunnijd | [r1] First revision. |
COPYING | 2010-11-29 | cunnijd | [r1] First revision. |
DISCLAIMER | 2010-11-29 | cunnijd | [r1] First revision. |
META | 2014-04-14 | cunnijd | [r3] New release. |
Make-rpm.mk | 2010-11-29 | cunnijd | [r1] First revision. |
Makefile | 2014-04-14 | cunnijd | [r3] New release. |
README | 2014-04-14 | cunnijd | [r4] Add php, update README. |
get_ipmi.c | 2010-11-29 | cunnijd | [r1] First revision. |
get_snmp.c | 2010-11-29 | cunnijd | [r1] First revision. |
get_symbol.cpp | 2014-04-14 | cunnijd | [r3] New release. |
skummee.c | 2014-04-14 | cunnijd | [r3] New release. |
skummee.h | 2014-04-14 | cunnijd | [r3] New release. |
skummee.spec | 2014-04-14 | cunnijd | [r3] New release. |
INTRODUCTION

skummee is a package created for "monitoring" hosts in a large-scale environment, mostly with SNMP. Monitoring in this context includes threshold status checking/reporting and historical trending. skummee was designed with large clusters in mind, taking advantage of the inherent parallel nature of many similarly configured nodes, although this tool is able to function in heterogeneous environments as well. skummee implements parallelism with process threads, where each host is assigned a process for data gathering and storage.

skummee stores its data in two forms: in MySQL tables indicating host/variable status, and in Round Robin Database (RRD) files for historical trending. This allows for quick access to overall machine status (MySQL) and comprehensive/graphical trending (RRD), both of which have PHP interfaces for web display.

INSTALLATION

skummee relies on several packages, so certain libraries must be installed before skummee is compiled. Listed here are the required libraries that are usually not part of a standard development environment.

  MySQL:                                      http://www.mysql.com/
  Net-SNMP:                                   http://net-snmp.sourceforge.net/
  RRD Tool:                                   http://oss.oetiker.ch/rrdtool/
  SSL:                                        http://www.openssl.org/
  PROCPS:                                     http://procps.sourceforge.net/
  FreeIPMI (optional):                        http://www.gnu.org/software/freeipmi/
  NetApp Manageability (NM) SDK (optional):   http://support.netapp.com/

skummee depends on freeipmi for the libipmimonitoring library, which is required if out-of-band IPMI monitoring is necessary. Also, skummee may use the NetApp Manageability SDK to monitor NetApp devices.

After these libraries are installed, skummee can be built by issuing the following command in the base skummee source directory.

  make

This builds the executable file named skummee.

CONFIGURATION

skummee keeps its configuration in the same MySQL tables where node and variable status are stored.

In order for skummee to run, ten tables need to be created, five of which must be populated. Also, the "thresh" database, which is the database name used by skummee, needs to be created and selected prior to table creation. This is accomplished by running at the MySQL prompt:

  mysql> create database thresh;
  mysql> use thresh;

The tables:

  The meta table:                 meta
  The nodes table:                <machine prefix>_n
  The variables table:            <machine prefix>_vl
  The variable thresholds table:  <machine prefix>_th
  The node/var mapping table:     <machine prefix>_v
  The analysis table:             <machine prefix>_an
  The lookup table:               <machine prefix>_lu
  The mod table:                  <machine prefix>_mod
  The email policies table:       <machine prefix>_ep
  The email addresses table:      <machine prefix>_ea

The meta table is a listing of all the clusters that will be polled. The nodes table contains the list of hosts to be monitored for some machine, along with several attributes. The variables table contains the list of all possible variables for any node in our machine (so the hosts grouped under one machine should be reasonably similar to each other to keep this table concise). The node/var mapping table outlines which variables from the variables table map to hosts from the nodes table. The variable thresholds table lists the various thresholds per variable, and has a many-to-one relationship with the variables table. The analysis table holds data related to node metrics that have crossed thresholds, and the data is kept for a period of time. The lookup table provides integer values for non-number string values that are returned from SNMP polls. The mod table contains a list of external scripts that are to be run to gather data externally. The format of these tables will be given in detail below.

The "machine prefix" is a unique name of a machine, so that more than one machine may be monitored by a single management host.
For instance, if we have a cluster of machines named "atlas", we would name our tables:

  atlas_n, atlas_vl, atlas_v, atlas_th, atlas_an ...

In this case, our machine prefix is "atlas", and, no, the prefix does NOT need to contain any of the letters or numbers of the actual machine name.

Continuing with our previous example, here's what the "variable" table could look like for our "atlas" machine:

mysql> describe atlas_vl;
+----------+----------------------+------+-----+--------------------+-------+
| Field    | Type                 | Null | Key | Default            | Extra |
+----------+----------------------+------+-----+--------------------+-------+
| var      | smallint(5) unsigned |      | PRI | 0                  |       |
| counter  | tinyint(3) unsigned  |      |     | 0                  |       |
| oid      | varchar(254)         |      |     | .1.3.6.1.2.1.1.3.0 |       |
| des      | varchar(64)          |      |     | Description        |       |
| alias    | varchar(32)          |      |     | alias              |       |
| blurb    | text                 | YES  |     | NULL               |       |
| discrete | tinyint(3) unsigned  |      |     | 0                  |       |
| var_type | tinyint(3) unsigned  |      |     | 0                  |       |
+----------+----------------------+------+-----+--------------------+-------+
8 rows in set (0.00 sec)

mysql>

And the SQL command required to create this table:

CREATE TABLE `atlas_vl` (
  `var` smallint(5) unsigned NOT NULL default '0',
  `counter` tinyint(3) unsigned NOT NULL default '0',
  `oid` varchar(254) NOT NULL default '.1.3.6.1.2.1.1.3.0',
  `des` varchar(64) NOT NULL default 'Description',
  `alias` varchar(32) NOT NULL default 'alias',
  `blurb` text,
  `discrete` tinyint(3) unsigned NOT NULL default '0',
  `var_type` tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY (`var`)
);

Some sample data:

mysql> select * from atlas_vl order by 1;
+-----+---------+------------------------------+-------------------------------+---------------+------------------------------------------------------------------------+----------+----------+
| var | counter | oid                          | des                           | alias         | blurb                                                                  | discrete | var_type |
+-----+---------+------------------------------+-------------------------------+---------------+------------------------------------------------------------------------+----------+----------+
|   1 |       0 | .1.3.6.1.4.1.2021.9.1.9.1    | % of / partition used         | /             | Percentage of / Filesystem Usage                                       |        0 |        0 |
|   2 |       0 | .1.3.6.1.4.1.2021.9.1.9.2    | % of /boot partition used     | /boot         | Percentage of /boot Filesystem Usage                                   |        0 |        0 |
|   3 |       0 | .1.3.6.1.4.1.2021.9.1.9.3    | % of /tmp partition used      | /tmp          | Percentage of /tmp Filesystem Usage                                    |        0 |        0 |
|   4 |       0 | .1.3.6.1.4.1.2021.9.1.9.4    | % of /usr partition used      | /usr          | Percentage of /usr Filesystem Usage                                    |        0 |        0 |
|   5 |       0 | .1.3.6.1.4.1.2021.9.1.9.5    | % of /var partition used      | /var          | Percentage of /var Filesystem Usage                                    |        0 |        0 |
|   6 |       0 | .1.3.6.1.4.1.2021.9.1.9.6    | % of /tftpboot partition used | /tftpboot     | Percentage of /tftpboot Filesystem Usage                               |        0 |        0 |
|   7 |       0 | .1.3.6.1.4.1.2021.2.1.100.1  | atd Daemon                    | atd           | atd daemon<br>0 = Process is Running<br>1 = Process is Dead            |        1 |        0 |
|   8 |       0 | .1.3.6.1.4.1.2021.2.1.100.2  | crond Daemon                  | crond         | crond daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|   9 |       0 | .1.3.6.1.4.1.2021.2.1.100.3  | ntpd Daemon                   | ntpd          | ntpd daemon<br>0 = Process is Running<br>1 = Process is Dead           |        1 |        0 |
|  10 |       0 | .1.3.6.1.4.1.2021.2.1.100.4  | portmap Daemon                | portmap       | portmap daemon<br>0 = Process is Running<br>1 = Process is Dead        |        1 |        0 |
|  11 |       0 | .1.3.6.1.4.1.2021.2.1.100.5  | rsshd Daemon                  | rsshd         | rsshd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  12 |       0 | .1.3.6.1.4.1.2021.2.1.100.6  | sshd Daemon                   | sshd          | sshd daemon<br>0 = Process is Running<br>1 = Process is Dead           |        1 |        0 |
|  13 |       0 | .1.3.6.1.4.1.2021.2.1.100.7  | syslog-ng Daemon              | syslog-ng     | syslog-ng daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  14 |       0 | .1.3.6.1.4.1.2021.2.1.100.8  | xinetd Daemon                 | xinetd        | xinetd daemon<br>0 = Process is Running<br>1 = Process is Dead         |        1 |        0 |
|  15 |       0 | .1.3.6.1.4.1.2021.2.1.100.9  | cerebrod Daemon               | cerebrod      | cerebrod daemon<br>0 = Process is Running<br>1 = Process is Dead       |        1 |        0 |
|  16 |       0 | .1.3.6.1.4.1.2021.2.1.100.10 | munged Daemon                 | munged        | munged daemon<br>0 = Process is Running<br>1 = Process is Dead         |        1 |        0 |
|  17 |       0 | .1.3.6.1.4.1.2021.2.1.100.11 | slurmd Daemon                 | slurmd        | slurmd daemon<br>0 = Process is Running<br>1 = Process is Dead         |        1 |        0 |
|  18 |       0 | .1.3.6.1.4.1.2021.2.1.100.12 | spd Daemon                    | spd           | spd daemon<br>0 = Process is Running<br>1 = Process is Dead            |        1 |        0 |
|  19 |       0 | .1.3.6.1.4.1.2021.2.1.100.13 | lrmmond Daemon                | lrmmond       | lrmmond daemon<br>0 = Process is Running<br>1 = Process is Dead        |        1 |        0 |
|  20 |       0 | .1.3.6.1.4.1.2021.2.1.100.14 | lrmrouted Daemon              | lrmrouted     | lrmrouted daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  21 |       0 | .1.3.6.1.4.1.2021.2.1.100.15 | pspd Daemon                   | pspd          | pspd daemon<br>0 = Process is Running<br>1 = Process is Dead           |        1 |        0 |
|  22 |       0 | .1.3.6.1.4.1.2021.2.1.100.16 | cupsd Daemon                  | cupsd         | cupsd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  23 |       0 | .1.3.6.1.4.1.2021.2.1.100.17 | slurmctld Daemon              | slurmctld     | slurmctld daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  24 |       0 | .1.3.6.1.4.1.2021.2.1.100.18 | conmand Daemon                | conmand       | conmand daemon<br>0 = Process is Running<br>1 = Process is Dead        |        1 |        0 |
|  25 |       0 | .1.3.6.1.4.1.2021.2.1.100.19 | powermand Daemon              | powermand     | powermand daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  26 |       0 | .1.3.6.1.4.1.2021.2.1.100.20 | httpd Daemon                  | httpd         | httpd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  27 |       0 | .1.3.6.1.4.1.2021.2.1.100.21 | named Daemon                  | named         | named daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  28 |       0 | .1.3.6.1.4.1.2021.2.1.100.22 | dhcpd Daemon                  | dhcpd         | dhcpd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  29 |       0 | .1.3.6.1.4.1.2021.2.1.100.23 | netdump-server Daemon         | netdump-serve | netdump-server daemon<br>0 = Process is Running<br>1 = Process is Dead |        1 |        0 |
|  30 |       0 | .1.3.6.1.4.1.2021.2.1.100.24 | sendmail Daemon               | sendmail      | sendmail daemon<br>0 = Process is Running<br>1 = Process is Dead       |        1 |        0 |
|  31 |       0 | .1.3.6.1.4.1.2021.2.1.100.25 | nsrexecd Daemon               | nsrexecd      | nsrexecd daemon<br>0 = Process is Running<br>1 = Process is Dead       |        1 |        0 |
|  32 |       0 | .1.3.6.1.4.1.2021.2.1.100.26 | xfs Daemon                    | xfs           | xfs daemon<br>0 = Process is Running<br>1 = Process is Dead            |        1 |        0 |
|  33 |       0 | .1.3.6.1.4.1.2021.10.1.3.1   | Load Average                  | load          | The Current Load Average                                               |        0 |        0 |
|  34 |       0 | .1.3.6.1.4.1.2021.4.4.0      | Available Swap Space          | swap          | The Current Amount of Unused Swap Space in Kilobytes                   |        0 |        0 |
|  35 |       1 | .1.3.6.1.4.1.2021.11.50.0    | User CPU Time                 | UCPU          | Percentage of CPU Time Consumed by User Processes Since the Last Poll  |        0 |        0 |
+-----+---------+------------------------------+-------------------------------+---------------+------------------------------------------------------------------------+----------+----------+
35 rows in set (0.00 sec)

Column Descriptions:

var - Simple unique integer for variable identification; Table index; Must be sequential, and must start at 1.

counter - Some SNMP variables are calculated as counter data. A "1" here indicates a counter variable. Otherwise, a "0" is given.

oid - The SNMP/IPMI/Symbol OID of the variable. Also acts as an index for scripts run from the "mod" table. The IPMI OID is simply the index number returned from an ipmi-sensors command from the freeipmi package.
For example:

# porterj /root > ipmi-sensors
Caching SDR repository information: /root/.freeipmi/sdr-cache/sdr-cache-porterj.localhost
Caching SDR record 59 of 59 (current record ID 59)
ID | Name          | Type        | Reading | Units | Event
3  | BB Inlet Temp | Temperature | 26.00   | C     | 'OK'
4  | SSB Temp      | Temperature | 43.00   | C     | 'OK'
5  | BB BMC Temp   | Temperature | 33.00   | C     | 'OK'
...
28 | BB +1.1V SB   | Voltage     | 1.08    | V     | 'OK'
29 | BB P3_3V STBY | Voltage     | 3.25    | V     | 'OK'
30 | BB 1_1V PCH   | Voltage     | 1.08    | V     | 'OK'
# porterj /root >

The OID for "SSB Temp" is 4, and for "BB +1.1V SB", it's 28.

The NetApp Symbol API OID has the format:

  <deviceStat>@<TrayID>@<Location>

If a device/stat does not have a TrayID, 0 should be used, and Location maps to the device number. For example, to monitor the status of the disk located in Tray 45, Location 38:

  drivestatus@45@38

Or the status of Controller 1:

  controller@0@1

skummee only supports a subset of the stats provided by the Symbol SDK, which is listed here:

  controller
  volumestatus
  drivestatus
  drivetemp
  drivepfa
  esm
  ups
  minihub
  gbic
  sfp
  interconnectCRU
  alarm
  processorMemoryDimm
  fan
  battery
  powerSupply
  thermalSensor
  drawer
  cacheMemoryDimm
  hostBoard
  volumeRCacheActive
  volumeWCacheActive
  volumeRCacheEnable
  volumeWCacheEnable
  volumeCacheMirrorActive
  volumeCacheMirrorEnable

des - Variable description. Not necessary for skummee to run, but provides a simple description for the user interface.

alias - Abbreviation of variable description. Not necessary for skummee to run, but provides a very short description for the user interface.

blurb - Verbose variable description. Not necessary for skummee to run, but provides a verbose description for the user interface.

discrete - For supplied user interfaces only. A value of 1 indicates that the variable has 2 or more discrete values, to be indicated as Normal, Warning, or Critical. Otherwise, a value of 0 is used.

var_type - Variable Type.
There are currently 6 types supported:

  0 = SNMP
  1 = IPMI Reading
  2 = INTERNAL
  3 = Reserved
  4 = IPMI State
  5 = IPMI System Event Log
  6 = NetApp Symbol API

For the 3 different IPMI variable types: Type 1, "IPMI Reading", is used for IPMI variables that return a gauge type, such as a temperature. Type 4, "IPMI State", is used for IPMI variables that return a discrete "Good" or "Bad" value. Type 5, "IPMI System Event Log", will return a count of IPMI SEL entries for a specified SEL entry type.

And here's a sample "node" table:

mysql> describe atlas_n;
+---------------+----------------------+------+-----+-----------+-------+
| Field         | Type                 | Null | Key | Default   | Extra |
+---------------+----------------------+------+-----+-----------+-------+
| host          | smallint(5) unsigned |      | PRI | 0         |       |
| status0       | tinyint(3) unsigned  |      |     | 0         |       |
| address       | varchar(16)          |      |     | 127.0.0.1 |       |
| name          | varchar(128)         |      | UNI | hostname  |       |
| community     | varchar(32)          |      |     | public    |       |
| status1       | tinyint(3) unsigned  |      |     | 0         |       |
| crit          | tinyint(3) unsigned  |      |     | 0         |       |
| port          | smallint(5) unsigned |      |     | 161       |       |
| version       | tinyint(3) unsigned  |      |     | 2         |       |
| ign           | tinyint(3) unsigned  |      |     | 0         |       |
| retries       | tinyint(3) unsigned  |      |     | 0         |       |
| ipmi_address  | varchar(16)          |      |     | 127.0.0.1 |       |
| host_user     | varchar(20)          |      |     | user      |       |
| host_password | varchar(20)          |      |     | password  |       |
| max_snmp      | smallint(5) unsigned |      |     | 65535     |       |
| timeout       | smallint(5) unsigned |      |     | 10        |       |
+---------------+----------------------+------+-----+-----------+-------+
16 rows in set (0.00 sec)

SQL create command:

CREATE TABLE `atlas_n` (
  `host` smallint(5) unsigned NOT NULL default '0',
  `status0` tinyint(3) unsigned NOT NULL default '2',
  `address` varchar(16) NOT NULL default '127.0.0.1',
  `name` varchar(128) NOT NULL default 'hostname',
  `community` varchar(32) NOT NULL default 'public',
  `status1` tinyint(3) unsigned NOT NULL default '2',
  `crit` tinyint(3) unsigned NOT NULL default '0',
  `port` smallint(5) unsigned NOT NULL default '161',
  `version` tinyint(3) unsigned NOT NULL default '2',
  `ign` tinyint(3) unsigned NOT NULL default '0',
  `retries` tinyint(3) unsigned NOT NULL default '0',
  `ipmi_address` varchar(16) NOT NULL default '127.0.0.1',
  `host_user` varchar(20) NOT NULL default 'user',
  `host_password` varchar(20) NOT NULL default 'password',
  `max_snmp` smallint(5) unsigned NOT NULL default '65535',
  `timeout` smallint(5) unsigned NOT NULL default '10',
  PRIMARY KEY (`host`),
  UNIQUE KEY `name_index` (`name`)
);

Sample data:

mysql> select * from igs_n where host < 10 order by 1;
+------+---------+--------------+------+-----------+---------+------+------+---------+-----+---------+--------------+-----------+---------------+----------+---------+
| host | status0 | address      | name | community | status1 | crit | port | version | ign | retries | ipmi_address | host_user | host_password | max_snmp | timeout |
+------+---------+--------------+------+-----------+---------+------+------+---------+-----+---------+--------------+-----------+---------------+----------+---------+
|    1 |       2 | 192.168.64.1 | igs1 | public    |       2 |    1 |  161 |       2 |   0 |       1 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    2 |       2 | 192.168.64.2 | igs2 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    3 |       2 | 192.168.64.3 | igs3 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    4 |       2 | 192.168.64.4 | igs4 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    5 |       2 | 192.168.64.5 | igs5 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    6 |       3 | 192.168.64.6 | igs6 | public    |       3 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    7 |       5 | 192.168.64.7 | igs7 | public    |       5 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    8 |       2 | 192.168.64.8 | igs8 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    9 |       2 | 192.168.64.9 | igs9 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
+------+---------+--------------+------+-----------+---------+------+------+---------+-----+---------+--------------+-----------+---------------+----------+---------+
9 rows in set (0.00 sec)

Column Descriptions:

host - Simple unique integer for node identification; Table index.

status0 - Current overall status of the host, which is the greatest status of all this host's variables. Alternates with status1 to provide a 1 poll history.

address - The IPv4 address of the host.

name - The node's name. Usually its hostname, but this is not necessary. The name here is only used for display purposes, and is not used for polling.

community - The community string used to access this node's SNMP variables.

status1 - Current overall status of the host, which is the greatest status of all this host's variables. Alternates with status0 to provide a 1 poll history.

crit - Denotes the importance of a node. Binary value: "1" = critical node. "0" = not critical.

port - The port number used by the SNMP server running on the node.

version - The SNMP version number to use for the SNMP request. Must be 1 or 2.

ign - The ignored state. A value of 1 indicates that the interface should not include this node. A value of 0 indicates that the node should be displayed normally.

retries - The number of times skummee should retry a network request, be it SNMP or IPMI.

ipmi_address - The IPMI IPv4 out-of-band address for the node.

host_user - If the node has IPMI variables, this is the IPMI user.

host_password - If the node has IPMI variables, this is the IPMI password.

max_snmp - The maximum number of SNMP variables to bundle per SNMP request. This should not normally be changed from the default of 65535, unless the remote SNMP agent cannot handle large SNMP message sizes.

timeout - The amount of time in seconds before a network request times out for this node.
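To make the effect of max_snmp concrete, here is a minimal sketch of splitting a node's OID list into groups of at most max_snmp entries, one group per SNMP request. This is not skummee's actual C code; the function name bundle_oids and the example value of 4 are assumptions made only for illustration.

```python
from typing import Iterator, List


def bundle_oids(oids: List[str], max_snmp: int) -> Iterator[List[str]]:
    """Yield successive groups of at most max_snmp OIDs (hypothetical helper)."""
    if max_snmp < 1:
        raise ValueError("max_snmp must be at least 1")
    for start in range(0, len(oids), max_snmp):
        yield oids[start:start + max_snmp]


# Six filesystem OIDs taken from the sample atlas_vl data.
oids = [".1.3.6.1.4.1.2021.9.1.9.%d" % i for i in range(1, 7)]

# With an illustrative limit of 4, the node would be polled in two requests.
batches = list(bundle_oids(oids, 4))

# With the default of 65535, everything fits in a single request.
single = list(bundle_oids(oids, 65535))
```

Lowering max_snmp trades more round trips for smaller messages, which is why the default is only worth changing when the remote agent cannot handle large requests.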
"node/var" mapping table: mysql> describe atlas_v; +---------------+----------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+----------------------+------+-----+---------+-------+ | host | smallint(5) unsigned | | PRI | 0 | | | var | smallint(5) unsigned | | PRI | 0 | | | status0 | tinyint(3) unsigned | | | 2 | | | status1 | tinyint(3) unsigned | | | 2 | | | internal_poll | double | | | -22222 | | | value0 | double | | | -22222 | | | value1 | double | | | -22222 | | +---------------+----------------------+------+-----+---------+-------+ 7 rows in set (0.00 sec) SQL Create command: CREATE TABLE `atlas_v` ( `host` smallint(5) unsigned NOT NULL default '0', `var` smallint(5) unsigned NOT NULL default '0', `status0` tinyint(3) unsigned NOT NULL default '2', `status1` tinyint(3) unsigned NOT NULL default '2', `internal_poll` double NOT NULL default '-22222', `value0` double NOT NULL default '-22222', `value1` double NOT NULL default '-22222', PRIMARY KEY (`host`,`var`) ); mysql> select * from atlas_v order by 1,2; mysql> select * from porter_v order by 1,2; +------+-----+---------+---------+---------------+-------------+-------------+ | host | var | status0 | status1 | internal_poll | value0 | value1 | +------+-----+---------+---------+---------------+-------------+-------------+ | 1 | 1 | 0 | 0 | -22222 | 0 | 0 | | 1 | 2 | 0 | 0 | -22222 | 0 | 0 | | 1 | 3 | 0 | 0 | -22222 | 0 | 0 | | 1 | 4 | 0 | 0 | -22222 | 0 | 0 | | 1 | 5 | 0 | 0 | -22222 | 3.08 | 4.04 | | 1 | 6 | 0 | 0 | -22222 | 28885 | 28901 | | 1 | 7 | 0 | 0 | -22222 | 3 | 3 | ... 
| 1080 | 376 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 377 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 378 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 379 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 380 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 381 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 382 | 0 | 0 | -22222 | 0 | 0 | | 1080 | 383 | 0 | 0 | -22222 | 0 | 0 | +------+-----+---------+---------+---------------+-------------+-------------+ 19520 rows in set (0.03 sec) mysql> Column Descriptions: host - Integer identifying the host from the "nodes" table; One of the two table indices var - Integer identifying the variable from the "variables" table; One of the two table indices status0 - Current variable status. Alternates with status1 to provide a 1 poll history. status1 - Current variable status. Alternates with status0 to provide a 1 poll history. internal_poll - This is where values provided by external scripts are stored. skummee will set the values back to NAN's after each poll to ensure the data is current. value0 - Current variable value. Alternates with value1 to provide a 1 poll history. value1 - Current variable value. Alternates with value0 to provide a 1 poll history. 
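The alternating 0/1 columns can be pictured with the small sketch below. It assumes that a selector value of N means the columns ending in N hold the most recent poll and the other pair holds the previous one; the README does not spell out that direction, so treat it as an assumption. The helper name poll_columns is hypothetical and not part of skummee.

```python
def poll_columns(stat: int) -> dict:
    """Map a 0/1 selector to current and previous column names.

    Assumption: selector N means statusN/valueN are the latest poll.
    """
    if stat not in (0, 1):
        raise ValueError("stat must be 0 or 1")
    other = 1 - stat
    return {
        "current_status": "status%d" % stat,
        "previous_status": "status%d" % other,
        "current_value": "value%d" % stat,
        "previous_value": "value%d" % other,
    }


# One row from the sample data above: host 1, var 5.
row = {"status0": 0, "status1": 0, "value0": 3.08, "value1": 4.04}

cols = poll_columns(1)
latest = row[cols["current_value"]]
previous = row[cols["previous_value"]]
```

Because only two columns exist per quantity, a user interface can show the latest reading next to the one before it without touching the RRD files.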
Sample "Thresh" table:

mysql> describe atlas_th;
+-------+----------------------+------+-----+---------+-------+
| Field | Type                 | Null | Key | Default | Extra |
+-------+----------------------+------+-----+---------+-------+
| var   | smallint(5) unsigned | NO   | MUL | 0       |       |
| low   | double               | YES  |     | NULL    |       |
| high  | double               | YES  |     | NULL    |       |
| state | tinyint(3) unsigned  | NO   |     | 2       |       |
+-------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> select * from atlas_th where var < 36 order by 1,4 desc;
+-----+------+---------+-------+
| var | low  | high    | state |
+-----+------+---------+-------+
|   1 |   98 |    NULL |   200 |
|   1 |   95 |      98 |   100 |
|   2 |   98 |    NULL |   200 |
|   2 |   95 |      98 |   100 |
|   3 |   98 |    NULL |   200 |
|   3 |   95 |      98 |   100 |
|   4 |   98 |    NULL |   200 |
|   4 |   95 |      98 |   100 |
|   5 |   98 |    NULL |   200 |
|   5 |   95 |      98 |   100 |
|   6 |   98 |    NULL |   200 |
|   6 |   95 |      98 |   100 |
|   7 |    1 |    NULL |   200 |
|   8 |    1 |    NULL |   200 |
|   9 |    1 |    NULL |   200 |
|  10 |    1 |    NULL |   200 |
|  11 |    1 |    NULL |   200 |
|  12 |    1 |    NULL |   200 |
|  13 |    1 |    NULL |   200 |
|  14 |    1 |    NULL |   200 |
|  15 |    1 |    NULL |   100 |
|  16 |    1 |    NULL |   200 |
|  17 |    1 |    NULL |   200 |
|  18 |    1 |    NULL |   200 |
|  19 |    1 |    NULL |   200 |
|  20 |    1 |    NULL |   200 |
|  21 |    1 |    NULL |   200 |
|  22 |    1 |    NULL |   200 |
|  23 |    1 |    NULL |   200 |
|  24 |    1 |    NULL |   200 |
|  25 |    1 |    NULL |   200 |
|  26 |    1 |    NULL |   200 |
|  27 |    1 |    NULL |   200 |
|  28 |    1 |    NULL |   200 |
|  29 |    1 |    NULL |   200 |
|  30 |    1 |    NULL |   200 |
|  31 |    1 |    NULL |   200 |
|  32 |    1 |    NULL |   200 |
|  33 |   32 |    NULL |   100 |
|  34 | NULL | 1000000 |   100 |
+-----+------+---------+-------+
40 rows in set (0.00 sec)

SQL Create command:

CREATE TABLE `atlas_th` (
  `var` smallint(5) unsigned NOT NULL default '0',
  `low` double default NULL,
  `high` double default NULL,
  `state` tinyint(3) unsigned NOT NULL default '2',
  KEY `ind` (`var`,`low`)
)

Column Descriptions:

var - Integer identifying the variable from the "variables" table; One of the three table indices.
low - The lower bound to use for this particular threshold; One of the three table indices. A value of NULL indicates that there is no lower bound (the lower bound is negatively infinite).

high - The upper bound to use for this particular threshold; One of the three table indices. A value of NULL indicates that there is no upper bound (the upper bound is infinite).

state - The value to assign to the status of a variable if it is determined that the variable's current value is between the low and high values. State may be 1 of 254 possible values, from 1 to 254 (0 and 255 are reserved for the Nominal and Timeout states, respectively). The higher the state value, the more serious the condition.

Sample "Meta" table:

mysql> describe meta;
+-----------------+----------------------+------+-----+----------------+-------+
| Field           | Type                 | Null | Key | Default        | Extra |
+-----------------+----------------------+------+-----+----------------+-------+
| name            | varchar(32)          | NO   | PRI |                |       |
| display         | varchar(32)          | NO   |     |                |       |
| rrd_path        | varchar(64)          | NO   |     | /tmp           |       |
| stale           | int(10) unsigned     | NO   |     | 0              |       |
| stat            | tinyint(3) unsigned  | NO   |     | 0              |       |
| priority        | smallint(5) unsigned | NO   |     | 0              |       |
| pid             | int(10) unsigned     | NO   |     | 0              |       |
| hidden          | tinyint(3) unsigned  | NO   |     | 0              |       |
| crit0           | smallint(5) unsigned | NO   |     | 0              |       |
| crit1           | smallint(5) unsigned | NO   |     | 0              |       |
| noncrit0        | smallint(5) unsigned | NO   |     | 0              |       |
| noncrit1        | smallint(5) unsigned | NO   |     | 0              |       |
| nom0            | smallint(5) unsigned | NO   |     | 0              |       |
| nom1            | smallint(5) unsigned | NO   |     | 0              |       |
| critmax         | tinyint(3) unsigned  | NO   |     | 0              |       |
| noncritmax      | tinyint(3) unsigned  | NO   |     | 0              |       |
+-----------------+----------------------+------+-----+----------------+-------+
16 rows in set (0.00 sec)

mysql> select * from meta;
+------+---------+-----------------------+------------+------+----------+-----+--------+-------+-------+----------+----------+------+------+---------+------------+
| name | display | rrd_path              | stale      | stat | priority | pid | hidden | crit0 | crit1 | noncrit0 | noncrit1 | nom0 | nom1 | critmax | noncritmax |
+------+---------+-----------------------+------------+------+----------+-----+--------+-------+-------+----------+----------+------+------+---------+------------+
| hype | Hype    | /var/skummee/hype/rrd | 1233169218 |    0 |        0 |   0 |      0 |     1 |     1 |      133 |      133 |    7 |    7 |      50 |        255 |
| test | Test    | /var/skummee/test/rrd | 1219163418 |    0 |        0 |   0 |      0 |     2 |     2 |       50 |       50 |   60 |   60 |      50 |        100 |
+------+---------+-----------------------+------------+------+----------+-----+--------+-------+-------+----------+----------+------+------+---------+------------+
2 rows in set (0.00 sec)

SQL Create command:

CREATE TABLE `meta` (
  `name` varchar(32) NOT NULL default '',
  `display` varchar(32) NOT NULL default '',
  `rrd_path` varchar(64) NOT NULL default '/tmp',
  `stale` int(10) unsigned NOT NULL default '0',
  `stat` tinyint(3) unsigned NOT NULL default '0',
  `priority` smallint(5) unsigned NOT NULL default '0',
  `pid` int(10) unsigned NOT NULL default '0',
  `hidden` tinyint(3) unsigned NOT NULL default '0',
  `crit0` smallint(5) unsigned NOT NULL default '0',
  `crit1` smallint(5) unsigned NOT NULL default '0',
  `noncrit0` smallint(5) unsigned NOT NULL default '0',
  `noncrit1` smallint(5) unsigned NOT NULL default '0',
  `nom0` smallint(5) unsigned NOT NULL default '0',
  `nom1` smallint(5) unsigned NOT NULL default '0',
  `critmax` tinyint(3) unsigned NOT NULL default '0',
  `noncritmax` tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY (`name`)
)

Column Descriptions:

name: The machine name used by skummee to determine the table names (<name>_n, <name>_vl, etc.).

display: How the machine will be displayed for the user interface.

rrd_path: Path to the RRD files.

stale: The unix timestamp for when the cluster was last successfully polled.

stat: The alternating (between 0 and 1) value that determines which stat column to use in the node and mapping tables.
priority: This is used to determine the machine display order for the user interface. pid: The process ID of any previous skummee instance. If non-zero, indicates that either a previous instance of skummee is still running, or that the last instance of skummee was terminated abnormally. hidden: A value of 1 will hide the machine from the user interface, but will continue to be polled normally. A value of 0 is used for normal operation. crit0/crit1: Set by skummee at the end of each poll to be the number of critical nodes that are not nominal. noncrit0/noncrit1: Set by skummee at the end of each poll to be the number of non-critical nodes that are not nominal. nom0/nom1: Set by skummee at the end of each poll to be the number of nodes that are nominal. critmax: Set by skummee at the end of each poll to be the status of the critical node(s) with the highest status. noncritmax: Set by skummee at the end of each poll to be the status of the non-critical node(s) with the highest status. Sample "mod" table: mysql> describe atlas_mod; +-----------+------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+------------------+------+-----+---------+-------+ | command | varchar(255) | | PRI | | | | arguments | varchar(255) | | PRI | | | | timeout | int(10) unsigned | | | 10 | | +-----------+------------------+------+-----+---------+-------+ 3 rows in set (0.00 sec) mysql> select * from atlas_mod; +-----------------------------------+-----------+---------+ | command | arguments | timeout | +-----------------------------------+-----------+---------+ | /usr/bin/slurm_failures | -c 7 | 3 | | /usr/bin/mpi_failures | | 5 | | /tmp/hello_world | --leak=no | 10 | +-----------------------------------+-----------+---------+ 3 rows in set (0.00 sec) SQL Create command: CREATE TABLE `atlas_mod` ( `command` varchar(255) NOT NULL default '', `arguments` varchar(255) NOT NULL default '', `timeout` int(10) unsigned NOT NULL default '10', 
  PRIMARY KEY (`command`,`arguments`)
);

The "mod" table provides a method for script-based data gathering. The output of the script must give exactly one value per line, in the following format:

nodename1,oid1,value1
nodename2,oid2,value2
...

The node names correspond to the "name" column from the node table, and the OIDs correspond to the "oid" column from the variable list table. When skummee runs, it executes the scripts in the "mod" table and stores their output in the "internal_poll" column of the node/var mapping table. It then reads each value when normal polling for the node is performed, after which the "internal_poll" column is set back to NAN.

Column Descriptions:

command: Path to executable.
arguments: Arguments for the executable in the "command" column.
timeout: The amount of time in seconds given for the command to run. If the timeout value is exceeded, the process associated with the command is sent a kill signal.

Sample "analysis" Table:

mysql> describe atlas_an;
+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| host      | smallint(5) unsigned |      | PRI | 0       |       |
| epochtime | int(10) unsigned     |      | PRI | 0       |       |
| status    | tinyint(3) unsigned  |      |     | 0       |       |
| var_list  | varchar(254)         | YES  |     | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> select * from atlas_an limit 5;
+------+------------+--------+-------------------------------------------------------------------------------------+
| host | epochtime  | status | var_list                                                                            |
+------+------------+--------+-------------------------------------------------------------------------------------+
|    0 | 1176911100 |      5 | 43                                                                                  |
|   29 | 1176911100 |      3 | 209                                                                                 |
|   32 | 1176911100 |      5 | 43                                                                                  |
|  123 | 1176911100 |      7 | 206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,229 |
|  128 | 1176911100 |      3 | 209                                                                                 |
+------+------------+--------+-------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_an` (
  `host` smallint(5) unsigned NOT NULL default '0',
  `epochtime` int(10) unsigned NOT NULL default '0',
  `status` tinyint(3) unsigned NOT NULL default '0',
  `var_list` varchar(254) default NULL,
  PRIMARY KEY (`epochtime`,`host`)
);

The analysis table provides a snapshot, for each poll, of all the nodes that have a status other than "NOMINAL", along with all the variables that are not NOMINAL. Data in the analysis table that is older than 1000000 seconds (about 11.6 days) is trimmed to prevent the table from growing excessively over time.

Column Descriptions:

host: The host number, as indexed by the host column in the node table.
epochtime: The unix timestamp of the polling time.
status: The highest status of all the variables for the host at polling time.
var_list: A comma separated list of variables for the host that have a status other than NOMINAL at polling time.
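As a rough illustration of how analysis rows could be consumed, the following Python sketch splits a var_list value into variable numbers and applies the 1000000-second retention window. The function and constant names are hypothetical and are not taken from the skummee source (which is written in C); only the comma separated format and the retention figure come from this README.

```python
# Illustrative sketch only; names are hypothetical, not taken from skummee.

RETENTION_SECONDS = 1_000_000  # retention window stated in this README


def parse_var_list(var_list):
    """Split a comma separated var_list value into integer variable numbers."""
    if not var_list:  # the column may be NULL (None) or empty
        return []
    return [int(item) for item in var_list.split(",")]


def trim_cutoff(poll_epochtime):
    """Return the epochtime below which analysis rows count as expired."""
    return poll_epochtime - RETENTION_SECONDS


def trim(rows, poll_epochtime):
    """Keep only (host, epochtime, status, var_list) rows that are not expired."""
    cutoff = trim_cutoff(poll_epochtime)
    return [row for row in rows if row[1] >= cutoff]
```

In a real deployment the trimming would more likely be a single SQL DELETE against the analysis table; the list-based version here only shows the arithmetic.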

Sample "lookup" Table:

mysql> describe atlas_lu;
+--------------+-------------+------+-----+---------+-------+
| Field        | Type        | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| value_string | varchar(64) |      | PRI |         |       |
| value_int    | int(11)     |      |     | 0       |       |
+--------------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> select * from atlas_lu limit 5;
+--------------------------------+-----------+
| value_string                   | value_int |
+--------------------------------+-----------+
| FALSE                          |         0 |
| TRUE                           |        16 |
| SFP transmitter OK             |        16 |
| SFP transmitter fault detected |         0 |
| SFP signal OK                  |        16 |
+--------------------------------+-----------+
5 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_lu` (
  `value_string` varchar(64) NOT NULL default '',
  `value_int` int(11) NOT NULL default '0',
  PRIMARY KEY (`value_string`)
);

Because SNMP values are frequently strings and skummee only works with numbers, strings must be converted to numbers for effective SNMP monitoring; the lookup table provides that mapping. Currently, the lookup table is only used for values returned from SNMP variables.

Column Descriptions:

value_string: String value expected from an SNMP query, with leading and trailing whitespace removed.
value_int: Integer value to use for the specified string for threshold checking and history.
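The string-to-number conversion described above might look roughly like the Python sketch below. The dictionary stands in for the atlas_lu table and reuses the sample rows shown; the function name, the choice to consult the lookup before trying a numeric parse, and the handling of unknown strings are all assumptions, not documented skummee behavior.

```python
# Illustrative sketch only; the dictionary stands in for the atlas_lu table.

LOOKUP = {
    "FALSE": 0,
    "TRUE": 16,
    "SFP transmitter OK": 16,
    "SFP transmitter fault detected": 0,
    "SFP signal OK": 16,
}


def snmp_value_to_number(raw, lookup=LOOKUP):
    """Convert a raw SNMP string to a number, or return None if impossible."""
    text = raw.strip()  # value_string is matched with surrounding whitespace removed
    if text in lookup:
        return float(lookup[text])
    try:
        return float(text)  # the value was already numeric
    except ValueError:
        return None  # assumption: unknown strings are treated as missing
```

For example, `snmp_value_to_number("  TRUE \n")` yields the value_int stored for TRUE in the sample table.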

Sample Email Addresses Table:

mysql> describe atlas_ea;
+---------------+----------------------+------+-----+--------------------+-------+
| Field         | Type                 | Null | Key | Default            | Extra |
+---------------+----------------------+------+-----+--------------------+-------+
| email_address | varchar(255)         |      | PRI | nobody@example.com |       |
| policy_number | smallint(5) unsigned |      | PRI | 0                  |       |
+---------------+----------------------+------+-----+--------------------+-------+
2 rows in set (0.00 sec)

mysql> select * from atlas_ea;
+--------------------+---------------+
| email_address      | policy_number |
+--------------------+---------------+
| admin1@example.com |             0 |
| admin2@example.com |             0 |
| admin3@example.com |             0 |
| admin3@example.com |             1 |
| god@example.com    |             1 |
+--------------------+---------------+
5 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_ea` (
  `email_address` varchar(255) NOT NULL default 'nobody@example.com',
  `policy_number` smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY (`policy_number`,`email_address`)
);

Column Descriptions:

email_address: The email address of the user.
policy_number: The policy number to use, as referenced from the email policy table.

Sample Email Policy Table:

mysql> describe atlas_ep;
+-------------------+----------------------+------+-----+---------+-------+
| Field             | Type                 | Null | Key | Default | Extra |
+-------------------+----------------------+------+-----+---------+-------+
| policy_number     | smallint(5) unsigned |      | PRI | 0       |       |
| frequency         | int(10) unsigned     |      |     | 86400   |       |
| last_mailed       | int(10) unsigned     |      |     | 0       |       |
| do_count          | tinyint(3) unsigned  |      |     | 0       |       |
| count_non_crit    | tinyint(3) unsigned  |      |     | 0       |       |
| count_threshold   | int(10) unsigned     |      |     | 0       |       |
| count_status_low  | tinyint(3) unsigned  |      |     | 0       |       |
| count_status_high | tinyint(3) unsigned  |      |     | 0       |       |
| do_percentage     | tinyint(3) unsigned  |      |     | 0       |       |
| per_non_crit      | tinyint(3) unsigned  |      |     | 0       |       |
| per_threshold     | tinyint(3) unsigned  |      |     | 0       |       |
| per_status_low    | tinyint(3) unsigned  |      |     | 0       |       |
| per_status_high   | tinyint(3) unsigned  |      |     | 0       |       |
| do_run_failure    | tinyint(3) unsigned  |      |     | 0       |       |
+-------------------+----------------------+------+-----+---------+-------+
14 rows in set (0.00 sec)

mysql> select * from atlas_ep;
+---------------+-----------+-------------+----------+----------------+-----------------+------------------+-------------------+---------------+--------------+---------------+----------------+-----------------+----------------+
| policy_number | frequency | last_mailed | do_count | count_non_crit | count_threshold | count_status_low | count_status_high | do_percentage | per_non_crit | per_threshold | per_status_low | per_status_high | do_run_failure |
+---------------+-----------+-------------+----------+----------------+-----------------+------------------+-------------------+---------------+--------------+---------------+----------------+-----------------+----------------+
|             0 |     86400 |           0 |        1 |              0 |               2 |              200 |               255 |             1 |            1 |            35 |            100 |             255 |              0 |
|             1 |      3600 |           0 |        0 |              0 |               0 |                0 |                 0 |             0 |            0 |             0 |              0 |               0 |              1 |
+---------------+-----------+-------------+----------+----------------+-----------------+------------------+-------------------+---------------+--------------+---------------+----------------+-----------------+----------------+
2 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_ep` (
  `policy_number` smallint(5) unsigned NOT NULL default '0',
  `frequency` int(10) unsigned NOT NULL default '86400',
  `last_mailed` int(10) unsigned NOT NULL default '0',
  `do_count` tinyint(3) unsigned NOT NULL default '0',
  `count_non_crit` tinyint(3) unsigned NOT NULL default '0',
  `count_threshold` int(10) unsigned NOT NULL default '0',
  `count_status_low` tinyint(3) unsigned NOT NULL default '0',
  `count_status_high` tinyint(3) unsigned NOT NULL default '0',
  `do_percentage` tinyint(3) unsigned NOT NULL default '0',
  `per_non_crit` tinyint(3) unsigned NOT NULL default '0',
  `per_threshold` tinyint(3) unsigned NOT NULL default '0',
  `per_status_low` tinyint(3) unsigned NOT NULL default '0',
  `per_status_high` tinyint(3) unsigned NOT NULL default '0',
  `do_run_failure` tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY (`policy_number`)
);

The email policy table describes the policies to use for automatic email notification in response to events caused by thresholds being crossed.

Column Descriptions:

policy_number: The table index.
frequency: The smallest interval (in seconds) at which emails are sent when the policy is activated.
last_mailed: The unix epoch time of the last mailing for this policy. Used with the frequency value to determine whether an email should be sent based on time.
do_count: Include a count of nodes to determine if this policy is activated.
count_non_crit: Include non-critical nodes when counting.
count_threshold: The count of nodes must be equal to or greater than this number for the policy to be activated.
count_status_low/count_status_high: A node's status must lie between count_status_low and count_status_high (inclusive) in order to be counted.
do_percentage: Include a percentage of nodes to determine if this policy is activated.
per_non_crit: Include non-critical nodes when determining percentage.
per_threshold: The percentage of nodes must be equal to or greater than this number for the policy to be activated.
per_status_low/per_status_high: A node's status must lie between per_status_low and per_status_high (inclusive) in order to be included in the percentage.
do_run_failure: Activate this policy when skummee fails to run successfully.

Given the above examples, policy 0 would translate to:

1) Only mail at 24 hour intervals when this policy is met (86400 seconds).
2) Count nodes that have a status greater than or equal to 200 and less than or equal to 255.
3) Don't include non-critical nodes when counting.
4) If the count is greater than or equal to 2, and the last mailing occurred more than 86400 seconds ago, mail all users in the atlas_ea table having policy 0.
5) Count all nodes that have a status greater than or equal to 100 and less than or equal to 255, including non-critical nodes, divide this by the number of all nodes, and multiply this by 100. If this is greater than or equal to 35, mail all users in the atlas_ea table having policy 0.
6) Don't consider run failures with this policy.

And policy 1:

1) Mail all users in the atlas_ea table that have policy 1 as their policy whenever skummee fails to run, but not more frequently than once every 3600 seconds (1 hour).

----

More on Thresholds

Variable status is assigned based on threshold values from the threshold table.
Node status is assigned based on the node's maximum variable status, or on the node failing to respond within a timeout period (currently set to 10 seconds in the above example), so it is important to create threshold levels that increase numerically as the conditions increase in severity.

GETTING STARTED

skummee is designed to run out of cron every 5 minutes, based on the way that skummee inserts historical data into RRD files. The RRD files contain 2 RRAs: one value every 5 minutes kept for 1 month, and one value every 4 hours kept for 1 year. Running skummee at intervals other than 5 minutes will give unpredictable RRD results.

Once the MySQL tables are in place and populated, skummee is run with the following command line options:

skummee -c [config file]

Here is a sample config file, where a line beginning with # is a comment:

# Machine
machine atlas
# MySQL user for skummee
user skummee
# MySQL password for skummee
pass SomePassword

An example run would be:

skummee -c /etc/skummee.conf

It's best to run skummee a few times at the command line before inserting it into cron, in order to shake out any configuration issues.

The first time skummee is run, it builds the RRD files based on their configuration in the MySQL tables, with each node corresponding to 1 RRD file. Each node/RRD file contains the exact number and names of the data sources found in the MySQL tables. If the number or nature of a node's data sources changes (because the user changes the MySQL tables), skummee will detect the inconsistency between what it polls and the RRD file, and will make the necessary changes on the fly.
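
To make the config file format from this section concrete, here is a minimal Python sketch that reads "key value" lines and skips comment lines. It is not skummee's parser (skummee is written in C, and its handling of edge cases such as repeated keys or inline comments is not documented here); the function name and the returned dictionary are assumptions made for illustration.

```python
# Illustrative sketch only; skummee's real parser may behave differently.


def parse_config(text):
    """Parse 'key value' lines, skipping blank lines and '#' comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


SAMPLE = """\
# Machine
machine atlas
# MySQL user for skummee
user skummee
# MySQL password for skummee
pass SomePassword
"""

config = parse_config(SAMPLE)
```

A quick check like this before adding the cron entry can catch a misspelled key or a missing value in the config file.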