File Date Author Commit
 contrib 2014-04-14 cunnijd [r4] Add php, update README.
 AUTHORS 2010-11-29 cunnijd [r1] First revision.
 COPYING 2010-11-29 cunnijd [r1] First revision.
 DISCLAIMER 2010-11-29 cunnijd [r1] First revision.
 META 2014-04-14 cunnijd [r3] New release.
 Make-rpm.mk 2010-11-29 cunnijd [r1] First revision.
 Makefile 2014-04-14 cunnijd [r3] New release.
 README 2014-04-14 cunnijd [r4] Add php, update README.
 get_ipmi.c 2010-11-29 cunnijd [r1] First revision.
 get_snmp.c 2010-11-29 cunnijd [r1] First revision.
 get_symbol.cpp 2014-04-14 cunnijd [r3] New release.
 skummee.c 2014-04-14 cunnijd [r3] New release.
 skummee.h 2014-04-14 cunnijd [r3] New release.
 skummee.spec 2014-04-14 cunnijd [r3] New release.

Read Me

INTRODUCTION

  skummee is a package created for "monitoring" hosts in a large-scale
  environment, mostly with SNMP.  Monitoring in this context includes
  threshold status checking/reporting, and historical trending.

  skummee was designed with large clusters in mind, taking advantage of
  the inherent parallel nature of many similarly configured nodes,
  although this tool is able to function in heterogeneous environments
  as well.  skummee implements parallelism with process threads, where each
  host is assigned a process for data gathering and storage.

  skummee stores its data in two forms: in MySQL tables indicating host/variable
  status, and in Round Robin Database (RRD) files for historical trending.
  This allows for quick access to overall machine status (MySQL) and 
  comprehensive/graphical trending (RRD), both of which have PHP interfaces
  for web display.


INSTALLATION

  skummee depends on several libraries that must be installed before it
  can be compiled.  The libraries listed here are the ones that are
  usually not part of a standard development environment.


  MySQL:

  http://www.mysql.com/


  Net-SNMP:

  http://net-snmp.sourceforge.net/


  RRD Tool:

  http://oss.oetiker.ch/rrdtool/


  SSL:

  http://www.openssl.org/

  
  PROCPS:

  http://procps.sourceforge.net/

 
  FreeIPMI (optional):

  http://www.gnu.org/software/freeipmi/


  NetApp Manageability (NM) SDK (optional):

  http://support.netapp.com/

  skummee depends on FreeIPMI for the libipmimonitoring library, which is
  required only if out-of-band IPMI monitoring is needed.  skummee can also
  use the NetApp Manageability SDK to monitor NetApp devices.

  After these libraries are installed, skummee can be built by issuing the
  following command in the base skummee source directory.

  make

  This builds the executable file named skummee.


CONFIGURATION

  skummee keeps its configuration in the same MySQL tables where node and
  variable status are stored.  In order for skummee to run, there are ten
  tables that need to be created, five of which must be populated.  Also,
  the "thresh" database, which is the database name used by skummee,
  needs to be created and used prior to table creation.  This is
  accomplished by running at the MySQL prompt:

  mysql> create database thresh;
  mysql> use thresh;

  The tables:

  The meta table: meta
  The nodes table: <machine prefix>_n
  The variables table: <machine prefix>_vl
  The variable thresholds table: <machine prefix>_th
  The node/var mapping table: <machine prefix>_v
  The analysis table: <machine prefix>_an
  The lookup table: <machine prefix>_lu
  The mod table: <machine prefix>_mod
  The email policies table: <machine prefix>_ep
  The email addresses table: <machine prefix>_ea

  The meta table is a listing of all the clusters that will be polled.

  The nodes table contains the list of hosts to be monitored for some
  machine, along with several attributes.

  The variables table contains the list of all possible variables for any
  node in our machine.  This table stays concise when the hosts grouped
  into one machine are similar to one another.

  The node/var mapping table outlines which variables from the variables
  table map to hosts from the nodes table.

  The variable thresholds table lists the various thresholds per
  variable, and has a many-to-one relationship with the variables table.

  The analysis table holds data related to node metrics that have crossed
  thresholds, and the data is kept for a period of time.

  The lookup table provides integer values for non-number string values
  that are returned from SNMP polls.

  The mod table contains a list of external scripts that are to be run
  to gather data externally.

  The format of these tables will be given in detail below.
  
  The "machine prefix" is a unique name of a machine,
  so that more than one machine may be monitored by a single management
  host.  For instance, if we have a cluster of machines named "atlas",
  we would name our tables:

  "atlas": atlas_n, atlas_vl, atlas_v, atlas_th, atlas_an ...

  In this case, our machine prefix is "atlas", and, no, the prefix does
  NOT need to contain any of the letters or numbers of the actual machine name.
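
  The naming scheme above can be sketched in a few lines of Python.  This is
  only an illustration: the dictionary labels and the table_names() helper
  are hypothetical and are not part of skummee; only the suffixes come from
  the table list above.

```python
# Suffixes taken from the table list above; the dictionary keys are
# illustrative labels, not names used by skummee itself.
TABLE_SUFFIXES = {
    "nodes": "n",
    "variables": "vl",
    "thresholds": "th",
    "mapping": "v",
    "analysis": "an",
    "lookup": "lu",
    "mod": "mod",
    "email_policies": "ep",
    "email_addresses": "ea",
}


def table_names(prefix):
    """Return the per-machine table names for a machine prefix."""
    return {label: f"{prefix}_{suffix}" for label, suffix in TABLE_SUFFIXES.items()}


names = table_names("atlas")
print(names["nodes"], names["variables"], names["mapping"])
```

  The meta table is left out because it is shared by all machines and
  carries no prefix.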
  
  Continuing with our previous example, here's what the "variable" table could look
  like for our "atlas" machine:


mysql> describe atlas_vl;
+----------+----------------------+------+-----+--------------------+-------+
| Field    | Type                 | Null | Key | Default            | Extra |
+----------+----------------------+------+-----+--------------------+-------+
| var      | smallint(5) unsigned |      | PRI | 0                  |       |
| counter  | tinyint(3) unsigned  |      |     | 0                  |       |
| oid      | varchar(254)         |      |     | .1.3.6.1.2.1.1.3.0 |       |
| des      | varchar(64)          |      |     | Description        |       |
| alias    | varchar(32)          |      |     | alias              |       |
| blurb    | text                 | YES  |     | NULL               |       |
| discrete | tinyint(3) unsigned  |      |     | 0                  |       |
| var_type | tinyint(3) unsigned  |      |     | 0                  |       |
+----------+----------------------+------+-----+--------------------+-------+
8 rows in set (0.00 sec)

mysql>

And the SQL command required to create this table:

CREATE TABLE `atlas_vl` (
  `var` smallint(5) unsigned NOT NULL default '0',
  `counter` tinyint(3) unsigned NOT NULL default '0',
  `oid` varchar(254) NOT NULL default '.1.3.6.1.2.1.1.3.0',
  `des` varchar(64) NOT NULL default 'Description',
  `alias` varchar(32) NOT NULL default 'alias',
  `blurb` text,
  `discrete` tinyint(3) unsigned NOT NULL default '0',
  `var_type` tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY  (`var`)
);

Some sample data:

mysql> select * from atlas_vl order by 1;
+-----+---------+------------------------------+-------------------------------+---------------+------------------------------------------------------------------------+----------+----------+
| var | counter | oid                          | des                           | alias         | blurb                                                                  | discrete | var_type |
+-----+---------+------------------------------+-------------------------------+---------------+------------------------------------------------------------------------+----------+----------+
|   1 |       0 | .1.3.6.1.4.1.2021.9.1.9.1    | % of / partition used         | /             | Percentage of / Filesystem Usage                                       |        0 |        0 |
|   2 |       0 | .1.3.6.1.4.1.2021.9.1.9.2    | % of /boot partition used     | /boot         | Percentage of /boot Filesystem Usage                                   |        0 |        0 |
|   3 |       0 | .1.3.6.1.4.1.2021.9.1.9.3    | % of /tmp partition used      | /tmp          | Percentage of /tmp Filesystem Usage                                    |        0 |        0 |
|   4 |       0 | .1.3.6.1.4.1.2021.9.1.9.4    | % of /usr partition used      | /usr          | Percentage of /usr Filesystem Usage                                    |        0 |        0 |
|   5 |       0 | .1.3.6.1.4.1.2021.9.1.9.5    | % of /var partition used      | /var          | Percentage of /var Filesystem Usage                                    |        0 |        0 |
|   6 |       0 | .1.3.6.1.4.1.2021.9.1.9.6    | % of /tftpboot partition used | /tftpboot     | Percentage of /tftpboot Filesystem Usage                               |        0 |        0 |
|   7 |       0 | .1.3.6.1.4.1.2021.2.1.100.1  | atd Daemon                    | atd           | atd daemon<br>0 = Process is Running<br>1 = Process is Dead            |        1 |        0 |
|   8 |       0 | .1.3.6.1.4.1.2021.2.1.100.2  | crond Daemon                  | crond         | crond daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|   9 |       0 | .1.3.6.1.4.1.2021.2.1.100.3  | ntpd Daemon                   | ntpd          | ntpd daemon<br>0 = Process is Running<br>1 = Process is Dead           |        1 |        0 |
|  10 |       0 | .1.3.6.1.4.1.2021.2.1.100.4  | portmap Daemon                | portmap       | portmap daemon<br>0 = Process is Running<br>1 = Process is Dead        |        1 |        0 |
|  11 |       0 | .1.3.6.1.4.1.2021.2.1.100.5  | rsshd Daemon                  | rsshd         | rsshd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  12 |       0 | .1.3.6.1.4.1.2021.2.1.100.6  | sshd Daemon                   | sshd          | sshd daemon<br>0 = Process is Running<br>1 = Process is Dead           |        1 |        0 |
|  13 |       0 | .1.3.6.1.4.1.2021.2.1.100.7  | syslog-ng Daemon              | syslog-ng     | syslog-ng daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  14 |       0 | .1.3.6.1.4.1.2021.2.1.100.8  | xinetd Daemon                 | xinetd        | xinetd daemon<br>0 = Process is Running<br>1 = Process is Dead         |        1 |        0 |
|  15 |       0 | .1.3.6.1.4.1.2021.2.1.100.9  | cerebrod Daemon               | cerebrod      | cerebrod daemon<br>0 = Process is Running<br>1 = Process is Dead       |        1 |        0 |
|  16 |       0 | .1.3.6.1.4.1.2021.2.1.100.10 | munged Daemon                 | munged        | munged daemon<br>0 = Process is Running<br>1 = Process is Dead         |        1 |        0 |
|  17 |       0 | .1.3.6.1.4.1.2021.2.1.100.11 | slurmd Daemon                 | slurmd        | slurmd daemon<br>0 = Process is Running<br>1 = Process is Dead         |        1 |        0 |
|  18 |       0 | .1.3.6.1.4.1.2021.2.1.100.12 | spd Daemon                    | spd           | spd daemon<br>0 = Process is Running<br>1 = Process is Dead            |        1 |        0 |
|  19 |       0 | .1.3.6.1.4.1.2021.2.1.100.13 | lrmmond Daemon                | lrmmond       | lrmmond daemon<br>0 = Process is Running<br>1 = Process is Dead        |        1 |        0 |
|  20 |       0 | .1.3.6.1.4.1.2021.2.1.100.14 | lrmrouted Daemon              | lrmrouted     | lrmrouted daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  21 |       0 | .1.3.6.1.4.1.2021.2.1.100.15 | pspd Daemon                   | pspd          | pspd daemon<br>0 = Process is Running<br>1 = Process is Dead           |        1 |        0 |
|  22 |       0 | .1.3.6.1.4.1.2021.2.1.100.16 | cupsd Daemon                  | cupsd         | cupsd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  23 |       0 | .1.3.6.1.4.1.2021.2.1.100.17 | slurmctld Daemon              | slurmctld     | slurmctld daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  24 |       0 | .1.3.6.1.4.1.2021.2.1.100.18 | conmand Daemon                | conmand       | conmand daemon<br>0 = Process is Running<br>1 = Process is Dead        |        1 |        0 |
|  25 |       0 | .1.3.6.1.4.1.2021.2.1.100.19 | powermand Daemon              | powermand     | powermand daemon<br>0 = Process is Running<br>1 = Process is Dead      |        1 |        0 |
|  26 |       0 | .1.3.6.1.4.1.2021.2.1.100.20 | httpd Daemon                  | httpd         | httpd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  27 |       0 | .1.3.6.1.4.1.2021.2.1.100.21 | named Daemon                  | named         | named daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  28 |       0 | .1.3.6.1.4.1.2021.2.1.100.22 | dhcpd Daemon                  | dhcpd         | dhcpd daemon<br>0 = Process is Running<br>1 = Process is Dead          |        1 |        0 |
|  29 |       0 | .1.3.6.1.4.1.2021.2.1.100.23 | netdump-server Daemon         | netdump-serve | netdump-server daemon<br>0 = Process is Running<br>1 = Process is Dead |        1 |        0 |
|  30 |       0 | .1.3.6.1.4.1.2021.2.1.100.24 | sendmail Daemon               | sendmail      | sendmail daemon<br>0 = Process is Running<br>1 = Process is Dead       |        1 |        0 |
|  31 |       0 | .1.3.6.1.4.1.2021.2.1.100.25 | nsrexecd Daemon               | nsrexecd      | nsrexecd daemon<br>0 = Process is Running<br>1 = Process is Dead       |        1 |        0 |
|  32 |       0 | .1.3.6.1.4.1.2021.2.1.100.26 | xfs Daemon                    | xfs           | xfs daemon<br>0 = Process is Running<br>1 = Process is Dead            |        1 |        0 |
|  33 |       0 | .1.3.6.1.4.1.2021.10.1.3.1   | Load Average                  | load          | The Current Load Average                                               |        0 |        0 |
|  34 |       0 | .1.3.6.1.4.1.2021.4.4.0      | Available Swap Space          | swap          | The Current Amount of Unused Swap Space in Kilobytes                   |        0 |        0 |
|  35 |       1 | .1.3.6.1.4.1.2021.11.50.0    | User CPU Time                 | UCPU          | Percentage of CPU Time Consumed by User Processes Since the Last Poll  |        0 |        0 |
+-----+---------+------------------------------+-------------------------------+---------------+------------------------------------------------------------------------+----------+----------+
35 rows in set (0.00 sec)

  Column Descriptions:

  var - Simple unique integer for variable identification; Table index; Must be
        sequential, and must start at 1.

  counter - Some SNMP variables are calculated as counter data.  A "1" here
            indicates a counter variable.  Otherwise, a "0" is given.

  oid - The SNMP/IPMI/Symbol OID of the variable.  Also acts as an index for
        scripts run from the "mod" table.

        The IPMI OID is simply the index number returned from an ipmi-sensors
        command from the freeipmi package.  For example:

        # porterj /root > ipmi-sensors
        Caching SDR repository information: /root/.freeipmi/sdr-cache/sdr-cache-porterj.localhost
        Caching SDR record 59 of 59 (current record ID 59) 
        ID | Name             | Type        | Reading    | Units | Event
        3  | BB Inlet Temp    | Temperature | 26.00      | C     | 'OK'
        4  | SSB Temp         | Temperature | 43.00      | C     | 'OK'
        5  | BB BMC Temp      | Temperature | 33.00      | C     | 'OK'
        ...
        28 | BB +1.1V SB      | Voltage     | 1.08       | V     | 'OK'
        29 | BB P3_3V STBY    | Voltage     | 3.25       | V     | 'OK'
        30 | BB 1_1V PCH      | Voltage     | 1.08       | V     | 'OK'
        # porterj /root >

        The OID for "SSB Temp" is 4, and for "BB +1.1V SB", it's 28.
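
        When many sensors are involved, the name-to-ID mapping can be pulled
        out of the command output with a short script.  The sketch below is
        not part of skummee; sensor_ids() is a hypothetical helper, and it
        assumes the "|"-separated layout shown in the sample output above.

```python
def sensor_ids(output):
    """Map sensor names to record IDs from ipmi-sensors style output."""
    ids = {}
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Data rows have several "|"-separated fields and a numeric first field;
        # the header row and the "Caching SDR" lines are skipped.
        if len(fields) >= 2 and fields[0].isdigit():
            ids[fields[1]] = int(fields[0])
    return ids


sample = """\
ID | Name             | Type        | Reading    | Units | Event
3  | BB Inlet Temp    | Temperature | 26.00      | C     | 'OK'
4  | SSB Temp         | Temperature | 43.00      | C     | 'OK'
28 | BB +1.1V SB      | Voltage     | 1.08       | V     | 'OK'
"""

ids = sensor_ids(sample)
print(ids["SSB Temp"], ids["BB +1.1V SB"])
```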

        The NetApp Symbol API OID has the format:

        <deviceStat>@<TrayID>@<Location>

        If a device/stat does not have a TrayID, 0 should be used, and Location maps to the device number.
        For example, to monitor the status of the disk located in Tray 45, Location 38:

        drivestatus@45@38

        Or the status of Controller 1:

        controller@0@1

        skummee only supports a subset of the stats provided by the Symbol SDK, which is listed here:

        controller
        volumestatus
        drivestatus
        drivetemp
        drivepfa
        esm
        ups
        minihub
        gbic
        sfp
        interconnectCRU
        alarm
        processorMemoryDimm
        fan
        battery
        powerSupply
        thermalSensor
        drawer
        cacheMemoryDimm
        hostBoard
        volumeRCacheActive
        volumeWCacheActive
        volumeRCacheEnable
        volumeWCacheEnable
        volumeCacheMirrorActive
        volumeCacheMirrorEnable
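
        Before a Symbol OID is inserted into the variables table, it can be
        checked against the format and the supported list above.  The sketch
        below is an illustration only: parse_symbol_oid() is a hypothetical
        helper, and both the case-sensitive match and the integer TrayID and
        Location are assumptions.

```python
# Stat names copied from the supported list above.
SUPPORTED_STATS = {
    "controller", "volumestatus", "drivestatus", "drivetemp", "drivepfa",
    "esm", "ups", "minihub", "gbic", "sfp", "interconnectCRU", "alarm",
    "processorMemoryDimm", "fan", "battery", "powerSupply", "thermalSensor",
    "drawer", "cacheMemoryDimm", "hostBoard", "volumeRCacheActive",
    "volumeWCacheActive", "volumeRCacheEnable", "volumeWCacheEnable",
    "volumeCacheMirrorActive", "volumeCacheMirrorEnable",
}


def parse_symbol_oid(oid):
    """Split '<deviceStat>@<TrayID>@<Location>' into its three parts."""
    parts = oid.split("@")
    if len(parts) != 3:
        raise ValueError(f"expected <deviceStat>@<TrayID>@<Location>, got {oid!r}")
    stat, tray, location = parts
    if stat not in SUPPORTED_STATS:
        raise ValueError(f"unsupported stat: {stat!r}")
    # Assumption: TrayID and Location are always plain integers.
    return stat, int(tray), int(location)


print(parse_symbol_oid("drivestatus@45@38"))
print(parse_symbol_oid("controller@0@1"))
```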

  des - Variable description.  Not necessary for skummee to run, but provides a
        simple description for the user interface.

  alias - Abbreviation of the variable description.  Not necessary for skummee to run, but provides a
        very short description for the user interface.

  blurb - Verbose variable description.  Not necessary for skummee to run, but
          provides a verbose description for the user interface.

  discrete - For supplied user interfaces only.  A value of 1 indicates that the
         variable has 2 or more discrete values, to be indicated as Normal, Warning,
         or Critical. Otherwise, a value of 0 is used.

  var_type - Variable Type.  There are currently 6 types supported:
             0 = SNMP
             1 = IPMI Reading
             2 = INTERNAL
             3 = Reserved
             4 = IPMI State
             5 = IPMI System Event Log
             6 = NetApp Symbol API

         For the 3 different IPMI variable types:  Type 1, "IPMI Reading", is used for IPMI variables that
         return a gauge type, such as a temperature.  Type 4, "IPMI State", is used for IPMI variables that
         return a discrete "Good" or "Bad" value.  Type 5, "IPMI System Event Log", will return a count of
         IPMI SEL entries for a specified SEL entry type.
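
  For scripts that populate or inspect the variables table, the var_type
  codes above can be mirrored in a small enumeration.  This is a sketch, not
  code from skummee: the member names and the is_ipmi() helper are
  hypothetical, and only the numeric values come from the list above.

```python
from enum import IntEnum


class VarType(IntEnum):
    """var_type codes from the variables table (member names are illustrative)."""
    SNMP = 0
    IPMI_READING = 1
    INTERNAL = 2
    RESERVED = 3
    IPMI_STATE = 4
    IPMI_SEL = 5
    NETAPP_SYMBOL = 6


IPMI_TYPES = {VarType.IPMI_READING, VarType.IPMI_STATE, VarType.IPMI_SEL}


def is_ipmi(var_type):
    """True when a var_type value is one of the three IPMI types."""
    return VarType(var_type) in IPMI_TYPES


print(is_ipmi(4), is_ipmi(0))
```

  VarType() raises ValueError for any code outside 0 to 6, which catches
  mistyped values before they reach the database.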

  And here's a sample "node" table:

mysql> describe atlas_n;
+---------------+----------------------+------+-----+-----------+-------+
| Field         | Type                 | Null | Key | Default   | Extra |
+---------------+----------------------+------+-----+-----------+-------+
| host          | smallint(5) unsigned |      | PRI | 0         |       |
| status0       | tinyint(3) unsigned  |      |     | 0         |       |
| address       | varchar(16)          |      |     | 127.0.0.1 |       |
| name          | varchar(128)         |      | UNI | hostname  |       |
| community     | varchar(32)          |      |     | public    |       |
| status1       | tinyint(3) unsigned  |      |     | 0         |       |
| crit          | tinyint(3) unsigned  |      |     | 0         |       |
| port          | smallint(5) unsigned |      |     | 161       |       |
| version       | tinyint(3) unsigned  |      |     | 2         |       |
| ign           | tinyint(3) unsigned  |      |     | 0         |       |
| retries       | tinyint(3) unsigned  |      |     | 0         |       |
| ipmi_address  | varchar(16)          |      |     | 127.0.0.1 |       |
| host_user     | varchar(20)          |      |     | user      |       |
| host_password | varchar(20)          |      |     | password  |       |
| max_snmp      | smallint(5) unsigned |      |     | 65535     |       |
| timeout       | smallint(5) unsigned |      |     | 10        |       |
+---------------+----------------------+------+-----+-----------+-------+
16 rows in set (0.00 sec)

SQL create command:

CREATE TABLE `atlas_n` (
  `host` smallint(5) unsigned NOT NULL default '0',
  `status0` tinyint(3) unsigned NOT NULL default '2',
  `address` varchar(16) NOT NULL default '127.0.0.1',
  `name` varchar(128) NOT NULL default 'hostname',
  `community` varchar(32) NOT NULL default 'public',
  `status1` tinyint(3) unsigned NOT NULL default '2',
  `crit` tinyint(3) unsigned NOT NULL default '0',
  `port` smallint(5) unsigned NOT NULL default '161',
  `version` tinyint(3) unsigned NOT NULL default '2',
  `ign` tinyint(3) unsigned NOT NULL default '0',
  `retries` tinyint(3) unsigned NOT NULL default '0',
  `ipmi_address` varchar(16) NOT NULL default '127.0.0.1',
  `host_user` varchar(20) NOT NULL default 'user',
  `host_password` varchar(20) NOT NULL default 'password',
  `max_snmp` smallint(5) unsigned NOT NULL default '65535',
  `timeout` smallint(5) unsigned NOT NULL default '10',
  PRIMARY KEY  (`host`),
  UNIQUE KEY `name_index` (`name`)
);
 
Sample data:

mysql> select * from igs_n where host < 10 order by 1;
+------+---------+--------------+------+-----------+---------+------+------+---------+-----+---------+--------------+-----------+---------------+----------+---------+
| host | status0 | address      | name | community | status1 | crit | port | version | ign | retries | ipmi_address | host_user | host_password | max_snmp | timeout |
+------+---------+--------------+------+-----------+---------+------+------+---------+-----+---------+--------------+-----------+---------------+----------+---------+
|    1 |       2 | 192.168.64.1 | igs1 | public    |       2 |    1 |  161 |       2 |   0 |       1 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    2 |       2 | 192.168.64.2 | igs2 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    3 |       2 | 192.168.64.3 | igs3 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    4 |       2 | 192.168.64.4 | igs4 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    5 |       2 | 192.168.64.5 | igs5 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    6 |       3 | 192.168.64.6 | igs6 | public    |       3 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    7 |       5 | 192.168.64.7 | igs7 | public    |       5 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    8 |       2 | 192.168.64.8 | igs8 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
|    9 |       2 | 192.168.64.9 | igs9 | public    |       2 |    0 |  161 |       2 |   0 |       0 | 127.0.0.1    | user      | password      |    65535 |      10 |
+------+---------+--------------+------+-----------+---------+------+------+---------+-----+---------+--------------+-----------+---------------+----------+---------+
9 rows in set (0.00 sec)
  
  Column Descriptions:

  host - Simple unique integer for node identification; Table index

  status0 - Current overall status of the host, which is the greatest status of all
            this host's variables.  Alternates with status1 to provide a 1 poll history.

  address - The IPv4 address of the host.

  name - The node's name.  Usually its hostname, but this is not necessary.  The name
         here is only used for display purposes, and is not used for polling.

  community - The community string used to access this node's SNMP variables.

  status1 - Current overall status of the host, which is the greatest status of all
            this host's variables.  Alternates with status0 to provide a 1 poll history.

  crit - Denotes the importance of a node. Binary value: "1" = critical node.
         "0" = not critical.

  port - The port number used by the SNMP server running on the node.

  version - The SNMP version number to use for the SNMP request.  Must be 1 or 2.

  ign - The ignored state.
        A value of 1 indicates that the interface should not include this node.
        A value of 0 indicates that the node should be displayed normally.

  retries - The number of times skummee should retry a network request, be it SNMP or IPMI.

  ipmi_address - The IPMI IPv4 out-of-band address for the node.

  host_user - If the node has IPMI variables, this is the ipmi user.

  host_password -  If the node has IPMI variables, this is the ipmi password.
  
  max_snmp - The maximum number of SNMP variables to bundle per SNMP request.  This should not normally
             be changed from the default of 65535, unless the remote SNMP agent cannot handle large
             SNMP message sizes.

  timeout - The amount of time in seconds before a network request times out for this node.
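
  The effect of max_snmp can be pictured as splitting a node's OID list into
  groups of at most max_snmp entries, one group per SNMP request.  The sketch
  below only illustrates that idea; bundle_oids() is a hypothetical helper,
  and how skummee itself builds its requests is not shown here.

```python
def bundle_oids(oids, max_snmp=65535):
    """Split a list of OIDs into request-sized groups of at most max_snmp."""
    if max_snmp < 1:
        raise ValueError("max_snmp must be at least 1")
    return [oids[i:i + max_snmp] for i in range(0, len(oids), max_snmp)]


# The 26 daemon-status OIDs from the sample variables table.
oids = [f".1.3.6.1.4.1.2021.2.1.100.{n}" for n in range(1, 27)]
print([len(group) for group in bundle_oids(oids, max_snmp=10)])
```

  With the default of 65535, every variable of a node fits into one group.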
        

  "node/var" mapping table:

mysql> describe atlas_v;
+---------------+----------------------+------+-----+---------+-------+
| Field         | Type                 | Null | Key | Default | Extra |
+---------------+----------------------+------+-----+---------+-------+
| host          | smallint(5) unsigned |      | PRI | 0       |       |
| var           | smallint(5) unsigned |      | PRI | 0       |       |
| status0       | tinyint(3) unsigned  |      |     | 2       |       |
| status1       | tinyint(3) unsigned  |      |     | 2       |       |
| internal_poll | double               |      |     | -22222  |       |
| value0        | double               |      |     | -22222  |       |
| value1        | double               |      |     | -22222  |       |
+---------------+----------------------+------+-----+---------+-------+
7 rows in set (0.00 sec)

SQL Create command:

CREATE TABLE `atlas_v` (
  `host` smallint(5) unsigned NOT NULL default '0',
  `var` smallint(5) unsigned NOT NULL default '0',
  `status0` tinyint(3) unsigned NOT NULL default '2',
  `status1` tinyint(3) unsigned NOT NULL default '2',
  `internal_poll` double NOT NULL default '-22222',
  `value0` double NOT NULL default '-22222',
  `value1` double NOT NULL default '-22222',
  PRIMARY KEY  (`host`,`var`)
);
 
mysql> select * from porter_v order by 1,2;
+------+-----+---------+---------+---------------+-------------+-------------+
| host | var | status0 | status1 | internal_poll | value0      | value1      |
+------+-----+---------+---------+---------------+-------------+-------------+
|    1 |   1 |       0 |       0 |        -22222 |           0 |           0 |
|    1 |   2 |       0 |       0 |        -22222 |           0 |           0 |
|    1 |   3 |       0 |       0 |        -22222 |           0 |           0 |
|    1 |   4 |       0 |       0 |        -22222 |           0 |           0 |
|    1 |   5 |       0 |       0 |        -22222 |        3.08 |        4.04 |
|    1 |   6 |       0 |       0 |        -22222 |       28885 |       28901 |
|    1 |   7 |       0 |       0 |        -22222 |           3 |           3 |

...

| 1080 | 376 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 377 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 378 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 379 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 380 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 381 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 382 |       0 |       0 |        -22222 |           0 |           0 |
| 1080 | 383 |       0 |       0 |        -22222 |           0 |           0 |
+------+-----+---------+---------+---------------+-------------+-------------+
19520 rows in set (0.03 sec)

mysql>

  Column Descriptions:

  host - Integer identifying the host from the "nodes" table; One of the two table indices

  var - Integer identifying the variable from the "variables" table; One of the two table indices

  status0 - Current variable status. Alternates with status1 to provide a 1 poll history.

  status1 - Current variable status. Alternates with status0 to provide a 1 poll history.

  internal_poll - This is where values provided by external scripts are stored.  skummee will
                  reset the values to NaN after each poll to ensure the data is current.

  value0 - Current variable value.  Alternates with value1 to provide a 1 poll history.

  value1 - Current variable value.  Alternates with value0 to provide a 1 poll history.
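
  The per-variable statuses in this table feed the overall host status in the
  nodes table, which is described there as the greatest status of all of a
  host's variables.  A minimal sketch of that reduction follows.  The
  host_statuses() helper and the sample rows are made up for illustration, and
  the sketch takes the current status directly because which of status0 and
  status1 holds the latest poll is not specified here.

```python
def host_statuses(rows):
    """Reduce (host, var, status) rows to the greatest status per host."""
    overall = {}
    for host, _var, status in rows:
        overall[host] = max(status, overall.get(host, 0))
    return overall


# Hypothetical rows: host 1 has one warning and one critical variable.
rows = [
    (1, 1, 0),
    (1, 2, 100),
    (1, 7, 200),
    (2, 1, 0),
    (2, 2, 0),
]
print(host_statuses(rows))
```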


  Sample "Thresh" table:

mysql> describe atlas_th;
+-------+----------------------+------+-----+---------+-------+
| Field | Type                 | Null | Key | Default | Extra |
+-------+----------------------+------+-----+---------+-------+
| var   | smallint(5) unsigned | NO   | MUL | 0       |       | 
| low   | double               | YES  |     | NULL    |       | 
| high  | double               | YES  |     | NULL    |       | 
| state | tinyint(3) unsigned  | NO   |     | 2       |       | 
+-------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> select * from atlas_th where var < 36 order by 1,4 desc;
+-----+------+---------+-------+
| var | low  | high    | state |
+-----+------+---------+-------+
|   1 |   98 |    NULL |   200 |
|   1 |   95 |      98 |   100 |
|   2 |   98 |    NULL |   200 |
|   2 |   95 |      98 |   100 |
|   3 |   98 |    NULL |   200 |
|   3 |   95 |      98 |   100 |
|   4 |   98 |    NULL |   200 |
|   4 |   95 |      98 |   100 |
|   5 |   98 |    NULL |   200 |
|   5 |   95 |      98 |   100 |
|   6 |   98 |    NULL |   200 |
|   6 |   95 |      98 |   100 |
|   7 |    1 |    NULL |   200 |
|   8 |    1 |    NULL |   200 |
|   9 |    1 |    NULL |   200 |
|  10 |    1 |    NULL |   200 |
|  11 |    1 |    NULL |   200 |
|  12 |    1 |    NULL |   200 |
|  13 |    1 |    NULL |   200 |
|  14 |    1 |    NULL |   200 |
|  15 |    1 |    NULL |   100 |
|  16 |    1 |    NULL |   200 |
|  17 |    1 |    NULL |   200 |
|  18 |    1 |    NULL |   200 |
|  19 |    1 |    NULL |   200 |
|  20 |    1 |    NULL |   200 |
|  21 |    1 |    NULL |   200 |
|  22 |    1 |    NULL |   200 |
|  23 |    1 |    NULL |   200 |
|  24 |    1 |    NULL |   200 |
|  25 |    1 |    NULL |   200 |
|  26 |    1 |    NULL |   200 |
|  27 |    1 |    NULL |   200 |
|  28 |    1 |    NULL |   200 |
|  29 |    1 |    NULL |   200 |
|  30 |    1 |    NULL |   200 |
|  31 |    1 |    NULL |   200 |
|  32 |    1 |    NULL |   200 |
|  33 |   32 |    NULL |   100 |
|  34 | NULL | 1000000 |   100 |
+-----+------+---------+-------+
40 rows in set (0.00 sec)

SQL Create command:

CREATE TABLE `atlas_th` (
  `var` smallint(5) unsigned NOT NULL default '0',
  `low` double default NULL,
  `high` double default NULL,
  `state` tinyint(3) unsigned NOT NULL default '2',
  KEY `ind` (`var`,`low`)
);

  Column Descriptions:

  var - Integer identifying the variable from the "variables" table; One of the three table indices.

  low - The lower bound to use for this particular threshold; One of the three table indices.
        A value of NULL indicates that there is no lower bound (the lower bound is negatively
        infinite).

  high - The upper bound to use for this particular threshold. One of the three table indices.
         A value of NULL indicates that there is no upper bound (the upper bound is infinite).

  state - The value to assign to the status of a variable if it is determined that the variable's
          current value is between the low and high values.  State may be 1 of 254 possible values,
          from 1 to 254 (0 and 255 are reserved for the Nominal and Timeout states, respectively).
          The higher the state value, the more serious the condition.
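
  The way threshold rows map a polled value to a state can be sketched as
  follows.  This is an illustration under stated assumptions, not skummee's
  own code: evaluate() is a hypothetical helper, the low bound is treated as
  inclusive and the high bound as exclusive, and the greatest state wins when
  several rows match.

```python
NOMINAL = 0    # reserved state: no threshold matched
TIMEOUT = 255  # reserved state: listed for completeness, not produced here


def evaluate(value, thresholds):
    """Return the greatest state whose [low, high) range contains value.

    thresholds is a list of (low, high, state) rows for one variable;
    None stands for SQL NULL, meaning that bound is open.
    """
    state = NOMINAL
    for low, high, row_state in thresholds:
        above_low = low is None or value >= low      # assumption: inclusive
        below_high = high is None or value < high    # assumption: exclusive
        if above_low and below_high:
            state = max(state, row_state)
    return state


# Rows for var 1 ("% of / partition used") from the sample data above.
root_fs = [(98, None, 200), (95, 98, 100)]
print(evaluate(50, root_fs), evaluate(96, root_fs), evaluate(99, root_fs))
```

  With the sample rows, a root filesystem at 96% maps to state 100 and one at
  99% maps to state 200, while anything below 95% stays Nominal.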

  Sample "Meta" table:

mysql> describe meta;
+-----------------+----------------------+------+-----+----------------+-------+
| Field           | Type                 | Null | Key | Default        | Extra |
+-----------------+----------------------+------+-----+----------------+-------+
| name            | varchar(32)          | NO   | PRI |                |       |
| display         | varchar(32)          | NO   |     |                |       |
| rrd_path        | varchar(64)          | NO   |     | /tmp           |       |
| stale           | int(10) unsigned     | NO   |     | 0              |       |
| stat            | tinyint(3) unsigned  | NO   |     | 0              |       |
| priority        | smallint(5) unsigned | NO   |     | 0              |       |
| pid             | int(10) unsigned     | NO   |     | 0              |       |
| hidden          | tinyint(3) unsigned  | NO   |     | 0              |       |
| crit0           | smallint(5) unsigned | NO   |     | 0              |       |
| crit1           | smallint(5) unsigned | NO   |     | 0              |       |
| noncrit0        | smallint(5) unsigned | NO   |     | 0              |       |
| noncrit1        | smallint(5) unsigned | NO   |     | 0              |       |
| nom0            | smallint(5) unsigned | NO   |     | 0              |       |
| nom1            | smallint(5) unsigned | NO   |     | 0              |       |
| critmax         | tinyint(3) unsigned  | NO   |     | 0              |       |
| noncritmax      | tinyint(3) unsigned  | NO   |     | 0              |       |
+-----------------+----------------------+------+-----+----------------+-------+
16 rows in set (0.00 sec)


mysql> select * from meta;
+------+---------+-----------------------+------------+------+----------+-----+--------+-------+-------+----------+----------+------+------+---------+------------+
| name | display | rrd_path              | stale      | stat | priority | pid | hidden | crit0 | crit1 | noncrit0 | noncrit1 | nom0 | nom1 | critmax | noncritmax |
+------+---------+-----------------------+------------+------+----------+-----+--------+-------+-------+----------+----------+------+------+---------+------------+
| hype | Hype    | /var/skummee/hype/rrd | 1233169218 |    0 |        0 |   0 |      0 |     1 |     1 |      133 |      133 |    7 |    7 |      50 |        255 |
| test | Test    | /var/skummee/test/rrd | 1219163418 |    0 |        0 |   0 |      0 |     2 |     2 |       50 |       50 |   60 |   60 |      50 |        100 |
+------+---------+-----------------------+------------+------+----------+-----+--------+-------+-------+----------+----------+------+------+---------+------------+
2 rows in set (0.00 sec)

SQL Create command:

CREATE TABLE `meta` (
  `name` varchar(32) NOT NULL default '',
  `display` varchar(32) NOT NULL default '',
  `rrd_path` varchar(64) NOT NULL default '/tmp',
  `stale` int(10) unsigned NOT NULL default '0',
  `stat` tinyint(3) unsigned NOT NULL default '0',
  `priority` smallint(5) unsigned NOT NULL default '0',
  `pid` int(10) unsigned NOT NULL default '0',
  `hidden` tinyint(3) unsigned NOT NULL default '0',
  `crit0` smallint(5) unsigned NOT NULL default '0',
  `crit1` smallint(5) unsigned NOT NULL default '0',
  `noncrit0` smallint(5) unsigned NOT NULL default '0',
  `noncrit1` smallint(5) unsigned NOT NULL default '0',
  `nom0` smallint(5) unsigned NOT NULL default '0',
  `nom1` smallint(5) unsigned NOT NULL default '0',
  `critmax` tinyint(3) unsigned NOT NULL default '0',
  `noncritmax` tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY  (`name`)
);

  Column Descriptions:

  name:  The machine name used by skummee to determine the table names (<name>_n, <name>_vl, etc.).

  display:  How the machine will be displayed for the user interface.

  rrd_path:  Path to the RRD files.

  stale:  The unix timestamp for when the cluster was last successfully polled.

  stat: The alternating (between 0 and 1) value that determines which stat column to use in the node and
        mapping tables.

  priority:  This is used to determine the machine display order for the user interface.
  
  pid:  The process ID of any previous skummee instance.  If non-zero, indicates that either a previous instance
        of skummee is still running, or that the last instance of skummee was terminated abnormally.

  hidden:  A value of 1 will hide the machine from the user interface, but the machine will continue to be
           polled normally.  A value of 0 is used for normal operation.

  crit0/crit1:  Set by skummee at the end of each poll to be the number of critical nodes that are not nominal.

  noncrit0/noncrit1:  Set by skummee at the end of each poll to be the number of non-critical nodes that are
                      not nominal.

  nom0/nom1:  Set by skummee at the end of each poll to be the number of nodes that are nominal.

  critmax:  Set by skummee at the end of each poll to be the status of the critical node(s) with the
            highest status.

  noncritmax:  Set by skummee at the end of each poll to be the status of the non-critical node(s) with the
            highest status.
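
  As a rough illustration of the end-of-poll summary columns, the Python sketch
  below derives the counts and maxima from a list of nodes. The Node tuple and
  its "critical" flag are hypothetical; how skummee actually marks a node as
  critical is not shown here.

```python
from collections import namedtuple

NOMINAL = 0

# Hypothetical in-memory view of a node: its status and whether it is critical.
Node = namedtuple("Node", ["status", "critical"])


def summarize(nodes):
    """Compute the crit/noncrit/nom counts and the per-class maximum status."""
    crit = [n.status for n in nodes if n.critical and n.status != NOMINAL]
    noncrit = [n.status for n in nodes if not n.critical and n.status != NOMINAL]
    nominal = sum(1 for n in nodes if n.status == NOMINAL)
    return {
        "crit": len(crit),
        "noncrit": len(noncrit),
        "nom": nominal,
        "critmax": max(crit, default=NOMINAL),
        "noncritmax": max(noncrit, default=NOMINAL),
    }
```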


  Sample "mod" table:

mysql> describe atlas_mod;
+-----------+------------------+------+-----+---------+-------+
| Field     | Type             | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| command   | varchar(255)     |      | PRI |         |       |
| arguments | varchar(255)     |      | PRI |         |       |
| timeout   | int(10) unsigned |      |     | 10      |       |
+-----------+------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> select * from atlas_mod;
+-----------------------------------+-----------+---------+
| command                           | arguments | timeout |
+-----------------------------------+-----------+---------+
| /usr/bin/slurm_failures           | -c 7      |       3 |
| /usr/bin/mpi_failures             |           |       5 |
| /tmp/hello_world                  | --leak=no |      10 |
+-----------------------------------+-----------+---------+
3 rows in set (0.00 sec)

  SQL Create command:

CREATE TABLE `atlas_mod` (
  `command` varchar(255) NOT NULL default '',
  `arguments` varchar(255) NOT NULL default '',
  `timeout` int(10) unsigned NOT NULL default '10',
  PRIMARY KEY  (`command`,`arguments`)
);

  The "mod" table provides a method for script based data gathering.  The output of the script
  must give exactly one value per line, and have the following format:

  nodename1,oid1,value1
  nodename2,oid2,value2
  ...

  Where the node names correspond to the "name" column from the node table, and the OIDs correspond
  to the "oid" column from the variable list table.  When skummee runs, it executes the scripts
  in the "mod" table and stores the output in the "internal_poll" column of the node/var mapping
  table.  It then reads the value when normal polling for the node is performed, after which the
  "internal_poll" column is set back to NaN.
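
  A minimal Python sketch of reading the script output format above. It assumes
  values are numeric and that malformed lines are silently skipped; skummee's
  real handling of bad lines may differ.

```python
def parse_mod_output(text):
    """Parse 'nodename,oid,value' lines into (node, oid, value) tuples."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # assumption: skip blank or malformed lines
        node, oid, raw = fields
        try:
            value = float(raw)
        except ValueError:
            continue  # assumption: skip non-numeric values
        records.append((node, oid, value))
    return records
```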

  Column Descriptions:

  command:  Path to executable.

  arguments:  Arguments for the executable in the "command" column.

  timeout:  The amount of time in seconds given for the command to run.  If the timeout value is exceeded,
            the process associated with the command is sent a kill signal.
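
  The timeout behavior can be approximated in Python with subprocess.run, which
  kills the child process when the timeout expires. This is a sketch of the
  idea only; the function name run_mod and the choice to return None on timeout
  are assumptions, not skummee's behavior.

```python
import shlex
import subprocess


def run_mod(command, arguments, timeout):
    """Run one "mod" table entry; return its stdout, or None on timeout."""
    argv = [command] + shlex.split(arguments)
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return None
    return result.stdout
```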

 
  Sample "analysis" Table:

mysql> describe atlas_an;
+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| host      | smallint(5) unsigned |      | PRI | 0       |       |
| epochtime | int(10) unsigned     |      | PRI | 0       |       |
| status    | tinyint(3) unsigned  |      |     | 0       |       |
| var_list  | varchar(254)         | YES  |     | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

mysql> select * from atlas_an limit 5;
+------+------------+--------+-------------------------------------------------------------------------------------+
| host | epochtime  | status | var_list                                                                            |
+------+------------+--------+-------------------------------------------------------------------------------------+
|    0 | 1176911100 |      5 | 43                                                                                  |
|   29 | 1176911100 |      3 | 209                                                                                 |
|   32 | 1176911100 |      5 | 43                                                                                  |
|  123 | 1176911100 |      7 | 206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,229 |
|  128 | 1176911100 |      3 | 209                                                                                 |
+------+------------+--------+-------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_an` (
  `host` smallint(5) unsigned NOT NULL default '0',
  `epochtime` int(10) unsigned NOT NULL default '0',
  `status` tinyint(3) unsigned NOT NULL default '0',
  `var_list` varchar(254) default NULL,
  PRIMARY KEY  (`epochtime`,`host`)
);

  The analysis table provides a snapshot, for each poll, of all the nodes that have a status other than "NOMINAL",
  along with all the variables that are not NOMINAL.  Data in the analysis table that is older than 1000000 seconds
  (about 11.6 days) is trimmed to prevent the table from growing excessively over time.
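
  The trimming step amounts to deleting rows whose epochtime is older than the
  retention window. The Python sketch below uses the standard sqlite3 module as
  a stand-in for MySQL so that it runs without a server; the function name and
  the exact comparison are assumptions.

```python
import sqlite3

MAX_AGE_SECONDS = 1000000  # retention window quoted in the text


def trim_analysis(conn, table, now):
    """Delete analysis rows older than the retention window; return the count."""
    cutoff = now - MAX_AGE_SECONDS
    # The table name is interpolated, so it must come from trusted input.
    cur = conn.execute(f"DELETE FROM {table} WHERE epochtime < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```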


  Column Descriptions:

  host:  The host number, as indexed by the host column in the node table.
 
  epochtime:  The unix timestamp of the polling time.

  status:  The highest status of all the variables for the host at polling time.

  var_list:  A comma separated list of variables for the host that have a status other than NOMINAL at polling time.


  Sample "lookup" table:

mysql> describe atlas_lu;
+--------------+-------------+------+-----+---------+-------+
| Field        | Type        | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| value_string | varchar(64) |      | PRI |         |       |
| value_int    | int(11)     |      |     | 0       |       |
+--------------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> select * from atlas_lu limit 5;
+--------------------------------+-----------+
| value_string                   | value_int |
+--------------------------------+-----------+
| FALSE                          |         0 |
| TRUE                           |        16 |
| SFP transmitter OK             |        16 |
| SFP transmitter fault detected |         0 |
| SFP signal OK                  |        16 |
+--------------------------------+-----------+
5 rows in set (0.00 sec)

  SQL Create Command:

CREATE TABLE `atlas_lu` (
  `value_string` varchar(64) NOT NULL default '',
  `value_int` int(11) NOT NULL default '0',
  PRIMARY KEY  (`value_string`)
);


  Since SNMP values are frequently strings, and skummee only works with
  numbers, it is necessary to convert strings to numbers for effective 
  SNMP monitoring.  Currently, the lookup table is only used for values
  returned from SNMP variables.

  Column Descriptions:

  value_string:  String value expected from an SNMP query, with leading and trailing
                 whitespace removed.

  value_int:  Integer value to use for the specified string for threshold checking
              and history.
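
  The string-to-number conversion can be pictured with the Python sketch below,
  using the sample rows shown above. Checking the lookup table before trying a
  numeric parse, and returning NaN for unknown strings, are assumptions made
  for the example.

```python
# Sample rows from the lookup table: value_string -> value_int.
LOOKUP = {
    "FALSE": 0,
    "TRUE": 16,
    "SFP transmitter OK": 16,
    "SFP transmitter fault detected": 0,
    "SFP signal OK": 16,
}


def snmp_to_number(raw, lookup=LOOKUP):
    """Convert a raw SNMP value to a number, using the lookup table for strings."""
    text = raw.strip()  # leading and trailing whitespace is removed first
    if text in lookup:
        return float(lookup[text])
    try:
        return float(text)
    except ValueError:
        return float("nan")  # assumption: unknown strings become NaN
```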


A Sample Email addresses Table:

mysql> describe atlas_ea;
+---------------+----------------------+------+-----+--------------------+-------+
| Field         | Type                 | Null | Key | Default            | Extra |
+---------------+----------------------+------+-----+--------------------+-------+
| email_address | varchar(255)         |      | PRI | nobody@example.com |       |
| policy_number | smallint(5) unsigned |      | PRI | 0                  |       |
+---------------+----------------------+------+-----+--------------------+-------+
2 rows in set (0.00 sec)

mysql> select * from atlas_ea;
+--------------------+---------------+
| email_address      | policy_number |
+--------------------+---------------+
| admin1@example.com |             0 |
| admin2@example.com |             0 |
| admin3@example.com |             0 |
| admin3@example.com |             1 |
| god@example.com    |             1 |
+--------------------+---------------+
5 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_ea` (
  `email_address` varchar(255) NOT NULL default 'nobody@example.com',
  `policy_number` smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY  (`policy_number`,`email_address`)
);

  Column Descriptions:

  email_address:  The email address of the user.

  policy_number:  The policy number to use as referenced from the email policy table.


A Sample Email policy table:

mysql> describe atlas_ep;
+-------------------+----------------------+------+-----+---------+-------+
| Field             | Type                 | Null | Key | Default | Extra |
+-------------------+----------------------+------+-----+---------+-------+
| policy_number     | smallint(5) unsigned |      | PRI | 0       |       |
| frequency         | int(10) unsigned     |      |     | 86400   |       |
| last_mailed       | int(10) unsigned     |      |     | 0       |       |
| do_count          | tinyint(3) unsigned  |      |     | 0       |       |
| count_non_crit    | tinyint(3) unsigned  |      |     | 0       |       |
| count_threshold   | int(10) unsigned     |      |     | 0       |       |
| count_status_low  | tinyint(3) unsigned  |      |     | 0       |       |
| count_status_high | tinyint(3) unsigned  |      |     | 0       |       |
| do_percentage     | tinyint(3) unsigned  |      |     | 0       |       |
| per_non_crit      | tinyint(3) unsigned  |      |     | 0       |       |
| per_threshold     | tinyint(3) unsigned  |      |     | 0       |       |
| per_status_low    | tinyint(3) unsigned  |      |     | 0       |       |
| per_status_high   | tinyint(3) unsigned  |      |     | 0       |       |
| do_run_failure    | tinyint(3) unsigned  |      |     | 0       |       |
+-------------------+----------------------+------+-----+---------+-------+
14 rows in set (0.00 sec)

mysql> select * from atlas_ep;
+---------------+-----------+-------------+----------+----------------+-----------------+------------------+-------------------+---------------+--------------+---------------+----------------+-----------------+----------------+
| policy_number | frequency | last_mailed | do_count | count_non_crit | count_threshold | count_status_low | count_status_high | do_percentage | per_non_crit | per_threshold | per_status_low | per_status_high | do_run_failure |
+---------------+-----------+-------------+----------+----------------+-----------------+------------------+-------------------+---------------+--------------+---------------+----------------+-----------------+----------------+
|             0 |     86400 |           0 |        1 |              0 |               2 |              200 |               255 |             1 |            1 |            35 |            100 |             255 |              0 |
|             1 |      3600 |           0 |        0 |              0 |               0 |                0 |                 0 |             0 |            0 |             0 |              0 |               0 |              1 |
+---------------+-----------+-------------+----------+----------------+-----------------+------------------+-------------------+---------------+--------------+---------------+----------------+-----------------+----------------+
2 rows in set (0.00 sec)

SQL Create Command:

CREATE TABLE `atlas_ep` (
  `policy_number` smallint(5) unsigned NOT NULL default '0',
  `frequency` int(10) unsigned NOT NULL default '86400',
  `last_mailed` int(10) unsigned NOT NULL default '0',
  `do_count` tinyint(3) unsigned NOT NULL default '0',
  `count_non_crit` tinyint(3) unsigned NOT NULL default '0',
  `count_threshold` int(10) unsigned NOT NULL default '0',
  `count_status_low` tinyint(3) unsigned NOT NULL default '0',
  `count_status_high` tinyint(3) unsigned NOT NULL default '0',
  `do_percentage` tinyint(3) unsigned NOT NULL default '0',
  `per_non_crit` tinyint(3) unsigned NOT NULL default '0',
  `per_threshold` tinyint(3) unsigned NOT NULL default '0',
  `per_status_low` tinyint(3) unsigned NOT NULL default '0',
  `per_status_high` tinyint(3) unsigned NOT NULL default '0',
  `do_run_failure` tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY  (`policy_number`)
);

  The email policy table describes the policies to use for automatic email notification
  in response to events caused by thresholds being crossed.

  Column Descriptions:

  policy_number:  The table index.

  frequency:  The smallest interval (in seconds) in which emails are sent when the policy
              is activated.

  last_mailed:  The unix epoch time representing the last mailing for this policy.  Used
                with the frequency value to determine if an email should be sent based on
                time.

  do_count:  Include a count of nodes to determine if this policy is activated.

  count_non_crit:  Include non-critical nodes when counting.

  count_threshold:  The count of nodes should be equal to or greater than this number
                    for the policy to be activated.

  count_status_low/count_status_high:  The node's status must be between the count_status_low
                                       and count_status_high values (inclusive) in order to be
                                       counted.

  do_percentage:  Include a percentage of nodes to determine if this policy is activated.

  per_non_crit:  Include non-critical nodes when determining percentage.

  per_threshold:  The percentage of nodes should be equal to or greater than this number
                  for the policy to be activated.

  per_status_low/per_status_high:  The node's status must be between the per_status_low
                                   and per_status_high values (inclusive) in order to be
                                   included in the percentage.

  do_run_failure:  Activate this policy when skummee fails to run successfully.


Given the above examples, policy 0 would translate to:

1)  Only mail at 24 hour intervals when this policy is met (86400 seconds).
2)  Count nodes that have a status greater than or equal to 200 and less than or equal to 255.
3)  Don't include non-critical nodes when counting.
4)  If the count is greater than or equal to 2, and the last mailing occurred more than 86400
    seconds ago, mail all users in the atlas_ea table having policy 0.
5)  Count all nodes that have a status greater than or equal to 100 and less than or equal to
    255, including non-critical nodes, divide this by the number of all nodes, and multiply
    this by 100.  If this is greater than or equal to 35, mail all users in the atlas_ea table
    having policy 0.
6)  Don't consider run_failures with this policy.

And policy 1:

1)  Mail all users in the atlas_ea table that have policy 1 as their policy whenever skummee
    fails to run, but not more frequently than once every 3600 seconds (1 hour).
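
The policy rules above can be sketched in Python as follows. This is an
illustration under stated assumptions, not skummee's implementation: the count
and percentage checks are treated as independent triggers, the percentage is
taken over all nodes, and the Node tuple with its "critical" flag is
hypothetical.

```python
from collections import namedtuple

Policy = namedtuple("Policy", [
    "policy_number", "frequency", "last_mailed",
    "do_count", "count_non_crit", "count_threshold",
    "count_status_low", "count_status_high",
    "do_percentage", "per_non_crit", "per_threshold",
    "per_status_low", "per_status_high", "do_run_failure",
])
Node = namedtuple("Node", ["status", "critical"])  # hypothetical node view

# The two sample rows from the email policy table.
POLICY_0 = Policy(0, 86400, 0, 1, 0, 2, 200, 255, 1, 1, 35, 100, 255, 0)
POLICY_1 = Policy(1, 3600, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)


def count_matching(nodes, low, high, include_non_crit):
    """Count nodes whose status lies in [low, high], honoring the non-crit flag."""
    return sum(
        1 for n in nodes
        if low <= n.status <= high and (n.critical or include_non_crit)
    )


def should_mail(policy, nodes, now, run_failed=False):
    """Return True if the policy is activated and enough time has passed."""
    if now - policy.last_mailed <= policy.frequency:
        return False  # mailed too recently
    if policy.do_run_failure and run_failed:
        return True
    if policy.do_count:
        hits = count_matching(nodes, policy.count_status_low,
                              policy.count_status_high, policy.count_non_crit)
        if hits >= policy.count_threshold:
            return True
    if policy.do_percentage and nodes:
        hits = count_matching(nodes, policy.per_status_low,
                              policy.per_status_high, policy.per_non_crit)
        # Assumption: the denominator is the number of all nodes.
        if 100.0 * hits / len(nodes) >= policy.per_threshold:
            return True
    return False
```

When should_mail returns True, the recipients would be the rows of the email
addresses table whose policy_number matches the policy.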

----

  More on Thresholds

  Variable status is assigned based on threshold values from the threshold table.
  Node status is the maximum status of the node's variables, or the Timeout state if the node doesn't
  respond within the timeout period (currently set to 10 seconds in the above example).  It is therefore
  important to create threshold levels that increase numerically as the conditions increase in severity.
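
  In Python terms, the node status rule might look like the sketch below. The
  values 0 and 255 are the reserved Nominal and Timeout states from the
  threshold table description; the function itself is hypothetical.

```python
NOMINAL = 0
TIMEOUT = 255


def node_status(variable_states, responded=True):
    """Node status: Timeout if the node did not respond, else the worst variable."""
    if not responded:
        return TIMEOUT
    return max(variable_states, default=NOMINAL)
```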


GETTING STARTED

  skummee is designed to run out of cron every 5 minutes, based on the way that skummee inserts historical data
  into RRD files.  The RRD files contain 2 RRAs: one with a value every 5 minutes kept for 1 month, and one
  with a value every 4 hours kept for 1 year.  Running skummee at intervals other than 5 minutes will give
  unpredictable RRD results.
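
  To give a feel for the sizes involved, the Python arithmetic below works out
  how many rows each archive would hold. It assumes a 30-day month and a
  365-day year; the actual row counts skummee uses are not stated here.

```python
STEP = 5 * 60          # polling interval in seconds
COARSE = 4 * 60 * 60   # resolution of the long-term archive in seconds


def rra_rows(resolution_seconds, retention_days):
    """Number of rows needed to keep retention_days at the given resolution."""
    return retention_days * 86400 // resolution_seconds


steps_per_coarse_row = COARSE // STEP   # 5-minute samples per 4-hour row
fine_rows = rra_rows(STEP, 30)          # assumes a 30-day month
coarse_rows = rra_rows(COARSE, 365)     # assumes a 365-day year
```

  Under these assumptions, each 4-hour row consolidates 48 five-minute samples,
  the short-term archive holds 8640 rows, and the long-term archive 2190 rows.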

  Once the MySQL tables are in place and are populated, skummee is run with the following
  command line options:

  skummee -c [config file]

  Here is a sample config file, where lines beginning with # are comments:

  # Machine
  machine atlas

  # MySQL user for skummee
  user  skummee

  # MySQL password for skummee
  pass SomePassword
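
  The config format above is simple enough to read with a few lines of Python.
  This sketch assumes each setting is a keyword followed by whitespace and a
  value, which matches the sample but may not cover everything skummee accepts.

```python
def parse_config(text):
    """Parse 'keyword value' lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings
```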

  An example run would have:

  skummee -c /etc/skummee.conf

  It's best to run skummee a few times at the command line before insertion into cron, in order to shake out
  any configuration issues.

  The first time skummee is run, skummee builds the RRD files
  based on their configuration in the MySQL tables, with each node corresponding to 1 RRD file.  Each
  node/RRD file contains the exact number and names of the data sources found in the MySQL tables.
  In the event that nodes change the number or nature of their data sources (by the user making changes
  to the MySQL tables), skummee will detect an inconsistency between what it polls and the RRD file, and
  will make the necessary changes on the fly.
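
  Detecting such an inconsistency comes down to comparing two sets of data
  source names. The Python sketch below shows only that comparison; how skummee
  actually modifies the RRD file is not described here, and the function name
  is hypothetical.

```python
def plan_rrd_changes(expected, existing):
    """Return (to_add, to_remove) data source names, each sorted."""
    expected_set, existing_set = set(expected), set(existing)
    to_add = sorted(expected_set - existing_set)
    to_remove = sorted(existing_set - expected_set)
    return to_add, to_remove
```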