Easy Infrastructure Monitoring Wiki

Easily monitor network infrastructure services

Brought to you by: pvf

Understanding Alerts

Labels: configuration (2) alerts (1)

Alerts

Easy Infrastructure Monitoring distinguishes between four alert states:

Warning: usually configured via threshold values that are relevant to the user and the service. For example, generate a warning if CPU usage goes above 20%.
Error: usually configured via threshold values that are relevant to the user and the service. For example, generate an error if CPU usage goes above 70%.
Fault: usually entered when the service check cannot be performed at all (for example, no connection to the remote host, or a timeout while checking).
Recovery: indicates that there is no alert, but is entered only after one of the alert states above.
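
The rule that a recovery is reported only after an alert state can be sketched in JavaScript as follows. This is a minimal illustration, not the application's actual implementation: the function name resolveState, the 'ok' condition and the use of null for "nothing to report" are all assumptions made for this example.

```javascript
// Hypothetical sketch; names and the 'ok'/null conventions are assumptions.
const ALERT_STATES = ['warn', 'error', 'fault'];

// previous:  the last reported state, or null if nothing was reported yet
// condition: outcome of the current check: 'ok', 'warn', 'error' or 'fault'
function resolveState(previous, condition) {
  if (condition !== 'ok') {
    // warn, error and fault are reported as they occur
    return condition;
  }
  // A healthy check counts as a recovery only right after an alert state.
  return ALERT_STATES.includes(previous) ? 'recovery' : null;
}
```

With this sketch, a healthy check that follows an "error" yields "recovery", while a healthy check that follows another healthy check (or a recovery) yields nothing to report.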

For each service, an alert configuration object can be specified as follows. The example shows a maximal configuration; a real configuration will most likely not contain all of the options:

"alert":{
    "warn": {
            "interval":60,
            "action":"print('ALERT WARN ON SERVICE xxx');",
            "thresholdHigherThan":20,
            "thresholdLowerThan":0
    },
    "error": {
            "interval":60,
            "action":"print('ALERT ERROR ON SERVICE xxx');",
            "thresholdHigherThan":70,
            "thresholdLowerThan":0
    },
    "fault": {
            "interval":60,
            "action":"print('ALERT FAULT ON SERVICE xxx');"
    },
    "recovery": {
            "interval":60,
            "action":"print('ALERT RECOVERY ON SERVICE xxx');"
    }    
}

Each sub-section in the configuration corresponds to one of the alert states described above. The "warn" and "error" states accept threshold values that trigger the respective state. The "fault" and "recovery" states are entered automatically, so thresholds do not apply to them.

Usually, a working configuration will use either thresholdHigherThan or thresholdLowerThan, but not both. They are included in the example only to show the possible configuration options.
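
As an illustration of how the two threshold options might be evaluated, here is a minimal JavaScript sketch. It rests on several assumptions that the text above does not confirm: comparisons are strict, a threshold is checked only if it is present, and "error" takes precedence over "warn" when both match. The helper names crosses and classify are hypothetical.

```javascript
// Hypothetical helper: true when value crosses a threshold of one state.
// Assumes strict comparisons and that absent thresholds are ignored.
function crosses(value, stateConfig) {
  if (!stateConfig) return false;
  const { thresholdHigherThan, thresholdLowerThan } = stateConfig;
  if (thresholdHigherThan !== undefined && value > thresholdHigherThan) return true;
  if (thresholdLowerThan !== undefined && value < thresholdLowerThan) return true;
  return false;
}

// Returns 'error', 'warn' or null. Assumes error outranks warn.
function classify(value, alert) {
  if (crosses(value, alert.error)) return 'error';
  if (crosses(value, alert.warn)) return 'warn';
  return null;
}

// CPU usage example from the text: warn above 20%, error above 70%.
const cpuAlert = {
  warn: { thresholdHigherThan: 20 },
  error: { thresholdHigherThan: 70 }
};
```

Under these assumptions, classify(35, cpuAlert) falls in the warning range and classify(85, cpuAlert) in the error range. A check where low values are the problem (free disk space, for instance) would use thresholdLowerThan instead.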

Only the alert states that are needed require configuration. Alert states that are not used may be omitted from the file.
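
For instance, a service that only needs an error alert could carry a fragment like the one embedded below. The snippet is a hedged sketch: it parses a hypothetical fragment with the standard JSON.parse call and lists which states are configured; the variable names are chosen for this example only.

```javascript
// Hypothetical service fragment: only the "error" state is configured.
const fragment = JSON.parse(`{
  "alert": {
    "error": {
      "interval": 60,
      "action": "print('ALERT ERROR ON SERVICE xxx');",
      "thresholdHigherThan": 70
    }
  }
}`);

// The states present in the fragment; warn, fault and recovery are absent.
const configuredStates = Object.keys(fragment.alert);
```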

The interval option specifies the minimum time (in minutes) that must pass before the state's action can be triggered again. This protects against a service that is "flapping" (entering error and recovery states in quick succession).
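
The effect of the interval option can be pictured with the following JavaScript sketch. It is only an assumption about how such a check could work, not the application's code: the function name canTrigger, the millisecond timestamps and the use of null for "never triggered" are all hypothetical.

```javascript
const MS_PER_MINUTE = 60 * 1000;

// Hypothetical check: may the action of a state run again?
// lastTriggeredMs: time the action last ran (ms), or null if it never ran
// nowMs:           current time (ms)
// intervalMinutes: the "interval" value from the alert configuration
function canTrigger(lastTriggeredMs, nowMs, intervalMinutes) {
  if (lastTriggeredMs === null) return true; // never triggered before
  return nowMs - lastTriggeredMs >= intervalMinutes * MS_PER_MINUTE;
}
```

With "interval":60, a service that flaps into the error state again 30 minutes after the last error action would not run the action a second time under this sketch; after a full 60 minutes it would.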

The action option specifies a script that gets executed when the corresponding state is entered. This script is written in JavaScript. The [Scripts] section gives more information about scripting.

Host alerts can be defined in the same way, except that no threshold options are available.

An alert object for a host looks like this:

"alert":{
    "warn": {
            "interval":60,
            "action":"print('ALERT WARN ON HOST xxx');"
    },
    "error": {
            "interval":60,
            "action":"print('ALERT ERROR ON HOST xxx');"
    },
    "fault": {
            "interval":60,
            "action":"print('ALERT FAULT ON HOST xxx');"
    },
    "recovery": {
            "interval":60,
            "action":"print('ALERT RECOVERY ON HOST xxx');"
    }    
}

A similar configuration can be defined at the top level:

"alert":{
    "warn": {
            "interval":60,
            "action":"print('GENERAL ALERT WARN');"
    },
    "error": {
            "interval":60,
            "action":"print('GENERAL ALERT ERROR');"
    },
    "fault": {
            "interval":60,
            "action":"print('GENERAL ALERT FAULT');"
    },
    "recovery": {
            "interval":60,
            "action":"print('GENERAL ALERT RECOVERY');"
    }    
}

Wiki: Application configuration
Wiki: Scripts
