http://jira.hyperic.com/browse/HHQ-4248
The problem is intermittent, but because it affects the alert subsystem it is critical. The alert definition looks like this:
{noformat}
If Condition: Availability = 0.0%
Enable Action(s): Each time conditions are met.
Generate one alert and then disable alert definition until fixed
{noformat}
The recovery alert for the above primary alert definition looks like this:
{noformat}
If Condition: Availability = 100.0%
Recovery Alert: for Linux Down
Enable Action(s): Each time conditions are met.
{noformat}
See the attached screenshot for the sequence in which the alerts fired. Two primary alerts fired without a recovery alert between them. Below is the same data from SQL queries. Notice that the recovery alert (ID 10106) was created in the database before the second primary alert (ID 10107), even though the primary alert's ctime is earlier than the recovery alert's.
{noformat}
select FROM_UNIXTIME(r.startime/1000), FROM_UNIXTIME(r.endtime/1000), r.availval
from EAM_PLATFORM p, EAM_MEASUREMENT m, HQ_AVAIL_DATA_RLE r
where p.id=10100
and p.resource_id=m.resource_id
and m.template_id=10816
and r.measurement_id=m.id;
{noformat}
{noformat}
FROM_UNIXTIME(r.startime/1000)  FROM_UNIXTIME(r.endtime/1000)  AVAILVAL
2010-08-26 10:52:00.0           2010-08-26 13:52:00.0          1
2010-08-26 13:52:00.0           null                           0
{noformat}
{noformat}
select a.id, a.ctime, a.fixed, d.name, d.active, d.enabled, s.description
from EAM_ALERT a, EAM_ALERT_DEFINITION d, EAM_PLATFORM s
where a.alert_definition_id=d.id
and s.resource_id=d.resource_id
and s.id=10100
order by a.id;
{noformat}
{noformat}
ID     CTIME                 FIXED  NAME            ACTIVE  ENABLED  DESCRIPTION
10104  26-Aug-2010 13:52:00  1      Linux Down      1       0        CentOS 5.2
10106  26-Aug-2010 14:17:00  1      Linux Recovery  1       1        CentOS 5.2
10107  26-Aug-2010 13:55:00  0      Linux Down      1       0        CentOS 5.2
{noformat}
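To make the anomaly in the alert rows easier to spot, here is a minimal Python sketch that runs two checks over the three rows returned by the query above. It is only an illustration: the `Alert` tuple and the helper names `out_of_order` and `repeated_primaries` are hypothetical and not part of Hyperic HQ. It also assumes that alert IDs are assigned in insertion order.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Alert(NamedTuple):
    """One EAM_ALERT row joined with its definition name (hypothetical shape)."""
    id: int
    ctime: datetime
    definition: str


# The three rows from the query output, copied by hand.
ALERTS = [
    Alert(10104, datetime(2010, 8, 26, 13, 52), "Linux Down"),
    Alert(10106, datetime(2010, 8, 26, 14, 17), "Linux Recovery"),
    Alert(10107, datetime(2010, 8, 26, 13, 55), "Linux Down"),
]


def out_of_order(alerts: List[Alert]) -> List[int]:
    """Return IDs of alerts whose ctime is earlier than that of a lower-ID alert.

    Assumes IDs reflect insertion order, so a hit means the row was
    inserted later than its ctime suggests.
    """
    flagged = []
    latest = None
    for alert in sorted(alerts, key=lambda a: a.id):
        if latest is not None and alert.ctime < latest:
            flagged.append(alert.id)
        latest = alert.ctime if latest is None else max(latest, alert.ctime)
    return flagged


def repeated_primaries(alerts: List[Alert],
                       primary: str = "Linux Down") -> List[Tuple[int, int]]:
    """Return ID pairs of primary alerts that are adjacent in ctime order,
    i.e. that fired with no other alert (such as a recovery) between them."""
    pairs = []
    previous = None
    for alert in sorted(alerts, key=lambda a: a.ctime):
        if (previous is not None
                and previous.definition == primary
                and alert.definition == primary):
            pairs.append((previous.id, alert.id))
        previous = alert
    return pairs


if __name__ == "__main__":
    print("ctime earlier than a lower-ID alert:", out_of_order(ALERTS))
    print("primary alerts with no recovery between:", repeated_primaries(ALERTS))
```

With the rows above, the first check flags alert 10107 and the second reports the pair (10104, 10107), which matches what the screenshot shows.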