Executive Summary
The Event Correlation System will react and will make decisions based on the result of processing events coming from the Context System.
The Context System gathers information from the target application, the runtime environment and the OS, and sends the Event Correlation System the events collected from the monitoring environment. From a functional point of view, the Event Correlation System is in charge of drawing high-level conclusions about the events flowing through the system and of triggering actions based on the information provided by the most relevant events. These high-level conclusions should enrich the maintenance environment with valuable information for the developers. In addition, it might use the information that flows through the component to learn from it.
Concrete objectives derived from these roles are:
- Error detection: The capability to detect a fault at the right time, that is, soon enough to act on the information that something has gone wrong, and to provide software maintainers with information about the error. In most cases these errors will not be trivial, either in the symptoms that reveal them or in their causes.
- Cause identification: In some cases, the ability to identify the cause of the problem and, where possible, to recommend a solution.
- Pattern recognition: The ability to identify new patterns of errors or abnormal situations. This is one of the more challenging objectives.
Services Provided
- To Server Data Store (at a second level, this data is read by the Maintenance Engineering Support System). Error detection and the other features will be asynchronous, since this element is event-driven; that is, the system provides information when the events occur. Hence, a first approach would be to send this information to the Server Data Store as soon as the Event Correlation System detects it. After that, the information will be available for other components to request.
  - New ErrorReport: detected error. Associated information: the context events that provided the information to detect the error. Additionally, it will share the mechanism used to detect the error, that is, whether it was detected from information in the knowledge base (detection rules) or through pattern recognition.
  - Given an ErrorReport id, give a list of possible causes
  - Given a cause associated with an ErrorReport id, provide information about a possible solution
  - Given a symptom to be mitigated (if the error cannot be solved by eliminating the cause), provide information about a possible workaround
  - Given an ErrorReport id, provide a small list of similar problems (data aggregation)
- To Server Self-Healing System
  - If a specific error is detected and no patch is associated with that error type, a control objective can be sent (with information about the symptom to be mitigated)
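The services listed above can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the class `InMemoryDataStore`, its method names, and the dictionary layout of a report (`id`, `error_type`, `symptom`) are hypothetical and are not part of any FastFix interface.

```python
from typing import Dict, List, Optional


class InMemoryDataStore:
    """Hypothetical stand-in for the Server Data Store (not the FastFix API)."""

    def __init__(self, causes, solutions, workarounds, patched_types):
        self._reports: Dict[str, dict] = {}
        self._causes = causes                # error type -> list of causes
        self._solutions = solutions          # cause -> solution text
        self._workarounds = workarounds      # symptom -> workaround text
        self._patched_types = patched_types  # error types that already have a patch

    def publish(self, report: dict) -> None:
        """Store a new ErrorReport pushed by the Event Correlation System."""
        self._reports[report["id"]] = report

    def possible_causes(self, report_id: str) -> List[str]:
        error_type = self._reports[report_id]["error_type"]
        return list(self._causes.get(error_type, []))

    def solution_for(self, report_id: str, cause: str) -> Optional[str]:
        # Only answer for causes that are actually associated with the report.
        if cause not in self.possible_causes(report_id):
            return None
        return self._solutions.get(cause)

    def workaround_for(self, symptom: str) -> Optional[str]:
        return self._workarounds.get(symptom)

    def similar_reports(self, report_id: str, limit: int = 3) -> List[str]:
        """Small list of other reports that share the same error type."""
        error_type = self._reports[report_id]["error_type"]
        same = [rid for rid, r in self._reports.items()
                if rid != report_id and r["error_type"] == error_type]
        return same[:limit]

    def control_objective(self, report_id: str) -> Optional[dict]:
        """Objective for the Self-Healing System, only when no patch exists."""
        report = self._reports[report_id]
        if report["error_type"] in self._patched_types:
            return None
        return {"report_id": report_id, "mitigate_symptom": report["symptom"]}
```

In this sketch the control objective carries only the symptom to be mitigated, which is one of the options left open in the questions section; the other options would change the returned dictionary, not the overall flow.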
Services Needed
- From Context System (through Communication System):
  - It must actively send context events (or groups of events) from the target application, the runtime environment and the OS that have been acquired from the monitoring environment.
  - These context events differ depending on the origin (user interaction, system events, runtime environment, application) and nature (text input, button click, system configuration change, resource value over threshold, JVM parameter value, log file change, ...) of the information. The information model for each of these types of context events must be shared by the Context System and the Event Correlation System (e.g. by using the same ontology).
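A shared event model of this kind could look roughly like the following. It is a minimal sketch, not the agreed ontology: the names `Origin`, `Nature` and `ContextEvent`, the field names, and the choice of JSON as the wire format are all assumptions made for illustration. The enumeration values simply mirror the examples given above.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Any, Dict


class Origin(Enum):
    USER_INTERACTION = "user_interaction"
    SYSTEM = "system"
    RUNTIME_ENVIRONMENT = "runtime_environment"
    APPLICATION = "application"


class Nature(Enum):
    TEXT_INPUT = "text_input"
    BUTTON_CLICK = "button_click"
    SYSTEM_CONFIGURATION_CHANGE = "system_configuration_change"
    RESOURCE_OVER_THRESHOLD = "resource_over_threshold"
    JVM_PARAMETER_VALUE = "jvm_parameter_value"
    LOG_FILE_CHANGE = "log_file_change"


@dataclass(frozen=True)
class ContextEvent:
    """Hypothetical event record shared by both systems."""
    event_id: str
    timestamp: float
    origin: Origin
    nature: Nature
    payload: Dict[str, Any]

    def to_json(self) -> str:
        # Enums are written as their string values so both sides agree on them.
        data = asdict(self)
        data["origin"] = self.origin.value
        data["nature"] = self.nature.value
        return json.dumps(data, sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "ContextEvent":
        data = json.loads(text)
        return ContextEvent(
            event_id=data["event_id"],
            timestamp=data["timestamp"],
            origin=Origin(data["origin"]),   # raises ValueError on unknown origin
            nature=Nature(data["nature"]),   # raises ValueError on unknown nature
            payload=data["payload"],
        )
```

Because both sides parse into the same enumerations, an event with an origin or nature that the receiver does not know is rejected at the boundary instead of being silently correlated.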
Open Questions and Doubts
- If a specific error is detected and no patch is associated with that error type, a control objective must be sent to the Server Self-Healing System. The question here is: how should this control objective be specified? By signaling the cause of the error? The symptoms? The functionality to be disabled so that the other functionalities of the application keep working?
- The Event Correlation System continuously receives context events and, after correlating them, provides information about detected errors. Then, for each error report, it evaluates rules to enrich it with additional information: associated context events, possible causes, possible solutions, and so on. It is therefore a proactive component: it provides information to the Server Data Store as soon as context information arrives and the rules fire automatically. Does any component need to call the Event Correlation System, or should it send the information to the Server Data Store to be consumed by the final component (Maintenance Environment Bridge)?
- "Real-Time" Correlation
What does "real-time" mean, given that there is at least the delay of sending to the server, correlating on the server, and sending back to the client? For which scenarios do we need it? The closest concept would be right-time. FastFix needs errors to be detected on the server side as soon as they can be detected, that is, early enough for the information to be useful to the system, so that failure conditions can be identified from the information provided by the sensors. In most cases errors will not be trivial to detect, which is why right-time error detection is needed.
Is the Event Correlation System running fully on the server, or partly on the client as well (to allow fast reaction to specific situations)?
- Implementation of rule-based system
Do we need to integrate rules and ontologies ourselves (Drools + Custom operators) or do existing solutions (SWRL) suffice?
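To make the proactive, push-style option from the questions above more concrete, the following sketch shows a correlator that evaluates its rules on every incoming event and pushes a report to a sink as soon as a rule fires. It does not use Drools or SWRL; the class `Correlator`, the rule `memory_pressure`, the window size, and the dictionary layout of events and reports are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, object]
Rule = Callable[[List[Event]], bool]


class Correlator:
    """Hypothetical event-driven correlator: rules fire as events arrive."""

    def __init__(self, rules: Dict[str, Rule],
                 sink: Callable[[dict], None], window: int = 10):
        self._rules = rules      # error type -> detection rule
        self._sink = sink        # e.g. a function that writes to the data store
        self._events: Deque[Event] = deque(maxlen=window)

    def on_event(self, event: Event) -> None:
        self._events.append(event)
        recent = list(self._events)
        for error_type, rule in self._rules.items():
            if rule(recent):
                # Push the report immediately; no component has to poll for it.
                self._sink({
                    "error_type": error_type,
                    "detected_by": "rule",
                    "context_events": recent,
                })
                self._events.clear()  # avoid reporting the same events twice
                break


def memory_pressure(events: List[Event]) -> bool:
    """Example rule: the three most recent events are all over-threshold readings."""
    tail = events[-3:]
    return len(tail) == 3 and all(
        e["nature"] == "resource_over_threshold" for e in tail)
```

Here the sink is just a callable, so the same correlator could write to the Server Data Store or hand the report directly to another component, which is exactly the choice the question leaves open.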
Shared Data Structures
- Context event, Issue, Cause, Symptom, Solution, Workaround, ErrorReport (which will be an aggregation of the previous elements: issue, its cause, the context events that allowed its detection, possible solutions, ...), ControlObjective
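The aggregation described above can be sketched with plain data classes. The field names and the helper `possible_solutions` are assumptions chosen for illustration; only the type names come from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextEvent:
    event_id: str
    description: str


@dataclass
class Symptom:
    description: str


@dataclass
class Solution:
    description: str


@dataclass
class Workaround:
    description: str
    symptom: Symptom          # the symptom this workaround mitigates


@dataclass
class Cause:
    description: str
    solutions: List[Solution] = field(default_factory=list)


@dataclass
class Issue:
    description: str
    symptoms: List[Symptom] = field(default_factory=list)


@dataclass
class ErrorReport:
    """Aggregates the issue, its causes and the events that revealed it."""
    report_id: str
    issue: Issue
    context_events: List[ContextEvent] = field(default_factory=list)
    causes: List[Cause] = field(default_factory=list)
    workarounds: List[Workaround] = field(default_factory=list)

    def possible_solutions(self) -> List[Solution]:
        # Flatten the solutions of every candidate cause, keeping their order.
        return [s for cause in self.causes for s in cause.solutions]


@dataclass
class ControlObjective:
    report_id: str
    symptom: Symptom          # what the Self-Healing System should mitigate
```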
Component-specific Errors
see [Component Specific Error Types]
Limitations
see [Component Limitations]