The purpose of this page is to collect information on error types from the perspective of each FastFix component. It answers questions such as "Which errors can your component deal with particularly well?" (e.g. for the Patch Generation Component: errors that do not change the state of the application) and "Which errors are relevant from a research point of view for this component?" (e.g. for the Fault Replication Component: concurrency errors). This information is used, in addition to the information about error type relevance, when deciding on and refining the final FastFix scenarios.
Which types of error can be handled by the component particularly well?
What are characteristic properties of these error types?
Characteristics: Errors that can only be reproduced when all user actions before the occurrence of the error are known (e.g. double-clicking a button instead of single-clicking it)
Examples: Disallowed user input, disallowed function (a function that cannot be applied in the current context)
Characteristics: Current resource usage can be monitored, and the maximum values of each resource are known
Examples: High CPU usage in the system due to another process, memory shortage
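The resource-related characteristics above (current usage can be observed, maxima are known) can be sketched as a simple threshold check. This is only an illustration under stated assumptions, not the component's actual implementation: the resource names, the limit values and the `find_exceeded` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceLimit:
    """Known maximum for one monitored resource (hypothetical values)."""
    name: str
    maximum: float


def find_exceeded(usage: dict[str, float], limits: list[ResourceLimit]) -> list[str]:
    """Return the names of resources whose current usage exceeds the known maximum."""
    exceeded = []
    for limit in limits:
        current = usage.get(limit.name)
        # Resources without a current reading are skipped: nothing can be concluded.
        if current is not None and current > limit.maximum:
            exceeded.append(limit.name)
    return exceeded


# Hypothetical limits, chosen only for the example.
LIMITS = [ResourceLimit("cpu_percent", 90.0), ResourceLimit("memory_mb", 2048.0)]
```

A call such as `find_exceeded({"cpu_percent": 97.5, "memory_mb": 512.0}, LIMITS)` reports only the CPU limit as exceeded.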
What errors can the component not deal with?
Which types of errors are relevant from a research point of view for this component?
The main purpose of this component is to collect information used by the event correlation, fault replication and patch generation components. It does not address or handle specific errors itself. Three types of information are targeted: user interactions (text entered, mouse actions, meaningful functions executed), application execution (methods called, high-level events like "Run spellchecker") and application configuration. The information is collected using sensors that instrument the target application. Several types of instrumentation are used, such as bytecode injection, real-time log file analysis, or incorporation of source code.
Which types of error can be handled by the component particularly well?
What are characteristic properties of these error types?
What errors can the component not deal with?
Which types of errors are relevant from a research point of view for this component?
The most important error types targeted by the fault replication system are errors that cause unhandled exceptions in stand-alone Java applications. However, the fault replication system could also replicate application executions without exceptions. For that, it needs to rely on some external mechanism notifying it that an error report is required for an apparently well-behaved application in which, e.g., the context system or the correlation engine has detected a problem. To understand the characteristic properties of these error types, we highlight the limitations of our current approach.
The fault replication system has two limitations:
Which types of errors are relevant from a research point of view for this component?
tbd.
Which types of error can be handled by the component particularly well?
Regarding the Event Correlation System, this question can be answered with just one word: knowledge. Any type of error that can be detected through its symptoms can be handled by the Event Correlation System, provided we know exactly what its symptoms are. Error detection is based on evaluating the conditions in detection rules, where each condition corresponds to a symptom. As soon as all conditions of a rule are simultaneously true, the error is considered to have occurred. This knowledge does not have to be specified entirely in the rule itself: part of it can be represented in an ontology, which allows for more general rules (i.e. there is no need for one rule per error type).
In other words, if we know the error well enough, the Event Correlation System can cope with it at the detection level.
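The rule evaluation described above (an error is detected once all symptom conditions of a rule hold at the same time) can be sketched as follows. This is a minimal illustration, not the Event Correlation System's rule engine: the `DetectionRule` class, the `detect` function, the context keys and the example rule are hypothetical, and the ontology-based generalization is not modeled.

```python
from dataclasses import dataclass
from typing import Callable

# A symptom is a predicate over the observed context (a plain dict in this sketch).
Symptom = Callable[[dict], bool]


@dataclass
class DetectionRule:
    """Hypothetical detection rule: an error type plus the symptoms that reveal it."""
    error_type: str
    symptoms: list[Symptom]

    def matches(self, context: dict) -> bool:
        # A rule without symptoms never fires; otherwise all symptoms must hold at once.
        return bool(self.symptoms) and all(s(context) for s in self.symptoms)


def detect(context: dict, rules: list[DetectionRule]) -> list[str]:
    """Return the error types of all rules whose symptoms are simultaneously true."""
    return [rule.error_type for rule in rules if rule.matches(context)]


# Hypothetical rule: a boundary-case input (100 or more) combined with a crash.
RULES = [
    DetectionRule(
        error_type="boundary_input_error",
        symptoms=[
            lambda ctx: ctx.get("entered_number", 0) >= 100,
            lambda ctx: ctx.get("crashed", False),
        ],
    ),
]
```

With these assumptions, `detect({"entered_number": 100, "crashed": True}, RULES)` reports the error, while a context in which only one of the two symptoms holds reports nothing.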
What are characteristic properties of these error types?
Characteristics: These errors are especially interesting from the correlation point of view, since information about the cause is strongly related to the current action of the user and to the configuration of the system and the application right before the occurrence of the error (e.g. input validation, SQL injection)
Characteristics: Configuration changes can usually be associated with specific parts of the application. Hence, if the error appears while the application is performing an action associated with the same package (or module), the Event Correlation System can associate a previous configuration change with any error in the corresponding module.
Characteristics: If we have information about the times at which several methods are fired, the EC system can detect the precise moment when the race condition sequence was executed in the application. Knowing the details of the race condition is a prerequisite, though.
Characteristics: The detection of lack of error handling is perfectly possible with information recovered from log files.
Characteristics: This kind of error is especially well suited to Event Correlation, since rules allow the definition of conditions based on boundary cases (e.g. entered number greater than 100, or exactly 100). These conditions can be combined with the errors and crashes observed when they hold (and with other rules showing that non-boundary cases do not trigger the error).
Characteristics: If the expected flow is known, EC can detect any deviation from the expected flow.
Characteristics: To detect whether a program is trying to do too much in too little time, the key question is what counts as too much. Nevertheless, if we establish some criteria and correlate resource errors with runtime environment parameters and configuration, EC can detect a wide range of situations.
Characteristics: Detecting errors coming from devices is feasible if we know what kind of actions are associated with these errors.
Characteristics: Configuration changes, list of libraries used and compatibility information can provide enough knowledge to deal with this kind of error.
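As an illustration of the race-condition case above, the following sketch looks for a known problematic method sequence in a timestamped list of method calls and returns the moment at which the sequence completes. It assumes that the details of the race condition are already known (the prerequisite stated above). The function name, the event format and the time window are hypothetical and not taken from the EC system.

```python
from typing import Optional


def find_sequence(events: list[tuple[float, str]],
                  sequence: list[str],
                  window: float) -> Optional[float]:
    """Return the timestamp at which `sequence` first completes, or None.

    `events` is a list of (timestamp, method_name) pairs sorted by timestamp.
    The methods of `sequence` must occur in order (other calls may be
    interleaved) and within `window` time units of the first method.
    """
    if not sequence:
        return None
    for start, (start_time, method) in enumerate(events):
        if method != sequence[0]:
            continue
        if len(sequence) == 1:
            return start_time
        next_index = 1  # position in `sequence` still to be matched
        for timestamp, later_method in events[start + 1:]:
            if timestamp - start_time > window:
                break  # too late to belong to this candidate occurrence
            if later_method == sequence[next_index]:
                next_index += 1
                if next_index == len(sequence):
                    return timestamp
    return None


# Hypothetical trace: a write arrives 0.1 time units after the close.
TRACE = [(0.0, "open"), (0.5, "write"), (5.0, "close"), (5.1, "write"), (5.2, "flush")]
```

For example, `find_sequence(TRACE, ["close", "write"], window=1.0)` locates the write that follows the close, whereas a sequence that does not complete within the window yields `None`.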
What errors can the component not deal with?
Which types of errors are relevant from a research point of view for this component?
The error types that are suitable for event correlation are the ones whose nature we know (more or less deeply). Hence, from a research point of view, the errors about which we have little information beyond empirical observation are the most suitable targets for pattern recognition.
These errors should be frequent enough for techniques like frequent pattern mining to be applicable, gathering information from different occurrences of the same error (at different times or even in different environments).
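To make the idea of mining several occurrences of the same error concrete, the sketch below counts which events, and which pairs of events, appear in at least a given number of occurrences. It is a deliberately simplified support count rather than a full frequent pattern mining algorithm, and the function name, event names and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_event_sets(occurrences: list[list[str]],
                        min_count: int,
                        max_size: int = 2) -> dict[tuple[str, ...], int]:
    """Count event sets (up to `max_size` events) seen in at least `min_count` occurrences.

    Each element of `occurrences` lists the events observed around one
    occurrence of the error. Duplicates within an occurrence count once.
    """
    counts: Counter = Counter()
    for events in occurrences:
        unique = sorted(set(events))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


# Hypothetical observations from three occurrences of the same error.
OCCURRENCES = [
    ["config_change", "save", "crash"],
    ["config_change", "crash"],
    ["open", "crash"],
]
```

With `min_count=2`, the configuration change and the crash are reported both individually and as a pair, which is the kind of recurring pattern that a rule could later be built on.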
Which types of error can be handled by the component particularly well?
What are characteristic properties of these error types?
The server self-healing component is more "symptoms oriented" than "error oriented". It is therefore easier to describe the type of symptoms that this component is able to deal with. This is an important distinction as 1) different types of errors can lead to similar symptoms and 2) there is no clear knowledge about what errors lead to what symptoms.
As a general comment, the self-healing approach focuses on the characterization of "bad" behaviors. This takes into account the dynamics of the system, e.g. "Traces where method1 is called before method2 are bad" (control objective). It also takes into account the possible behaviors of the system, e.g. "method1 may not be followed by method2" (model of the system behaviors).
The technique used, i.e. supervisory control theory, is event-based and well established for this setting. Work that considers system variables exists but is not very common and is quite limited, e.g. it only considers Boolean and integer variables over finite domains.
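The control objective quoted above ("traces where method1 is called before method2 are bad") can be illustrated with a very small event-based supervisor. This is a simplified sketch and not the self-healing component's implementation: it only looks one step ahead and disables controllable events that would lead directly into the bad state, whereas supervisory control theory in general also has to account for uncontrollable events further along the trace. The state names, the transition table and the `Supervisor` class are hypothetical.

```python
BAD = "bad"

# Hypothetical automaton for the objective: method2 after method1 is bad.
TRANSITIONS = {
    ("start", "method1"): "seen_method1",
    ("start", "method2"): "start",
    ("seen_method1", "method1"): "seen_method1",
    ("seen_method1", "method2"): BAD,
}


class Supervisor:
    """Tracks the automaton state and disables events that lead to the bad state."""

    def __init__(self, transitions: dict, initial: str, controllable: set):
        self.transitions = transitions
        self.state = initial
        self.controllable = controllable

    def enabled(self, event: str) -> bool:
        target = self.transitions.get((self.state, event))
        if target is None:
            return False  # the event is not part of the model in this state
        # Only controllable events can be disabled by the supervisor.
        return not (target == BAD and event in self.controllable)

    def observe(self, event: str) -> None:
        """Advance the state for an event that the supervisor allows."""
        if not self.enabled(event):
            raise ValueError(f"event {event!r} is disabled in state {self.state!r}")
        self.state = self.transitions[(self.state, event)]
```

In this sketch, `method2` is allowed as long as `method1` has not been observed and is disabled afterwards, so the bad trace cannot be completed.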
The following table lists strengths and weaknesses of the self-healing component. Dealing with the weaknesses is sometimes possible, but this usually comes with scalability issues. The table shows, for instance, that the models considered in the self-healing component are well suited to modeling the dynamics of the system, i.e. runtime behaviors, as well as concurrency and control flow, i.e. branching and loops. It also indicates that these models suffer from the state explosion problem. As a consequence, the models used in practice are often partial and represent only an abstraction of the whole system: the model cannot be very fine-grained, e.g. its states cannot represent all the states of a program.
| | Strengths | Weaknesses |
|---|---|---|
| Model: FSM | System dynamics, concurrency, control flow | State explosion, partial modeling, abstract system states |
| Technique: Supervisory Control | Deals with symptoms, automation | No new behaviors, doesn't deal well with variables, no diagnosis |