FastFix Remote Software Maintenance Wiki

Monitoring Control for Remote Software Maintenance

Status: Alpha

Brought to you by: bgaudin, dadecal, euskaraz, jog2pt, and 4 others

Component Specific Error Types

Motivation for "Component specific error types"

The purpose of this page is to collect information on error types from the perspective of each FastFix component. It answers questions such as "Which errors can be dealt with by your component particularly well?" (e.g. for the Patch Generation Component: errors that do not change the state of the application) and "Which errors are relevant from a research point of view for this component?" (e.g. for the Fault Replication Component: concurrency errors). This information is used when deciding on and refining the final FastFix scenarios (in addition to the information about error type relevance).

FastFix Client

Context System

Well suited error types

Which types of error can be handled by the component particularly well?
What are characteristic properties of these error types?

Errors caused by user behavior

Characteristics: Errors that can only be reproduced when all user actions before the occurrence of the error are known (e.g. double-clicking a button instead of single-clicking it)
Examples: Disallowed user input, disallowed function (a function that cannot be applied in the current context)

Errors related to resource conflicts or shortages in infrastructure

Characteristics: Current resource usage can be monitored, and the maximum value of each resource is known
Examples: High CPU usage in the system due to another process, memory shortage
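
As a minimal sketch of the idea that current usage can be compared against a known maximum, the following Java class checks heap usage against a threshold. The class name ResourceSensor and the 0.9 threshold are assumptions made for illustration and are not part of FastFix. The sketch only looks at the JVM heap; CPU usage caused by other processes would need a different data source.

```java
// Hypothetical sketch: ResourceSensor and the 0.9 threshold are illustrative
// assumptions, not FastFix code.
class ResourceSensor {

    /** Returns the fraction of the known maximum that is currently in use. */
    static double heapUsage(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return (double) usedBytes / maxBytes;
    }

    /** A shortage is flagged when the usage fraction reaches the threshold. */
    static boolean isShortage(long usedBytes, long maxBytes, double threshold) {
        return heapUsage(usedBytes, maxBytes) >= threshold;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // heap currently in use
        long max = rt.maxMemory();                      // known upper bound for the heap
        System.out.printf("heap usage: %.2f, shortage: %b%n",
                heapUsage(used, max), isShortage(used, max, 0.9));
    }
}
```

A sensor of this kind would typically be polled periodically, so that the usage values observed right before an error are available for correlation.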

Errors that the component has problems with

What errors can the component not deal with?

Coding bugs
In general: all errors that cannot be diagnosed by dynamic analysis or observation of user behavior

Relevant errors from research point of view

Which types of errors are relevant from a research point of view for this component?

Errors caused by user behavior

Summary

The main purpose of the Context System is to collect information used by the event correlation, fault replication and patch generation components. It does not address or handle specific errors itself. Three types of information are targeted: user interactions (text entered, mouse actions, meaningful functions executed), application execution (methods called, high-level events like "Run spellchecker") and application configuration. The information is collected using sensors that instrument the target application. Several types of instrumentation are used, such as bytecode injection, real-time log file analysis, or incorporation of source code.

Client Self-Healing System

Well suited error types

Which types of error can be handled by the component particularly well?
What are characteristic properties of these error types?

see the server part

Errors that the component has problems with

What errors can the component not deal with?

see the server part

Relevant errors from research point of view

Which types of errors are relevant from a research point of view for this component?

see the server part

Summary

see the server part

FastFix Server

Fault Replication Component

Well suited error types

The most important error types that the fault replication system targets are errors that cause unhandled exceptions in stand-alone Java applications. However, the fault replication system could also replicate application executions that raise no exception. For that, it needs to rely on some external mechanism notifying it that an error report is required for an application that appears well-behaved but in which, for example, the context system or the correlation engine has detected a problem. To understand the characteristic properties of these error types, we highlight the limitations of our current approach.

Errors that the component has problems with

The fault replication system has two limitations:

It is unable to replicate faults that happen outside Java applications, i.e. a problem with the JVM itself, a non-Java application or the operating system.
The fault replication system will not trigger the generation of an error report if the observed application does not cause an exception. Without an exception, an external component or a call made from the application has to trigger the error report.
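
The second limitation can be illustrated with a minimal Java sketch. It uses the standard Thread.setUncaughtExceptionHandler hook to record a report when a thread dies from an unhandled exception. The class ErrorReportTrigger and the in-memory report list are assumptions made for illustration; the source does not describe how FastFix actually triggers reports.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: ErrorReportTrigger and the string "reports" are
// illustrative assumptions, not the FastFix implementation.
class ErrorReportTrigger {

    /** Runs the task on a new thread and returns one report per unhandled exception. */
    static List<String> runAndCollect(Runnable task) {
        List<String> reports = Collections.synchronizedList(new ArrayList<>());
        Thread worker = new Thread(task, "worker");
        // The handler is called by the dying thread itself, before join() returns.
        worker.setUncaughtExceptionHandler(
                (thread, error) -> reports.add(thread.getName() + ": " + error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return reports;
    }

    public static void main(String[] args) {
        // An unhandled exception produces a report.
        System.out.println(runAndCollect(() -> { throw new IllegalStateException("boom"); }));
        // A run without an exception produces none, even if the application misbehaves.
        System.out.println(runAndCollect(() -> { }));
    }
}
```

Thread.setDefaultUncaughtExceptionHandler would cover all threads of an application in the same way. In both cases a run that misbehaves without throwing leaves the report list empty, which is why an external trigger is needed.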

Relevant errors from research point of view

Which types of errors are relevant from a research point of view for this component?

The errors most relevant for research are those that require the user to input data and those that involve multiple threads.

Summary

tbd.

Event Correlation Component

Well suited error types

Which types of error can be handled by the component particularly well?

Regarding the Event Correlation System, this question can be answered with just one word: knowledge. Any type of error that can be detected through its symptoms can be handled by the Event Correlation System, provided we know exactly what its symptoms are. Error detection is based on evaluating the conditions in detection rules, where each condition corresponds to a symptom. As soon as all conditions of a rule are simultaneously true, the error is considered to have occurred. This knowledge does not have to be specified entirely in the rule itself; part of it can be represented in an ontology, which allows for more general rules (i.e. there is no need for one rule per error type).

In other words, if we know the error well enough, the Event Correlation System can cope with it at the detection level.
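
The rule evaluation described above can be sketched as follows: a rule is a list of symptom conditions, and it fires only when all of them hold for the same set of facts. The class DetectionRule, the fact keys ("inputValidation", "exception") and the example rule are assumptions made for illustration; they do not reflect the actual FastFix rule format or ontology.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: DetectionRule, the fact keys and the example rule are
// illustrative assumptions, not the FastFix rule format.
class DetectionRule {
    private final String errorName;
    private final List<Predicate<Map<String, String>>> symptoms;

    DetectionRule(String errorName, List<Predicate<Map<String, String>>> symptoms) {
        this.errorName = errorName;
        this.symptoms = symptoms;
    }

    String errorName() {
        return errorName;
    }

    /** The rule fires only when every symptom condition holds for the same facts. */
    boolean fires(Map<String, String> facts) {
        return symptoms.stream().allMatch(symptom -> symptom.test(facts));
    }

    /** Example rule: an invalid input was entered and an exception was logged. */
    static DetectionRule inputCrashRule() {
        Predicate<Map<String, String>> invalidInput =
                facts -> "invalid".equals(facts.get("inputValidation"));
        Predicate<Map<String, String>> exceptionLogged =
                facts -> facts.containsKey("exception");
        return new DetectionRule("input-related crash", List.of(invalidInput, exceptionLogged));
    }

    public static void main(String[] args) {
        DetectionRule rule = inputCrashRule();
        Map<String, String> facts =
                Map.of("inputValidation", "invalid", "exception", "NumberFormatException");
        System.out.println(rule.errorName() + " detected: " + rule.fires(facts));
    }
}
```

Keeping the symptoms as independent predicates means that a single symptom alone never triggers detection, which matches the requirement that the conditions be simultaneously true.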

What are characteristic properties of these error types?

Errors associated with the user interface

Characteristics: These errors are especially interesting from the correlation point of view, since information about the cause is strongly related to the current action of the user and to the configuration of the system and the application right before the occurrence of the error (e.g. input validation, SQL injection)

Configuration error

Characteristics: Configuration changes can usually be associated with specific parts of the application. Hence, if an error appears while the application is performing an action associated with the same package (or module), the Event Correlation System can associate a previous change in configuration with any error in the corresponding module.

Race conditions

Characteristics: If we have information about the times at which several methods are fired, the EC system can detect the precise moment when the race condition sequence was executed in the application. Knowing the details of the race condition is a prerequisite, though.

Error handling

Characteristics: Missing error handling can be detected from information recovered from log files.

Boundary-related errors

Characteristics: This kind of error is especially well suited for Event Correlation, since rules allow the definition of conditions based on boundary cases (e.g. entered number greater than 100, or exactly 100). These conditions can be combined with the errors and crashes observed when they hold (and with other rules demonstrating that non-boundary cases do not trigger the error).
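
A minimal sketch of such a boundary check is given below. It assumes that each observation consists of the number entered and a flag saying whether the application failed. The hypothesis "failures occur at or beyond the bound" is supported when at least one such failure was seen and no failure was seen below the bound. The class BoundaryRule and its parameters are assumptions made for illustration.

```java
// Hypothetical sketch: BoundaryRule and its parameters are illustrative assumptions.
class BoundaryRule {

    /**
     * Returns true when at least one failure was observed at or beyond the bound
     * and no failure was observed below it.
     */
    static boolean pointsToBoundaryError(int[] entered, boolean[] failed, int bound) {
        if (entered.length != failed.length) {
            throw new IllegalArgumentException("one failure flag is needed per entered value");
        }
        boolean sawBoundaryFailure = false;
        for (int i = 0; i < entered.length; i++) {
            boolean atOrBeyondBound = entered[i] >= bound;
            if (failed[i] && !atOrBeyondBound) {
                return false; // a failure below the bound contradicts the hypothesis
            }
            if (failed[i] && atOrBeyondBound) {
                sawBoundaryFailure = true;
            }
        }
        return sawBoundaryFailure;
    }

    public static void main(String[] args) {
        int[] entered = {5, 100, 150};
        boolean[] failed = {false, true, true};
        System.out.println(pointsToBoundaryError(entered, failed, 100));
    }
}
```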

Control flow errors

Characteristics: If the expected flow is known, EC can detect any deviation from the expected flow.
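
As a minimal sketch, assuming that both the expected flow and the observed execution are available as ordered lists of event names, a deviation can be located by comparing the two position by position. The class FlowChecker and the event names are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical sketch: FlowChecker and the event names are illustrative assumptions.
class FlowChecker {

    /**
     * Returns the index of the first observed event that deviates from the expected
     * flow, or -1 if the observed events are equal to the expected flow or a prefix of it.
     */
    static int firstDeviation(List<String> expected, List<String> observed) {
        for (int i = 0; i < observed.size(); i++) {
            boolean beyondExpectedFlow = i >= expected.size();
            if (beyondExpectedFlow || !expected.get(i).equals(observed.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> expected = List.of("open", "edit", "save");
        List<String> observed = List.of("open", "save");
        System.out.println("first deviation at index " + firstDeviation(expected, observed));
    }
}
```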

Load conditions

Characteristics: To detect whether a program is trying to do too much in too little time, the key question is what counts as too much. Nevertheless, if we establish some criteria and correlate resource errors with runtime environment parameters and configuration, EC can detect a wide range of situations.

Hardware

Characteristics: Detecting errors that come from devices is feasible if we know which kinds of actions are associated with these errors.

Source, version and id control

Characteristics: Configuration changes, list of libraries used and compatibility information can provide enough knowledge to deal with this kind of error.

Errors that the component has problems with

What errors can the component not deal with?

tbd.

Relevant errors from research point of view

Which types of errors are relevant from a research point of view for this component?

The error types that are suitable for event correlation are the ones whose nature we know (more or less deeply). Hence, from a research point of view, the errors about which we have little information beyond empirical observation are the most suitable targets for pattern recognition.

These relevant errors should be frequent enough for techniques like frequent pattern mining to be applied, gathering information from different occurrences of the error itself (in different periods of time or even in different environments).

Summary

tbd.

Server Self-Healing Component

Well suited error types

Which types of error can be handled by the component particularly well?
What are characteristic properties of these error types?

The server self-healing component is more "symptoms oriented" than "error oriented". It is therefore easier to describe the type of symptoms that this component is able to deal with. This is an important distinction as 1) different types of errors can lead to similar symptoms and 2) there is no clear knowledge about which errors lead to which symptoms.
As a general comment, the self-healing approach focuses on the characterization of "bad" behaviors. This takes into account the dynamics of the system, e.g. "Traces where method1 is called before method2 are bad" (control objective). It also takes into account the possible behaviors of the system, e.g. "method1 may not be followed by method2" (model of the system behaviors).
The technique used, i.e. supervisory control theory, is event based and is well established in this setting. Work that considers system variables exists but is not very common and is quite limited, e.g. it only considers Boolean and integer variables over finite domains.
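
The control objective "Traces where method1 is called before method2 are bad" can be illustrated with a minimal event-based supervisor. The sketch below tracks a two-state model (method1 not yet seen, method1 seen) and disables method2 once method1 has occurred. The class Supervisor is an assumption made for illustration, and it further assumes that method2 is a controllable event, i.e. one the supervisor is able to prevent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Supervisor is an illustrative assumption, not the FastFix
// self-healing implementation. It assumes that "method2" is a controllable event.
class Supervisor {
    // Two-state model of the system: before and after the first call of method1.
    private boolean method1Seen = false;

    /** Returns true if the event may be executed; disabled events are not executed. */
    boolean allow(String event) {
        if (event.equals("method2") && method1Seen) {
            return false; // would complete a "bad" trace: method1 before method2
        }
        if (event.equals("method1")) {
            method1Seen = true;
        }
        return true;
    }

    /** Filters a sequence of requested events down to the events that are allowed. */
    static List<String> supervise(List<String> requested) {
        Supervisor supervisor = new Supervisor();
        List<String> executed = new ArrayList<>();
        for (String event : requested) {
            if (supervisor.allow(event)) {
                executed.add(event);
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(supervise(List.of("method2", "method1", "method2", "other")));
    }
}
```

The supervisor only removes behaviors from what the system could already do. It never adds new ones and it does not explain why the bad trace was attempted.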

The following table provides a list of strengths and weaknesses of the self-healing component. Note that dealing with the weaknesses is sometimes possible. However, this is usually associated with scalability issues. The table shows, for instance, that the models considered in the self-healing component are suitable for modeling the dynamics of the system, i.e. runtime behaviors, as well as concurrency and control flow, i.e. branching and loops. It also indicates that these models suffer from the state explosion problem. As a consequence, the models used in practice are often partial and only represent an abstraction of the whole system, i.e. the model cannot be very fine-grained, e.g. the states of the model cannot represent all the states of a program.

	Strengths	Weaknesses
Model: FSM	System dynamics	State explosion
	Concurrency	Partial modeling
	Control flow	Abstract system states
Technique: Supervisory Control	Deals with symptoms	No new behaviors
	Automation	Does not deal well with variables
		No diagnosis

Related

Wiki: Application Bridge
Wiki: Client Communication System
Wiki: Client Data Store
Wiki: Client Ontologies
Wiki: Client Self-Healing System
Wiki: Component Limitations
Wiki: Context System
Wiki: Error Reporting and User Feedback System
Wiki: Event Correlation System
Wiki: Fault Replication System
Wiki: Home
Wiki: Maintenance Engineering UI System
Wiki: Maintenance Environment Bridge
Wiki: Server Communication System
Wiki: Server Data Store
Wiki: Server Ontologies
Wiki: Server Self-Healing System