Handle AIS error codes properly

#1607 Handle AIS error codes properly

Milestone: future

Status: assigned

Owner: Anders Widell

Labels: None

Type: enhancement

Component: osaf

Part: -

Version:

Priority: major

Blocker:

Updated: 2016-08-29

Created: 2015-11-20

Creator: Anders Widell

Private: No

There is a wide range of AIS error codes defined in saAis.h that an API user is supposed to handle appropriately, but the OpenSAF services themselves currently do not handle these error codes properly internally. This ticket proposes a general improvement and cleanup of the code where we are (or, in most cases, are not) handling AIS error codes in the OpenSAF services. The proposal is also to add common library helper functions for the AIS error handling mechanism, to minimize code duplication.

Examples of error codes and how to handle them:

SA_AIS_ERR_TRY_AGAIN: Retry the function
SA_AIS_ERR_NO_RESOURCES: Similar to SA_AIS_ERR_TRY_AGAIN
SA_AIS_ERR_TIMEOUT: Retry if the function is idempotent. If the function isn't idempotent, we have to judge from case to case if it should be retried or not.
SA_AIS_ERR_BAD_HANDLE: Initialize a new handle (and possibly also do other things like setting OI implementer in case of an OI handle). Retry with the new handle. In the case of an IMM CCB handle, an incomplete IMM transaction may have to be "replayed".
SA_AIS_ERR_FAILED_OPERATION: When applying an IMM transaction, this code is returned when the transaction was aborted. It can be returned both in the case of a validation error and in the case of a resource error. To distinguish between the two causes, use the new functionality introduced in ticket [#744]. If it was a resource abort, retry by replaying the whole transaction.
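
The mapping above could be captured in a single helper, so that each service does not have to repeat the same decision logic. The following is only a sketch of that idea: the names ais_action and ais_classify are hypothetical and do not exist in OpenSAF, and the SaAisErrorT enum here is a stand-in for the subset of codes discussed in this ticket (real code would include saAis.h instead).

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the subset of SaAisErrorT discussed in this ticket.
 * Real code would include saAis.h instead of redefining these. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TIMEOUT = 5,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_BAD_HANDLE = 9,
    SA_AIS_ERR_NO_RESOURCES = 18,
    SA_AIS_ERR_FAILED_OPERATION = 21
} SaAisErrorT;

/* Hypothetical set of follow-up actions a caller may have to take. */
typedef enum {
    AIS_ACTION_DONE,              /* call succeeded */
    AIS_ACTION_RETRY,             /* call the same function again */
    AIS_ACTION_CASE_BY_CASE,      /* timeout on a non-idempotent call */
    AIS_ACTION_REINIT_HANDLE,     /* get a new handle, then retry */
    AIS_ACTION_CHECK_ABORT_CAUSE, /* validation error or resource abort? */
    AIS_ACTION_FAIL               /* not covered by the list above */
} ais_action;

/* Map an AIS return code to the action suggested in the list above.
 * 'idempotent' tells whether the failed call can safely be repeated. */
static ais_action ais_classify(SaAisErrorT rc, bool idempotent)
{
    /* AIS codes start at 1; 0 usually means an uninitialized variable. */
    assert(rc != 0);

    switch (rc) {
    case SA_AIS_OK:
        return AIS_ACTION_DONE;
    case SA_AIS_ERR_TRY_AGAIN:
    case SA_AIS_ERR_NO_RESOURCES:
        return AIS_ACTION_RETRY;
    case SA_AIS_ERR_TIMEOUT:
        return idempotent ? AIS_ACTION_RETRY : AIS_ACTION_CASE_BY_CASE;
    case SA_AIS_ERR_BAD_HANDLE:
        return AIS_ACTION_REINIT_HANDLE;
    case SA_AIS_ERR_FAILED_OPERATION:
        return AIS_ACTION_CHECK_ABORT_CAUSE;
    default:
        return AIS_ACTION_FAIL;
    }
}
```

What "reinitialize the handle" or "replay the transaction" means in practice is service specific, so a helper like this can only tell the caller which kind of recovery is needed, not perform it.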

For how long should we keep retrying?

It is very difficult to set a maximum time limit for how long we need to keep retrying before we give up, as can be seen for example in ticket [#1582]. In many cases it is also difficult to decide what to do when we give up. Sometimes, we can just skip the action and continue anyway. An example of this would be logging; logging a message is normally not vital to the function of the system. In those cases, we should only retry for a short while (or not at all), and then give up the operation and continue in the same way as if it had been successful. However, in many cases the operation cannot be skipped. Restarting the calling process is unlikely to help, since the AIS call is failing because some other OpenSAF service (possibly on another node) is unresponsive. Therefore, the proposal is that in these cases where the operation is vital, we should keep retrying forever and let higher-level monitoring (NID or AMF healthcheck) detect and recover hanging processes. For debugging purposes, we should however log a message to syslog to indicate where we are stuck in a retry loop. This logging should be done by the common helper functions.
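
A common retry helper along these lines might look roughly like the sketch below. Everything in it is an assumption made for illustration: retry_policy, ais_retry and flaky_call are hypothetical names, the SaAisErrorT enum is a stand-in for saAis.h, and only SA_AIS_ERR_TRY_AGAIN and SA_AIS_ERR_NO_RESOURCES are treated as retryable. Vital operations are retried forever, non-vital ones give up after a bounded number of attempts, and the helper itself writes to syslog so that a stuck retry loop can be located.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <syslog.h>
#include <time.h>

/* Stand-in for saAis.h; real code would include that header instead. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_NO_RESOURCES = 18
} SaAisErrorT;

/* Hypothetical wrapper type: one AIS call with its arguments in 'ctx'. */
typedef SaAisErrorT (*ais_call_fn)(void *ctx);

/* Hypothetical retry policy. */
typedef struct {
    bool vital;            /* true: retry forever, never give up */
    unsigned max_attempts; /* used only when !vital */
    unsigned delay_ms;     /* pause between attempts */
    unsigned log_every;    /* log every N failed attempts, 0 = never */
} retry_policy;

/* Call fn(ctx) until it returns something other than a transient error.
 * Returns the last code; for a non-vital operation the caller may ignore
 * a failure and continue as if the call had succeeded. */
static SaAisErrorT ais_retry(const char *what, ais_call_fn fn, void *ctx,
                             const retry_policy *p)
{
    assert(fn != NULL && p != NULL);
    unsigned attempt = 0;

    for (;;) {
        SaAisErrorT rc = fn(ctx);
        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_NO_RESOURCES)
            return rc;

        ++attempt;
        if (!p->vital && attempt >= p->max_attempts) {
            syslog(LOG_NOTICE, "%s: giving up after %u attempts",
                   what, attempt);
            return rc;
        }
        if (p->log_every != 0 && attempt % p->log_every == 0)
            syslog(LOG_WARNING, "%s: still retrying after %u attempts",
                   what, attempt);

        struct timespec ts = { (time_t)(p->delay_ms / 1000),
                               (long)(p->delay_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
}

/* Demo callback: fails with TRY_AGAIN until the counter reaches zero. */
static SaAisErrorT flaky_call(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return SA_AIS_ERR_TRY_AGAIN;
    }
    return SA_AIS_OK;
}
```

With a policy such as { .vital = false, .max_attempts = 2 }, a logging call would be attempted twice and then skipped, whereas { .vital = true } never returns a transient error and relies on NID or the AMF healthcheck to deal with a process that stays stuck.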

Mathi Naickan - 2016-05-04

Milestone: 5.0.FC --> 5.1.FC

A V Mahesh (AVM) - 2016-08-10

===============================================
On 8/8/2016 7:06 PM, Lennart Lund wrote:

Return Values : NCSCC_RC_SUCCESS/NCSCC_RC_FAILURE

Notes : None

*/
[Lennart] NOTE: This function is not an NCS function and should therefore not use NCS error codes. Unfortunately, this is commonly done all over the code, but it would be good not to add more of it!

===============================================

This ticket should also consider addressing Lennart's comment.

Last edit: A V Mahesh (AVM) 2016-08-10

Anders Widell - 2016-08-29

Milestone: 5.1.FC --> 5.2.FC

Anders Widell - 2017-01-30

Milestone: 5.2.FC --> future
