There is a wide variety of AIS error codes defined in saAis.h that an API user is supposed to handle in an appropriate way, but currently the OpenSAF services themselves do not handle these error codes properly internally. This ticket proposes a general improvement / cleanup of the code where we are (or, in most cases, are not) handling AIS error codes in the OpenSAF services. The proposal is also to add common library helper functions for the AIS error handling mechanism, to minimize code duplication.
Examples of error codes and how to handle them:
It is very difficult to set a maximum time limit for how long we need to keep retrying before we give up, as can be seen for example in ticket [#1582]. In many cases it is also difficult to decide what to do when we give up. Sometimes we can just skip the action and continue anyway. An example of this would be logging; logging a message is normally not vital to the function of the system. In those cases, we should only retry for a short while (or not at all), and then give up the operation and continue in the same way as if it had been successful. However, in many cases the operation cannot be skipped. Restarting the calling process is unlikely to help, since the AIS call is failing because some other OpenSAF service (possibly on another node) is unresponsive. Therefore, the proposal is that in these cases, where the operation is vital, we should keep retrying forever and let higher-level monitoring (NID or AMF healthcheck) detect and recover hanging processes. For debugging purposes, we should however log a message to syslog to indicate where we are stuck in a retry loop. This logging should be done by the common helper functions.
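A minimal sketch of what such a common helper could look like, in C. Everything named here is hypothetical and not existing OpenSAF API: `ais_retry`, `ais_op_fn`, `retry_policy_t`, the retry interval, the attempt limit and the logging frequency are all illustrative choices. The `SaAisErrorT` enum is a stand-in so that the sketch compiles on its own; real code would include saAis.h instead. The sketch also assumes that only SA_AIS_ERR_TRY_AGAIN is retried; which other error codes should be retried is left open.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <syslog.h>
#include <time.h>

/* Stand-in for the definitions in saAis.h (assumption: only these codes are needed here). */
typedef enum {
  SA_AIS_OK = 1,
  SA_AIS_ERR_LIBRARY = 2,
  SA_AIS_ERR_TRY_AGAIN = 6
} SaAisErrorT;

/* Hypothetical callback type: wraps one AIS call, with its arguments passed via ctx. */
typedef SaAisErrorT (*ais_op_fn)(void *ctx);

/* Hypothetical policy: non-vital operations give up quickly, vital ones never give up. */
typedef enum { RETRY_BRIEFLY, RETRY_FOREVER } retry_policy_t;

/* Illustrative tuning values, not taken from the ticket. */
static const unsigned long BRIEF_MAX_ATTEMPTS = 5;
static const unsigned long LOG_EVERY_N_RETRIES = 100;
static const long RETRY_INTERVAL_NS = 10L * 1000L * 1000L; /* 10 ms */

/* Calls op(ctx) until it returns something other than SA_AIS_ERR_TRY_AGAIN.
 * RETRY_BRIEFLY: returns the error after BRIEF_MAX_ATTEMPTS failed attempts, so the
 *                caller can skip the operation and continue.
 * RETRY_FOREVER: never gives up; NID or the AMF healthcheck is expected to recover a
 *                hanging process. A syslog message shows where the loop is stuck. */
static SaAisErrorT ais_retry(ais_op_fn op, void *ctx, retry_policy_t policy,
                             const char *what)
{
  assert(op != NULL && what != NULL);
  const struct timespec delay = {0, RETRY_INTERVAL_NS};
  unsigned long failed = 0;

  for (;;) {
    SaAisErrorT rc = op(ctx);
    if (rc != SA_AIS_ERR_TRY_AGAIN)
      return rc; /* success, or an error that retrying will not fix */

    failed++;
    if (policy == RETRY_BRIEFLY && failed >= BRIEF_MAX_ATTEMPTS) {
      syslog(LOG_NOTICE, "%s: giving up after %lu attempts", what, failed);
      return rc;
    }
    if (failed % LOG_EVERY_N_RETRIES == 0)
      syslog(LOG_WARNING, "%s: still retrying after %lu attempts", what, failed);

    nanosleep(&delay, NULL);
  }
}

/* Example operation for illustration only: returns SA_AIS_ERR_TRY_AGAIN until the
 * counter that ctx points to reaches zero, then returns SA_AIS_OK. */
static SaAisErrorT flaky_op(void *ctx)
{
  int *remaining = ctx;
  if (*remaining > 0) {
    (*remaining)--;
    return SA_AIS_ERR_TRY_AGAIN;
  }
  return SA_AIS_OK;
}
```

With this shape, a non-vital caller such as logging would use `ais_retry(op, ctx, RETRY_BRIEFLY, "log write")` and ignore a returned SA_AIS_ERR_TRY_AGAIN, while a vital caller would use RETRY_FOREVER and only see a return value once the operation has either succeeded or failed with a different error code.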
Tickets: #1582
Tickets: #1632
Tickets: #1648
Tickets: #1821
Tickets: #1833
Tickets: #744
===============================================
On 8/8/2016 7:06 PM, Lennart Lund wrote:
This ticket should also consider addressing Lennart's comment.
Last edit: A V Mahesh (AVM) 2016-08-10