
#1607 Handle AIS error codes properly

future
assigned
None
enhancement
osaf
-
major
2016-08-29
2015-11-20
No

saAis.h defines a wide range of AIS error codes that an API user is supposed to handle appropriately, but the OpenSAF services themselves currently do not handle these error codes properly internally. This ticket proposes a general improvement / cleanup of the code where we are (or in most cases: are not) handling AIS error codes in the OpenSAF services. The proposal is also to add common library helper functions for the AIS error handling mechanism, to minimize code duplication.

Examples of error codes and how to handle them:

  • SA_AIS_ERR_TRY_AGAIN: Retry the function
  • SA_AIS_ERR_NO_RESOURCES: Similar to SA_AIS_ERR_TRY_AGAIN
  • SA_AIS_ERR_TIMEOUT: Retry if the function is idempotent. If the function isn't idempotent, we have to judge case by case whether it should be retried.
  • SA_AIS_ERR_BAD_HANDLE: Initialize a new handle (and possibly also do other things like setting OI implementer in case of an OI handle). Retry with the new handle. In the case of an IMM CCB handle, an incomplete IMM transaction may have to be "replayed".
  • SA_AIS_ERR_FAILED_OPERATION: When applying an IMM transaction, this code is returned when the transaction was aborted. It can be returned both in the case of a validation error and in the case of a resource error. To distinguish between the two causes, use the new functionality introduced in ticket [#744]. If it was a resource abort, retry by replaying the whole transaction.
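
The mapping in the list above could be captured in one shared classification function, so that each service does not repeat the same switch statement. The following is only a minimal sketch: the names ais_action_t and ais_classify are hypothetical, the enum is a stand-in for the real SaAisErrorT from saAis.h (its numeric values here are arbitrary placeholders), and the resource_abort flag is assumed to be filled in by the caller using the functionality from ticket [#744].

```c
#include <stdbool.h>

/* Stand-in for SaAisErrorT so the sketch compiles on its own.
 * Real code would include saAis.h instead; the numeric values
 * here are arbitrary placeholders. */
typedef enum {
    SA_AIS_OK,
    SA_AIS_ERR_TIMEOUT,
    SA_AIS_ERR_TRY_AGAIN,
    SA_AIS_ERR_BAD_HANDLE,
    SA_AIS_ERR_NO_RESOURCES,
    SA_AIS_ERR_FAILED_OPERATION,
    SA_AIS_ERR_INVALID_PARAM
} SaAisErrorT;

/* Hypothetical set of recovery actions, one per bullet in the ticket. */
typedef enum {
    AIS_ACTION_DONE,           /* call succeeded */
    AIS_ACTION_RETRY,          /* call the same function again */
    AIS_ACTION_REINIT_HANDLE,  /* create a new handle, then retry */
    AIS_ACTION_REPLAY_CCB,     /* replay the whole IMM transaction */
    AIS_ACTION_CALLER_DECIDES, /* timeout on a non-idempotent call */
    AIS_ACTION_FAIL            /* not recoverable by retrying */
} ais_action_t;

/* Map an AIS return code to a recovery action.
 * idempotent:     the failed call can safely be repeated.
 * resource_abort: only meaningful for SA_AIS_ERR_FAILED_OPERATION;
 *                 true if the CCB was aborted for resource reasons. */
ais_action_t ais_classify(SaAisErrorT rc, bool idempotent, bool resource_abort)
{
    switch (rc) {
    case SA_AIS_OK:
        return AIS_ACTION_DONE;
    case SA_AIS_ERR_TRY_AGAIN:
    case SA_AIS_ERR_NO_RESOURCES:
        return AIS_ACTION_RETRY;
    case SA_AIS_ERR_TIMEOUT:
        return idempotent ? AIS_ACTION_RETRY : AIS_ACTION_CALLER_DECIDES;
    case SA_AIS_ERR_BAD_HANDLE:
        return AIS_ACTION_REINIT_HANDLE;
    case SA_AIS_ERR_FAILED_OPERATION:
        return resource_abort ? AIS_ACTION_REPLAY_CCB : AIS_ACTION_FAIL;
    default:
        return AIS_ACTION_FAIL;
    }
}
```

Keeping the decision separate from the retry loop lets handle re-initialization and CCB replay, which differ per service, stay in the caller.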

For how long should we keep retrying?

It is very difficult to set a maximum time limit for how long we need to keep retrying before we give up, as can be seen for example in ticket [#1582]. It is also in many cases difficult to decide what to do when we give up. Sometimes, we can just skip the action and continue anyway. An example of this case would be logging; logging a message is normally not vital to the function of the system. In those cases, we should only retry for a short while (or not at all), and then give up the operation and continue in the same way as if it was successful. However, in many cases the operation cannot be skipped. Restarting the calling process is unlikely to help, since the AIS call is failing because some other OpenSAF service (possibly on another node) is unresponsive. Therefore, the proposal is that in these cases where the operation is vital, we should keep retrying forever and let higher-level monitoring (NID or AMF healthcheck) detect and recover hanging processes. For debugging purposes, we should however log a message to syslog to indicate where we are stuck in a retry loop. This logging should be done by the common helper functions.
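
As an illustration of the retry policy described above, a common helper might look roughly like the sketch below. Everything in it is an assumption made for illustration: the names ais_retry, ais_retry_policy_t and flaky_op are hypothetical, the enum is a stand-in for SaAisErrorT from saAis.h with placeholder values, and the delay and logging intervals are not taken from the ticket. Only the standard POSIX calls syslog() and nanosleep() are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <syslog.h>
#include <time.h>

/* Stand-in for SaAisErrorT; real code would include saAis.h.
 * Numeric values are arbitrary placeholders. */
typedef enum {
    SA_AIS_OK,
    SA_AIS_ERR_TRY_AGAIN,
    SA_AIS_ERR_NO_RESOURCES,
    SA_AIS_ERR_INVALID_PARAM
} SaAisErrorT;

/* The operation to retry, wrapped so the helper stays generic. */
typedef SaAisErrorT (*ais_op_fn)(void *ctx);

typedef struct {
    bool vital;         /* true: retry forever; false: give up after max_tries */
    unsigned max_tries; /* only used when vital is false */
    unsigned delay_ms;  /* pause between attempts */
    unsigned log_every; /* log every N failed attempts; 0 disables it */
} ais_retry_policy_t;

static void sleep_ms(unsigned ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Call op until it returns something other than TRY_AGAIN/NO_RESOURCES.
 * A vital operation is retried forever; a non-vital one is abandoned
 * after max_tries and the last error is returned, which the caller may
 * ignore and continue as if the call had succeeded. */
SaAisErrorT ais_retry(const char *what, ais_op_fn op, void *ctx,
                      const ais_retry_policy_t *p)
{
    unsigned tries = 0;
    for (;;) {
        SaAisErrorT rc = op(ctx);
        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_NO_RESOURCES)
            return rc;
        ++tries;
        if (!p->vital && tries >= p->max_tries) {
            syslog(LOG_NOTICE, "%s: giving up after %u attempts", what, tries);
            return rc;
        }
        if (p->log_every != 0 && tries % p->log_every == 0)
            syslog(LOG_WARNING, "%s: still retrying after %u attempts (rc=%d)",
                   what, tries, (int)rc);
        sleep_ms(p->delay_ms);
    }
}

/* Hypothetical operation for demonstration: fails with TRY_AGAIN
 * until the counter pointed to by ctx reaches zero. */
static SaAisErrorT flaky_op(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return SA_AIS_ERR_TRY_AGAIN;
    }
    return SA_AIS_OK;
}
```

A logging call would use a policy with vital set to false and a small max_tries, while an IMM update that cannot be skipped would set vital to true and rely on NID or the AMF healthcheck to recover the process if the loop never ends.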

Related

Tickets: #1582
Tickets: #1632
Tickets: #1648
Tickets: #1821
Tickets: #1833
Tickets: #744

Discussion

  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 5.0.FC --> 5.1.FC
     
  • A V Mahesh (AVM)

    ===============================================
    On 8/8/2016 7:06 PM, Lennart Lund wrote:

       * Return Values : NCSCC_RC_SUCCESS/NCSCC_RC_FAILURE
       * Notes         : None
       */
      [Lennart] NOTE: This function is not an NCS function and should therefore not use NCS error codes. Unfortunately this is commonly done all over the code, but it would be good not to add more of this!

    ===============================================

    This ticket should also consider addressing Lennart's comment.

     

    Last edit: A V Mahesh (AVM) 2016-08-10
  • Anders Widell

    Anders Widell - 2016-08-29
    • Milestone: 5.1.FC --> 5.2.FC
     
  • Anders Widell

    Anders Widell - 2017-01-30
    • Milestone: 5.2.FC --> future
     
