#2133 AMF: Rollback admin shutdown/lock SI operation if node failover

5.2.FC
fixed
None
enhancement
amf
d
major
2017-02-21
2016-10-20
No

In the scenario where an SI is shut down, the QUIESCING CSI callback is delayed, and the node hosting the SU with that pending callback is then rebooted, the result differs between SGs:
- For 2N: the SI admin state is rolled back to UNLOCKED
- For N-Way: the SI admin state moves to LOCKED
- For NpM: not tested; from reading SG_NPM::node_fail_si_oper, the SI admin state appears to roll back to UNLOCKED

My question is whether the results of this scenario should be consistent, and what the expected outcome is.
Also, the handling in node_fail_si_oper for admin lock is not consistent: for 2N the admin state remains LOCKED, while NpM rolls back to UNLOCKED.
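
The observations above can be tabulated to make the inconsistency explicit. The following is only an illustrative sketch: the dictionary keys, the `OBSERVED` name and the `inconsistent_operations` helper are hypothetical and not part of OpenSAF, and the NpM shutdown entry reflects code reading rather than a test run.

```python
# Admin state observed after node failover, keyed by (admin operation, redundancy model).
# Values are taken from the ticket description only; N-Way lock was not reported.
OBSERVED = {
    ("shutdown", "2N"): "UNLOCKED",
    ("shutdown", "N-Way"): "LOCKED",
    ("shutdown", "NpM"): "UNLOCKED",  # from reading SG_NPM::node_fail_si_oper, not tested
    ("lock", "2N"): "LOCKED",
    ("lock", "NpM"): "UNLOCKED",
}


def inconsistent_operations(observed):
    """Return the admin operations whose final state differs between models."""
    states_by_op = {}
    for (operation, _model), state in observed.items():
        states_by_op.setdefault(operation, set()).add(state)
    return sorted(op for op, states in states_by_op.items() if len(states) > 1)
```

With the values reported here, both operations end up in the returned list, which is the inconsistency this ticket asks about.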

Related

Tickets: #2133
Wiki: NEWS-5.2.0

Discussion

  • Praveen

    Praveen - 2016-10-21

    Hi Minh,
    The lock operation on an SI deviates from the spec in all redundancy models. The spec does not mention giving a quiesced callback for the lock operation (section 10.4, page 404). There is already a ticket for this.
    I just checked the code: in the SI lock case, for the 2N model AMFD gives the removal and the quiesced assignment simultaneously. For NpM and N-Way, it initially gives only the quiesced assignment, and removes all assignments after successfully receiving the quiesced response. So in these two models, I think, the operation gets reverted if the quiesced state fails. I do not know whether the 2N case behaved this way from the beginning or was modified in some fix.

    The shutdown operation sequence is the same in all three redundancy models, as per the spec. In the 2N model, behaviours differ in case of faults and in case of SI dependencies. Some of these are documented in the conformance table in section 3.6.2 of the AMF PR doc. There are at least 2 tickets related to the inconsistency.

    I think that for the lock operation on an SI, the spec does not talk about reverting it. However, there is a diagram on page 83 that shows a quiesced-to-active migration, but I think this must be the shutdown or si-swap case. For the shutdown operation, there is guidance in the spec, chapter 9, section 9.1, which says:
    "Note that the shutdown administrative operation is non-blocking, which means that it
    may complete while the actual procedure to shut down the target entity is still in
    progress (that is, the entity has not yet reached the locked administrative state). As
    soon as the shutdown administrative operation completes, other administrative operations
    such as lock or unlock can be invoked; they can interrupt the shutdown procedure
    and force the target entity into a locked or an unlocked administrative state."

    I think we should do the following for consistent behaviour:
    1) For the lock operation on an SI, we should not revert it, irrespective of faults.
    2) For the shutdown operation on an SI, we should likewise not revert it if a fault occurs.

    What do you think?
    Also, I would like other maintainers to comment on this.

    Thanks,
    Praveen

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-24

    Hi Praveen,

    I agree with both 1) and 2): lock and shutdown SI operations should not be reverted in case of a fault. The reason (I think) is that when a lock/shutdown SI operation is issued, it likely means the application intends to stop providing the service, which could involve releasing resources, closing connections, and so on. Reverting to UNLOCKED with an active assignment would force the application to continue providing the service, which could lead to many unhandled cases in applications.

    As for the quiesced-to-active migration on page 83, I think it covers failover during si-swap, where an error occurs on the current STANDBY SU after the ACTIVE SU has been quiesced.

    I would also like to hear from other maintainers.

    Thanks,
    Minh

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-10

    The lock operation does not behave consistently between SGs when a failover occurs during the lock command. Marking this ticket as a defect for the future milestone.

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-10
    • summary: AMF: Rollback admin shutdown SI operation if node failover --> AMF: Rollback admin shutdown/lock SI operation if node failover
    • Type: discussion --> defect
    • Milestone: 5.2.FC --> future
     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-18

    Here is an update for 2N:
    1. In case of Comp-f/o,
    - with SI Dep configured, admin state is locked
    - without SI Dep configured, admin state is unlocked.
    2. In case of SU f/o:
    - with or without SI Dep configured, admin state is unlocked.
    - with single SI assignment, the admin state is locked.
    3. In case of node f/o: same as SU f/o.

    So I think that in different situations AMF can set a different admin state, depending on how it can recover from the fault. But when the admin state is marked UNLOCKED, the admin op result should be returned as TIMEOUT.
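
The 2N findings above can be written as a small decision function. This is a minimal sketch, not amfd code: the `Fault` enum and both function names are hypothetical, and it assumes that the single-assignment rule applies only to SU and node failover (as listed) and takes precedence there. Returning SA_AIS_OK when the SI ends up LOCKED is also an assumption; the comment only proposes TIMEOUT for the rolled-back case.

```python
from enum import Enum


class Fault(Enum):
    COMP_FAILOVER = "comp-failover"
    SU_FAILOVER = "su-failover"
    NODE_FAILOVER = "node-failover"


def admin_state_after_fault_2n(fault, si_dep_configured, single_assignment):
    """SI admin state reported for 2N when a fault interrupts SI shutdown."""
    if fault is Fault.COMP_FAILOVER:
        # Finding 1: only the SI dependency configuration matters here.
        return "LOCKED" if si_dep_configured else "UNLOCKED"
    # Findings 2 and 3: SU failover and node failover behave the same.
    # Assumption: a single SI assignment wins over the SI dependency setting.
    if single_assignment:
        return "LOCKED"
    return "UNLOCKED"


def admin_op_result(admin_state, rollback_code="SA_AIS_ERR_TIMEOUT"):
    """Proposed admin op result: an error code when shutdown was rolled back."""
    # Assumption: a completed shutdown (LOCKED) reports SA_AIS_OK.
    return "SA_AIS_OK" if admin_state == "LOCKED" else rollback_code
```

Keeping the rollback code as a parameter makes it easy to compare alternatives for the value returned on rollback.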

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-18

    I will check the admin op return code and will try to fix them.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-18

    Please comment.

     
  • Minh Hon Chau

    Minh Hon Chau - 2017-01-19

    Hi Nagu,

    Section 3.6.2 (compliance table) of the AMF PR doc says:

    In 2N model, deviations related to shutdown operation on SI:
    1) During shutdown operation on SI if component faults and it leads to component-failover, sufailover or nodefailover, then shutdown operation will not be completed and SI will remain in assigned state.
    2) There is, however, one deviation when SI dependency is configured and component's fault leads to component-failover. In this case SI will go to locked state with no assignments.

    So there appears to be only one case in which the SI goes to LOCKED: when SI dependency is configured and the fault escalates to component-failover. In all other cases the SI remains in ASSIGNED state; I assume the SI has to be UNLOCKED to remain ASSIGNED.

    Compared to your findings, is there something we have to do about "single SI assignment, the admin state is locked." for SU f/o and node f/o?

    I think it's a good idea to return another code (TIMEOUT, ...?) when error escalation rolls back the shutdown command.

    Another question: do you know the use case or technical problem behind this deviation/inconsistency? It would be good to document it in the README, at least for code maintenance purposes (or to know what to do in similar cases for other SGs).

    thanks,
    Minh

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-23
    • status: unassigned --> accepted
    • assigned_to: Nagendra Kumar
    • Milestone: future --> 5.2.FC
     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-23

    > Compared to your findings, is there something we have to do about "single SI assignment, the admin state is locked." for SU f/o and node f/o?
    No, I think that for a single SI assignment we can return success and mark the admin state as LOCKED.
    > Another question: do you know the use case or technical problem behind this deviation/inconsistency?
    It is for ease of flow. For a single SI, we can easily mark it locked and remove the assignments from the active SU. For SIs having two assignments, the 2N redundancy model reuses the SU switchover code, which leaves the SI in the unlocked state.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-23

    I will provide a fix for the 2N redundancy model in the 5.2 release. The fix will return TIMEOUT for admin shutdown failures where the shutdown admin op gets reverted and the admin state is rolled back to UNLOCKED.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-01-30

    > I think it's a good idea to return another code (TIMEOUT, ...?) when error escalation rolls back the shutdown command.
    I think it is better to return TRY_AGAIN, since its description in the spec leaves room for the error case:
    "SA_AIS_ERR_TRY_AGAIN - The service cannot be provided at this time. The client
    may retry later. This error generally should be returned when the requested action is
    valid but not currently possible, probably because another operation is acting upon
    the logical entity on which the administrative operation is invoked. Such an operation
    can be another administrative operation or an error recovery initiated by the Availability
    Management Framework."
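
From the caller's side, a TRY_AGAIN result means the shutdown can simply be reissued once error recovery has finished. The sketch below shows one way a client could do that. It is an assumption-laden illustration: `invoke` stands for any hypothetical callable that issues the admin operation and returns a result code, `invoke_with_retry` is not an OpenSAF API, and the enum lists only the three SaAisErrorT values relevant here.

```python
import time
from enum import IntEnum


class SaAisError(IntEnum):
    # Subset of SaAisErrorT; numeric values follow the SA Forum AIS headers.
    OK = 1
    TIMEOUT = 5
    TRY_AGAIN = 6


def invoke_with_retry(invoke, max_attempts=5, delay_s=1.0):
    """Call invoke() until it stops returning TRY_AGAIN or attempts run out.

    invoke is a hypothetical zero-argument callable that issues the admin
    operation and returns a SaAisError. max_attempts must be at least 1.
    """
    result = SaAisError.TRY_AGAIN
    for attempt in range(max_attempts):
        result = invoke()
        if result != SaAisError.TRY_AGAIN:
            return result
        if attempt + 1 < max_attempts:
            # Give AMF time to finish its error recovery before retrying.
            time.sleep(delay_s)
    return result
```

A bounded retry count is used so that a caller does not loop forever if recovery never completes; the final TRY_AGAIN is then returned to the caller unchanged.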

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-01

    Any comments? I am preparing the patch with TRY_AGAIN.

     
    • Minh Hon Chau

      Minh Hon Chau - 2017-02-02

      Hi Nagu,

      I prefer not to roll back the operations (as Praveen commented earlier) if the rollback is due to the internal implementation rather than a specific use case. Anyway, if we have no way to correct it, then we have to accept it. I don't have a clear view on which error code should be returned; both TRY_AGAIN and TIMEOUT seem OK, since the caller will have to retry the operation.

      Thanks,
      Minh

       
      • Nagendra Kumar

        Nagendra Kumar - 2017-02-02

        Hi Minh,

        I also would prefer not to roll back, but because of the internal implementation I am just reusing the code. I prefer TRY_AGAIN as the return code because an error has occurred and the operation can't be completed.


        Thanks

        -Nagu

         
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-01
    • Type: defect --> enhancement
     
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-01

    Changing to enhancement, as this changes the return codes of a few of the admin operations mentioned above.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-07
    • status: accepted --> review
     
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-07

    Sent patch for review with the above implementation.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-21
    • status: review --> fixed
     
  • Nagendra Kumar

    Nagendra Kumar - 2017-02-21

    changeset: 8592:f13798019501
    tag: tip
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue Feb 21 10:01:48 2017 +0530
    summary: amfd: return TRY_AGAIN on rollback of shutdown admin op [#2133]

    [staging:f13798]

    Documentation Changes:
    changeset: 208:f21e52b1f0d1
    tag: tip
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue Feb 21 10:28:42 2017 +0530
    summary: amf: deviations on SI shutdown [#2133]

    [staging:f21e52]

     

    Related

    Commit: [f13798]
    Tickets: #2133

