OpenSAF / Tickets / #2133 AMF: Rollback admin shutdown/lock SI operation if node failover

Praveen - 2016-10-21

Hi Minh,
The lock operation on SI deviates from the spec for all redundancy models. The spec does not talk about giving a quiesced callback for the lock operation (section 10.4, page 404). There is a ticket for this already.
I just checked the code: in the SI lock case, for the 2N model AMFD gives removal and quiesced assignment simultaneously. But for NpM and N-Way, it initially gives only the quiesced assignment and then removes all assignments after successfully receiving the quiesced response. So in these two models, I think, the operation gets reverted if the quiesced state fails. But I do not know whether the 2N case was like this from the beginning or was modified in some fix.
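
The two orderings described above can be put side by side in a small sketch. This is not AMFD code: the function name, the step labels, and the model keys are illustrative assumptions, and only the ordering is taken from the description in this comment.

```python
def si_lock_steps(red_model, quiesced_ok=True):
    """Return the ordered steps described for an SI lock in each model.

    Illustrative only: the step labels and model names are assumptions
    made for this sketch, not identifiers from the OpenSAF sources.
    Each tuple groups steps that are issued together.
    """
    if red_model == "2N":
        # 2N: removal and quiesced assignment are given simultaneously.
        return [("remove", "quiesced")]
    if red_model in ("NpM", "N-Way"):
        # NpM / N-Way: quiesced first; removal of all assignments follows
        # only after the quiesced response is received successfully.
        steps = [("quiesced",)]
        if quiesced_ok:
            steps.append(("remove",))
        return steps
    raise ValueError(f"model not covered in this comment: {red_model}")


if __name__ == "__main__":
    for model in ("2N", "NpM", "N-Way"):
        print(model, si_lock_steps(model))
```

What happens when the quiesced step fails in NpM or N-Way (the suspected revert) is deliberately not modelled; the sketch only shows that removal is not issued in that case.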

The shutdown operation sequence is the same in all three redundancy models, as per the spec. In the 2N model, behaviour differs in case of faults and in case of SI dependencies. Some of these differences are documented in the conformance table in section 3.6.2 of the AMF PR doc. There are at least 2 tickets related to this inconsistency.

I think that for the lock operation on SI, the spec does not talk about reverting it. However, there is one diagram on page 83 where a quiesced-to-active migration is shown. But I think this must be the case of a shutdown or si-swap operation. For the shutdown operation, there is guidance in spec chapter 9, section 9.1, which says:
"Note that the shutdown administrative operation is non-blocking, which means that it
may complete while the actual procedure to shut down the target entity is still in
progress (that is, the entity has not yet reached the locked administrative state). As
soon as the shutdown administrative operation completes, other administrative operations
such as lock or unlock can be invoked; they can interrupt the shutdown procedure
and force the target entity into a locked or an unlocked administrative state."

I think we should do the following for consistent behaviour:
1) For the lock operation on SI, we should not revert it, irrespective of faults.
2) For the shutdown operation on SI also, we should not revert it if some fault occurs.

What do you think?
I would also like other maintainers to comment on it.

Thanks,
Praveen

Minh Hon Chau - 2016-10-24

Hi Praveen,

I agree with both 1) and 2): lock and shutdown SI operations should not be reverted in case of a fault. The reason (I think) is that when a lock/shutdown operation is issued on an SI, the application is most likely meant to stop providing service, which could involve releasing resources, closing connections, and so on. Reverting to UNLOCKED with an active assignment would force the application to continue providing service, which could end up in many unhandled cases in applications.

The migration from quiesced to active on page 83 is, I think, for failover during si-swap, where an error happens at the current STANDBY SU after the ACTIVE SU has been quiesced.

I would also like to hear from other maintainers.

Thanks,
Minh

Minh Hon Chau - 2016-11-10

The lock operation does not behave consistently across SGs when a failover occurs during the lock command. Marking this ticket as a defect for the future milestone.

Minh Hon Chau - 2016-11-10

summary: AMF: Rollback admin shutdown SI operation if node failover --> AMF: Rollback admin shutdown/lock SI operation if node failover

Type: discussion --> defect

Milestone: 5.2.FC --> future

Nagendra Kumar - 2017-01-18

Here is an update for 2N:
1. In case of comp f/o:
- with SI Dep configured, admin state is locked.
- without SI Dep configured, admin state is unlocked.
2. In case of SU f/o:
- with or without SI Dep configured, admin state is unlocked.
- with single SI assignment, the admin state is locked.
3. In case of node f/o: same as SU f/o.

So I think that, in different situations, AMF can mark a different state depending on how it can recover from the fault. But when the admin state is marked UNLOCKED, the admin op result should be sent as TIMEOUT.
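
The list above can be read as a small lookup. The following is only a sketch of that reading, not AMFD code: the function names, the fault labels, and the string result codes are assumptions made for illustration, and returning "OK" for the LOCKED case is assumed rather than stated in this comment.

```python
from enum import Enum


class Admin(Enum):
    LOCKED = "LOCKED"
    UNLOCKED = "UNLOCKED"


def admin_state_2n(fault, si_dep, single_assignment=False):
    """SI admin state in 2N after a fault interrupts the admin operation.

    fault is one of "comp-f/o", "su-f/o", "node-f/o" (labels are
    hypothetical). single_assignment is only mentioned for SU and node
    failover in the list above, so it is ignored for comp-f/o.
    """
    if fault == "comp-f/o":
        return Admin.LOCKED if si_dep else Admin.UNLOCKED
    if fault in ("su-f/o", "node-f/o"):
        # SI dependency makes no difference here; a single SI assignment does.
        return Admin.LOCKED if single_assignment else Admin.UNLOCKED
    raise ValueError(f"unknown fault: {fault}")


def admin_op_result(state, rollback_code="TIMEOUT"):
    """Result for the admin op: an error code when the state was rolled back.

    "OK" for LOCKED is an assumption of this sketch.
    """
    return "OK" if state is Admin.LOCKED else rollback_code


if __name__ == "__main__":
    for fault in ("comp-f/o", "su-f/o", "node-f/o"):
        for si_dep in (True, False):
            state = admin_state_2n(fault, si_dep)
            print(fault, "SI Dep" if si_dep else "no SI Dep",
                  state.value, admin_op_result(state))
```

The rollback code is a parameter so the same sketch covers whichever error code is finally chosen.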

Nagendra Kumar - 2017-01-18

I will check the admin op return code and will try to fix them.

Nagendra Kumar - 2017-01-18

Please comment.

Minh Hon Chau - 2017-01-19

Hi Nagu,

Section 3.6.2 (Compliance Table) of the AMF PR doc says:

In 2N model, deviations related to shutdown operation on SI:
1) During shutdown operation on SI if component faults and it leads to component-failover, sufailover or nodefailover, then shutdown operation will not be completed and SI will remain in assigned state.
2) There is, however, one deviation when SI dependency is configured and component's fault leads to component-failover. In this case SI will go to locked state with no assignments.

It looks like there is only one case, where SI dependency is configured and the fault is escalated to component-failover, in which the SI goes to LOCKED. In all other cases the SI remains in the ASSIGNED state; I assume the SI has to be UNLOCKED to stay ASSIGNED.

Comparing with your findings, is there something we have to do about "single SI assignment, the admin state is locked." for SU f/o and node f/o?

I think it's a good idea to return other codes (TIMEOUT, ...?) when error escalation rolls back the shutdown command.

Another question: do you know the use case or technical problem behind this deviation/inconsistency? It would be good to document it in the README, at least for code maintenance purposes (or to know what to do in similar cases for other SGs).

thanks,
Minh

Nagendra Kumar - 2017-01-23

status: unassigned --> accepted

assigned_to: Nagendra Kumar

Milestone: future --> 5.2.FC

Nagendra Kumar - 2017-01-23

> Comparing with your findings, is there something we have to do about "single SI assignment, the admin state is locked." for SU f/o and node f/o?
No. I think that if it is a single SI assignment, we can return Success and mark the admin state as Locked.
> Another question: do you know the use case or technical problem behind this deviation/inconsistency?
It is for ease of flow. For a single SI assignment, we can easily mark the SI locked and remove the assignment from the active SU. For SIs having two assignments, the 2N red model reuses the SU switchover code, which leaves the SI in unlocked state.

Nagendra Kumar - 2017-01-23

I will provide a fix for the 2N red model in the 5.2 release. The fix is to return TIMEOUT for the admin shutdown failure cases where the shutdown admin op gets reverted and the admin state is rolled back to Unlocked.

Nagendra Kumar - 2017-01-30

> I think it's a good idea to return other codes (TIMEOUT, ...?) when error escalation rolls back the shutdown command.
I think it is better to return TRY_AGAIN, as the spec's description of it leaves some margin for an error having occurred:
"SA_AIS_ERR_TRY_AGAIN - The service cannot be provided at this time. The client
may retry later. This error generally should be returned when the requested action is
valid but not currently possible, probably because another operation is acting upon
the logical entity on which the administrative operation is invoked. Such an operation
can be another administrative operation or an error recovery initiated by the Availability
Management Framework."
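
As a sketch of what this means for a caller (not part of the patch), a client that receives TRY_AGAIN can simply reissue the operation later. Everything named here is hypothetical: `invoke` stands in for whatever issues the admin operation and returns the result name as a string, and the retry count and delay are arbitrary.

```python
import time


def invoke_with_retry(invoke, retries=3, delay_s=0.0, retryable=("TRY_AGAIN",)):
    """Call invoke() until the result is not retryable or retries run out.

    invoke is a hypothetical zero-argument callable that issues the admin
    operation and returns its result name, e.g. "OK" or "TRY_AGAIN".
    Returns the last result seen.
    """
    result = invoke()
    for _ in range(retries):
        if result not in retryable:
            break
        time.sleep(delay_s)  # give the ongoing recovery time to finish
        result = invoke()
    return result


if __name__ == "__main__":
    # Simulated operation: rolled back twice, then succeeds.
    outcomes = iter(["TRY_AGAIN", "TRY_AGAIN", "OK"])
    print(invoke_with_retry(lambda: next(outcomes)))
```

Passing `retryable=("TRY_AGAIN", "TIMEOUT")` would treat both candidate codes the same way from the caller's side.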

Nagendra Kumar - 2017-02-01

Any comments? I am preparing the patch with TRY_AGAIN.

- Minh Hon Chau - 2017-02-02
  
  Hi Nagu,
  
  I would prefer not to roll back the operations (as Praveen commented earlier) if the rollback is due to internal implementation rather than a specific use case. Anyway, if we have no way to correct it, then we have to accept it. I don't have a clear view on which error code should be returned; both TRY_AGAIN and TIMEOUT seem OK, since the caller will have to retry the operation.
  
  Thanks,
  Minh
  
  - Nagendra Kumar - 2017-02-02
    
    Hi Minh,
    
    I also would prefer not to roll back, but because of the internal implementation I am just reusing the existing code. I prefer TRY_AGAIN as the return code because an error has occurred and the operation can't be completed.
    
    Thanks
    
    -Nagu
    
    [tickets:#2133] AMF: Rollback admin shutdown/lock SI operation if node failover (https://sourceforge.net/p/opensaf/tickets/2133/)
    
    Status: accepted
    Milestone: 5.2.FC
    Created: Thu Oct 20, 2016 06:49 PM UTC by Minh Hon Chau
    Last Updated: Wed Feb 01, 2017 08:50 AM UTC
    Owner: Nagendra Kumar
    
    In the scenario of shutting down an SI, delay the QUIESCING CSI callback, then reboot the node hosting the SU that has this CSI callback pending. The result of this operation differs between SGs:
    - For 2N: the SI admin state is rolled back to UNLOCKED
    - For N-Way: the SI admin state moves to LOCKED
    - For NpM: not tested; from browsing SG_NPM::node_fail_si_oper, it looks like the SI admin state is rolled back to UNLOCKED
    
    My question is whether the results of these scenarios should be consistent, and what the expected outcome is.
    Also, the handling in node_fail_si_oper for admin lock is not consistent: for 2N the admin state remains LOCKED, while NpM rolls back to UNLOCKED.
    

Nagendra Kumar - 2017-02-01

Type: defect --> enhancement

Nagendra Kumar - 2017-02-01

Changing to enhancement, as it changes the return codes of a few of the admin operations mentioned above.

Nagendra Kumar - 2017-02-07

status: accepted --> review

Nagendra Kumar - 2017-02-07

Sent patch for review with the above implementation.

Nagendra Kumar - 2017-02-21

status: review --> fixed

Nagendra Kumar - 2017-02-21

changeset: 8592:f13798019501
tag: tip
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue Feb 21 10:01:48 2017 +0530
summary: amfd: return TRY_AGAIN on rollback of shutdown admin op [#2133]

[staging:f13798]

Documentation Changes:
changeset: 208:f21e52b1f0d1
tag: tip
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue Feb 21 10:28:42 2017 +0530
summary: amf: deviations on SI shutdown [#2133]

[staging:f21e52]

Related

Commit: [f13798]
Tickets: ~~#2133~~
