Scenario: shut down an SI, delay the QUIESCING CSI callback, then reboot the node hosting the SU that has this CSI callback pending. The result of this operation differs between SGs:
- For 2N: the SI admin state is rolled back to UNLOCKED
- For N-Way: the SI admin state moves to LOCKED
- For NpM: not tested; from reading SG_NPM::node_fail_si_oper, the SI admin state appears to be rolled back to UNLOCKED
My question is whether the results of these scenarios should be consistent, and what the expected outcome is.
Also, the handling of node_fail_si_oper for admin lock is not consistent: for 2N the admin state remains LOCKED, while NpM rolls back to UNLOCKED.
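To make the inconsistency explicit, the observations above can be tabulated. This is only an illustrative sketch in Python: the dictionary and helper names are hypothetical and are not OpenSAF code, and the NpM shutdown entry comes from code reading rather than a test.

```python
# SI admin state observed after a node failover interrupts the admin operation,
# as reported in this ticket. All names here are illustrative, not OpenSAF APIs.
OBSERVED = {
    # (admin operation, redundancy model): resulting SI admin state
    ("shutdown", "2N"): "UNLOCKED",
    ("shutdown", "N-Way"): "LOCKED",
    ("shutdown", "NpM"): "UNLOCKED",  # from reading SG_NPM::node_fail_si_oper, untested
    ("lock", "2N"): "LOCKED",
    ("lock", "NpM"): "UNLOCKED",
}


def outcomes(operation):
    """Return {model: state} for one admin operation."""
    return {model: state for (op, model), state in OBSERVED.items() if op == operation}


def is_consistent(operation):
    """True when every redundancy model ends in the same admin state."""
    return len(set(outcomes(operation).values())) == 1


for op in ("shutdown", "lock"):
    verdict = "consistent" if is_consistent(op) else "inconsistent"
    print(op, outcomes(op), verdict)
```

Both operations come out as inconsistent with the data reported here, which is the question this ticket raises.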
Hi Minh,
The lock operation on an SI deviates from the spec in all redundancy models. The spec does not mention giving a quiesced callback for the lock operation (section 10.4, page 404). There is already a ticket for this.
I just checked the code. In the SI lock case, for the 2N model AMFD issues the removal and the quiesced assignment simultaneously. For NpM and N-Way, it initially issues only the quiesced assignment, and removes all assignments after successfully receiving the quiesced response. So in these two models, I think, the operation gets reverted if the quiesced state fails. I do not know whether the 2N behaviour was like this from the beginning or was changed by some fix.
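The difference in ordering can be written down as a small model. This is a hedged sketch of my reading of the code, not the AMFD implementation: the function names and the step labels "quiesced" and "remove" are made up for illustration.

```python
def lock_si_steps(model):
    """Ordered assignment steps for an SI lock, per the code reading above.

    Each list element is the set of actions issued together in one step.
    """
    if model == "2N":
        # Removal and quiesced assignment are issued at the same time.
        return [{"quiesced", "remove"}]
    if model in ("NpM", "N-Way"):
        # Quiesced first; removal only after a successful quiesced response.
        return [{"quiesced"}, {"remove"}]
    raise ValueError(f"unknown redundancy model: {model}")


def removal_waits_for_quiesced(model):
    """True when removal is a separate step that follows the quiesced step."""
    return "remove" not in lock_si_steps(model)[0]


for model in ("2N", "NpM", "N-Way"):
    print(model, removal_waits_for_quiesced(model))
```

Where removal waits for the quiesced response, a failure in the quiesced step leaves a point at which the operation could still be reverted, which may explain the NpM and N-Way behaviour.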
The shutdown operation sequence is the same in all three redundancy models, as per the spec. In the 2N model, there are different behaviours in case of faults and in case of SI dependencies. Some of these are documented in the AMF PR doc, section 3.6.2, conformance table. There are at least 2 tickets related to this inconsistency.
I think that for the lock operation on an SI, the spec does not talk about reverting it. However, there is one diagram on page 83 where a quiesced-to-active migration is shown. But I think this must be the case of a shutdown or si-swap operation. For the shutdown operation, there is guidance in spec chapter 9, section 9.1, which says:
"Note that the shutdown administrative operation is non-blocking, which means that it
may complete while the actual procedure to shut down the target entity is still in
progress (that is, the entity has not yet reached the locked administrative state). As
soon as the shutdown administrative operation completes, other administrative operations
such as lock or unlock can be invoked; they can interrupt the shutdown procedure
and force the target entity into a locked or an unlocked administrative state."
I think we should do the following for consistent behaviour:
1) For a lock operation on an SI, we should not revert it, irrespective of faults.
2) For a shutdown operation on an SI also, we should not revert it if a fault occurs.
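A minimal sketch of the proposed policy, in Python for brevity. The class and method names are hypothetical and this is not AMFD code. It also assumes that an interrupted shutdown is completed and ends in LOCKED, which is my reading of "not revert" and matches what N-Way was observed to do.

```python
from enum import Enum


class AdminState(Enum):
    UNLOCKED = "UNLOCKED"
    SHUTTING_DOWN = "SHUTTING_DOWN"
    LOCKED = "LOCKED"


class ServiceInstance:
    """Toy model of an SI's administrative state (illustrative only)."""

    def __init__(self):
        self.admin_state = AdminState.UNLOCKED

    def shutdown(self):
        # Non-blocking: the operation returns while the SI is still SHUTTING_DOWN.
        self.admin_state = AdminState.SHUTTING_DOWN

    def lock(self):
        # May interrupt a shutdown in progress (see the spec note quoted above).
        self.admin_state = AdminState.LOCKED

    def unlock(self):
        # May also interrupt a shutdown in progress.
        self.admin_state = AdminState.UNLOCKED

    def on_fault(self):
        # Proposed policy: a fault never rolls the SI back to UNLOCKED.
        # Assumption: an interrupted shutdown is completed, ending in LOCKED.
        if self.admin_state is AdminState.SHUTTING_DOWN:
            self.admin_state = AdminState.LOCKED


si = ServiceInstance()
si.shutdown()
si.on_fault()
print(si.admin_state.value)
```

Only an explicit unlock from the administrator would bring the SI back to UNLOCKED in this model.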
What do you think?
Also, I would like other maintainers to comment on it.
Thanks,
Praveen
Hi Praveen,
I agree with both 1) and 2): lock and shutdown SI operations should not be reverted in case of a fault. The reason (I think) is that when a lock/shutdown SI operation is issued, it likely means the application declines to provide service, which could involve releasing resources, closing connections, and so on. Reverting to UNLOCKED with an active assignment would force the application to continue providing service, which could lead to many unhandled cases in applications.
On page 83, the migration from quiesced to active is, I think, for failover during si-swap, where an error happens on the current STANDBY SU after the ACTIVE SU has been quiesced.
I would also like to hear from other maintainers.
Thanks,
Minh
The lock operation does not behave consistently between SGs when a failover occurs during the lock command. Marking this ticket as a defect for the future.
Here is an update for 2N:
1. In case of Comp-f/o,
- with SI Dep configured, admin state is locked
- without SI Dep configured, admin state is unlocked.
2. In case of SU f/o:
- with or without SI Dep configured, admin state is unlocked.
- with single SI assignment, the admin state is locked.
3. In case of node f/o: same as SU f/o.
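The 2N findings above can be summarised as a lookup. This is an illustrative Python sketch only: the function and parameter names are hypothetical, and it assumes that for SU and node failover the single-SI-assignment case takes precedence over the SI dependency setting, which the list does not spell out.

```python
def admin_state_2n(fault, si_dep=False, single_si_assignment=False):
    """SI admin state observed in 2N after a fault interrupts the admin operation.

    fault: "comp", "su" or "node" (the type of failover that was triggered).
    """
    if fault == "comp":
        # Component failover: the outcome depends on SI dependency configuration.
        return "LOCKED" if si_dep else "UNLOCKED"
    if fault in ("su", "node"):
        # SU failover, and node failover which behaves the same way.
        # Assumption: a single SI assignment wins regardless of si_dep.
        return "LOCKED" if single_si_assignment else "UNLOCKED"
    raise ValueError(f"unknown fault type: {fault}")


for fault in ("comp", "su", "node"):
    for si_dep in (True, False):
        print(fault, "si_dep" if si_dep else "no si_dep", admin_state_2n(fault, si_dep))
```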
So I think that in different situations AMF can set a different state, depending on how it can recover from the fault. But when the admin state is set to UNLOCKED, the admin op result should be returned as TIMEOUT.
I will check the admin op return code and will try to fix them.
Please comment.
Hi Nagu,
Section 3.6.2 of the Compliance Table in the AMF PR doc says:
In 2N model, deviations related to shutdown operation on SI:
1) During shutdown operation on SI if component faults and it leads to component-failover, sufailover or nodefailover, then shutdown operation will not be completed and SI will remain in assigned state.
2) There is, however, one deviation when SI dependency is configured and component's fault leads to component-failover. In this case SI will go to locked state with no assignments.
It looks like there is only one case, where SI dependency is configured and the fault escalates to component-failover, in which the SI goes to LOCKED. In all other cases the SI remains in ASSIGNED state; I assume the SI must be UNLOCKED to have ASSIGNED state.
Compared to your findings, is there something we have to do about "with single SI assignment, the admin state is locked" for SU f/o and node f/o?
I think it's a good idea to return another code (TIMEOUT, ...?) when error escalation rolls back the shutdown command.
Another question: do you know the use case or technical problem behind this deviation/inconsistency? It would be good to document it in the README, at least for code maintenance purposes (or to know what to do in similar cases for other SGs).
thanks,
Minh
I will provide a fix for the 2N redundancy model in the 5.2 release. The fix would be to return TIMEOUT for failed admin shutdown cases where the shutdown admin op gets reverted and the admin state is rolled back to UNLOCKED.
Any comments? I am preparing the patch with TRY_AGAIN.
Hi Nagu,
I prefer not to roll back the operations (as Praveen commented earlier) if the rollback is due to internal implementation rather than a specific use case. Anyway, if we have no way to correct it, we have to accept it. I don't have a clear view on which error code should be returned; both TRY_AGAIN and TIMEOUT seem OK since the caller will have to retry the operation.
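From the caller's side, either code leads to the same handling. A rough sketch in Python, assuming a hypothetical `invoke` callable that issues the admin operation and returns the result code as a string; this is not the IMM or AMF client API.

```python
import time

# Result codes after which the caller is expected to retry the admin operation.
RETRYABLE = {"TRY_AGAIN", "TIMEOUT"}


def invoke_with_retry(invoke, attempts=3, delay_s=0.0):
    """Call invoke() until it returns a non-retryable code or attempts run out.

    Returns the last result code seen.
    """
    rc = None
    for attempt in range(attempts):
        rc = invoke()
        if rc not in RETRYABLE:
            return rc
        if attempt < attempts - 1:
            time.sleep(delay_s)  # back off before the next attempt
    return rc


# Example: the first two attempts are rolled back, the third succeeds.
results = iter(["TRY_AGAIN", "TIMEOUT", "OK"])
print(invoke_with_retry(lambda: next(results)))
```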
Thanks,
Minh
Hi Minh,
Thanks
-Nagu
[tickets:#2133] AMF: Rollback admin shutdown/lock SI operation if node failover (https://sourceforge.net/p/opensaf/tickets/2133/)
Status: accepted
Milestone: 5.2.FC
Created: Thu Oct 20, 2016 06:49 PM UTC by Minh Hon Chau
Last Updated: Wed Feb 01, 2017 08:50 AM UTC
Owner: Nagendra Kumar
Changing to enhancement, as it changes the return codes of a few of the admin operations mentioned above.
Hi Minh,
I also would prefer not to roll back, but because of the internal implementation I am just reusing the code. I prefer TRY_AGAIN as the return code because an error has occurred and the operation can't be completed.
Thanks
-Nagu
Sent patch for review with the above implementation.
changeset: 8592:f13798019501
tag: tip
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue Feb 21 10:01:48 2017 +0530
summary: amfd: return TRY_AGAIN on rollback of shutdown admin op [#2133]
[staging:f13798]
Documentation Changes:
changeset: 208:f21e52b1f0d1
tag: tip
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Tue Feb 21 10:28:42 2017 +0530
summary: amf: deviations on SI shutdown [#2133]
[staging:f21e52]
Related
Commit: [f13798]
Tickets:
#2133