During unlock nodegroup operation, if a su has active assignment to fail over, amfd will send assingment susi message to move ha state from standby assignment of the other su to active. This susi message is sent but the su_cnt_admin_oper is not increased. However, when amfd receives assignment response, amfd decreases this su_cnt_admin_oper. At the end, the su_cnt_admin_oper is ~0, thus the admin operation timeout.
Below is outlined amfd trace
Su failover
Sep 11 7:51:09.337401 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:3379] >> nodefail: 'safSu=PL-3,safSg=2N,safApp=ABC-sv.SCD', 0
Sep 11 7:51:09.337576 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:2443] >> avdsgsusimodsnd: 'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD', state 1
Sep 11 7:51:09.337580 osafamfd [8907:8907:../../opensaf/src/amf/amfd/su.cc:2601] >> anysusifsmin: SU:'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD', checkfsm:1
Sep 11 7:51:09.337583 osafamfd [8907:8907:../../opensaf/src/amf/amfd/su.cc:2606] TR SUSI:'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD,safSi=sv.SCD-2N-1,safApp=ABC-sv.SCD', fsm:'3'
susi msg is sent but su_cnt_admin_oper is not increased
*Sep 11 7:51:09.338296 osafamfd [8907:8907:../../opensaf/src/amf/amfd/util.cc:0693] >> avdsndsusimsg
Sep 11 7:51:09.338310 osafamfd [8907:8907:../../opensaf/src/amf/amfd/util.cc:0967] TR Sending 15 to 2040f
...
Sep 11 7:51:09.338457 osafamfd [8907:8907:../../opensaf/src/amf/amfd/util.cc:0978] << avdsndsusimsg *
receive susi response, but su_cnt_admin_oper is decreased, it's 0. But amfd still waits for another susi response
Sep 11 7:51:09.359295 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1172] >> avdsusiassignevh: id:745, node:2040f, act:5, 'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD', '', ha:1, err:1, single:0
Sep 11 7:51:09.359600 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:2433] >> susisuccess: 'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD' act=5, hastate=1, sgfsmstate=1
Sep 11 7:51:09.359605 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:1498] >> susisuccesssgrealign: 'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD' act=5, state=1
Sep 11 7:51:09.359609 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg.cc:1847] TR safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD found in safSg=2N,safApp=ABC-sv.SCD
Sep 11 7:51:09.359615 osafamfd [8907:8907:../../opensaf/src/amf/amfd/siass.cc:0120] TR HA State ACTIVE of safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD for safSi=sv.SCD-2N-1,safApp=ABC-sv.SCD
...
Sep 11 7:51:09.362551 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:2648] << susisuccess: rc:1
Sep 11 7:51:09.362585 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1682] TR node:'safAmfNode=PL-4,safAmfCluster=myAmfCluster', sucntadminoper:0
Sep 11 7:51:09.362594 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0359] >> processsusiresponseforng: 'safSu=PL-4,safSg=2N,safApp=ABC-sv.SCD'
Sep 11 7:51:09.362599 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0430] TR nodeoperlist size:2
Sep 11 7:51:09.362603 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0458] << processsusiresponseforng
another susi response comes, the su_cnt_admin_oper decreases as normall, but its value now ~0. PL-4 is removed out of the list
Sep 11 7:51:09.670160 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1172] >> avdsusiassignevh: id:748, node:2040f, act:2, 'safSu=PL-4,safSg=NWayActive,safApp=ABC-ORE', 'safSi=All-NWayActive,safApp=ABC-ORE', ha:1, err:1, single:0
Sep 11 7:51:09.670414 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgnwayactfsm.cc:0599] - >> susisuccess: 1
Sep 11 7:51:09.670518 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgnwayactfsm.cc:1088] << susisuccess
Sep 11 7:51:09.670524 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1682] TR node:'safAmfNode=PL-4,safAmfCluster=myAmfCluster', sucntadminoper:4294967295
Sep 11 7:51:09.670528 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0359] >> processsusiresponseforng: 'safSu=PL-4,safSg=NWayActive,safApp=ABC-ORE'
Sep 11 7:51:09.670532 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0458] << processsusiresponseforng
amf continues create another susi assignment, PL-4 is added back to list, su_cnt_admin_oper increase by one, which is 0
Sep 11 7:51:14.534857 osafamfd [8907:8907:../../opensaf/src/amf/amfd/siass.cc:0475] >> avdsusicreate: safSu=PL-4,safSg=2N,safApp=ABC-min.minm safSi=min.minm-2N-1,safApp=ABC-min.minm state=2
Sep 11 7:51:14.536717 osafamfd [8907:8907:../../opensaf/src/amf/amfd/util.cc:0693] >> avdsndsusimsg
Sep 11 7:51:14.536940 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0269] TR node:'safAmfNode=PL-4,safAmfCluster=myAmfCluster', sucntadminoper:0
Sep 11 7:51:14.536946 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0272] TR nodeoperlist size:2
Sep 11 7:51:14.537445 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:2648] << susisuccess: rc:1
Sep 11 7:51:14.537453 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1682] TR node:'safAmfNode=SC-1,safAmfCluster=myAmfCluster', sucntadminoper:0
Sep 11 7:51:14.537457 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0359] >> processsusiresponseforng: 'safSu=SC-1,safSg=2N,safApp=ABC-min.minm'
Sep 11 7:51:14.537462 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0430] TR nodeoperlist size:1
Sep 11 7:51:14.537465 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0458] << processsusiresponseforng
amf receives the last susi response, su_cnt_admin_oper decrease by 1, which is ~0. It's not zero, so PL-4 is not removed out of the list.
Sep 11 7:51:14.544879 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1172] >> avdsusiassignevh: id:754, node:2040f, act:2, 'safSu=PL-4,safSg=2N,safApp=ABC-min.minm', 'safSi=min.minm-2N-1,safApp=ABC-min.minm', ha:2, err:1, single:0
Sep 11 7:51:14.545179 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:2433] >> susisuccess: 'safSu=PL-4,safSg=2N,safApp=ABC-min.minm' act=2, hastate=2, TEST sgfsmstate=1
Sep 11 7:51:14.545183 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sg2nfsm.cc:1498] >> susisuccesssgrealign: 'safSu=PL-4,safSg=2N,safApp=ABC-min.minm' act=2, state=2
Sep 11 7:51:14.545994 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:1682] TR node:'safAmfNode=PL-4,safAmfCluster=myAmfCluster', sucntadminoper:4294967295
Sep 11 7:51:14.545998 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0359] >> processsusiresponseforng: 'safSu=PL-4,safSg=2N,safApp=ABC-min.minm'
Sep 11 7:51:14.546002 osafamfd [8907:8907:../../opensaf/src/amf/amfd/sgproc.cc:0458] << processsusiresponseforng
A similiar issue of unlock nodegroup operation, but different cause.
.#2526 amfd: Command unlock nodegroup timeout if su failover is escalated (>= 2SIs)
Diff:
Please provide Configuration, steps to reproduce and attach AMFD logs,
A possible fix for this?
diff --git a/src/amf/amfd/sgproc.cc b/src/amf/amfd/sgproc.cc
index 610c205..e9cb4a4 100644
--- a/src/amf/amfd/sgproc.cc
+++ b/src/amf/amfd/sgproc.cc
@@ -2633,6 +2633,13 @@ void avd_su_role_failover(AVD_SU su, AVD_SU stdby_su) {
* So perform role failover
/
rc = avd_sg_su_si_mod_snd(avd_cb, stdby_su, SA_AMF_HA_ACTIVE);
+ if ((su->su_on_node->admin_ng != nullptr) &&
+ (su->su_on_node->admin_ng->admin_ng_pend_cbk.invocation != 0)) {
+ su->su_on_node->su_cnt_admin_oper++;
+ TRACE("node:'%s', su_cnt_admin_oper:%u", su->su_on_node->name.c_str(),
+ su->su_on_node->su_cnt_admin_oper);
+ }
+
if (rc == NCSCC_RC_SUCCESS) {
/ Update the dependent SI's dep_state */
avd_sidep_update_depstate_su_rolefailover(su);
Fixed with the patch:
commit c7092966942574b013f590a6f00cc2f261ab8748
Author: ravi-sekhar ravisekhar.konda@oracle.com
Date: Fri Feb 23 15:35:57 2018 +0530