OS : Suse 64bit
Changeset : 8701 ( 5.2.RC1)
4 nodes setup(2 controller and 2 payload)
clm admin operation returns SA_AIS_ERR_BAD_OPERATION after rebooting node
SLES-SLOT1:~ # amf-adm unlock safNode=PL-3,safCluster=myClmCluster
Mar 15 14:35:20 SLES-SLOT1 osafclmd[2763]: ER clms_imm_node_unlock failed
Mar 15 14:35:20 SLES-SLOT1 osafamfd[2773]: NO Received node_up from 2030f: msg_id 1
Mar 15 14:35:20 SLES-SLOT1 osafamfd[2773]: NO Node 'PL-3' joined the cluster
Mar 15 14:35:20 SLES-SLOT1 osafimmnd[2733]: NO Implementer connected: 197 (MsgQueueService131855) <0, 2030f>
error - command timed out (alarm)
SLES-SLOT1:~ # amf-adm lock safNode=PL-3,safCluster=myClmCluster
error - saImmOmAdminOperationInvoke_2 admin-op RETURNED: SA_AIS_ERR_BAD_OPERATION (20)
SLES-SLOT1:~ # amf-adm unlock safNode=PL-3,safCluster=myClmCluster
error - saImmOmAdminOperationInvoke_2 admin-op RETURNED: SA_AIS_ERR_BAD_OPERATION (20)
Traces:
From the traces:
Node PL-3 joined the cluster
Mar 15 14:35:20.373997 osafclmd [2763:src/clm/clmd/clms_imm.c:0939] >> clms_imm_admin_op_callback: Admin callback for nodename:safNode=PL-3,safCluster=myClmCluster, opId:1
Mar 15 14:35:20.374002 osafclmd [2763:src/clm/clmd/clms_util.c:0038] >> clms_node_get_by_name: name input safNode=PL-3,safCluster=myClmCluster length 36
Mar 15 14:35:20.374006 osafclmd [2763:src/clm/clmd/clms_util.c:0046] TR nodename after patricia tree get safNode=PL-3,safCluster=myClmCluster
Mar 15 14:35:20.374009 osafclmd [2763:src/clm/clmd/clms_util.c:0049] << clms_node_get_by_name
Mar 15 14:35:20.374012 osafclmd [2763:src/clm/clmd/clms_imm.c:2223] >> clms_imm_node_unlock: Node name safNode=PL-3,safCluster=myClmCluster to unlock
Mar 15 14:35:20.374015 osafclmd [2763:src/clm/clmd/clms_imm.c:0579] >> clms_admin_state_update_rattr: Admin state 1 update for node safNode=PL-3,safCluster=myClmCluster
Mar 15 14:35:20.374018 osafclmd [2763:src/clm/clmd/clms_util.c:0038] >> clms_node_get_by_name: name input safNode=PL-3,safCluster=myClmCluster length 36
Mar 15 14:35:20.374021 osafclmd [2763:src/clm/clmd/clms_util.c:0046] TR nodename after patricia tree get safNode=PL-3,safCluster=myClmCluster
......
......
but Sending track callback failed for SA_CLM_CHANGE_COMPLETED
Mar 15 14:35:20.380860 osafclmd [2763:src/clm/clmd/clms_imm.c:1439] TR callback msg send to clma failed
Mar 15 14:35:20.380869 osafclmd [2763:src/clm/clmd/clms_imm.c:1447] << clms_prep_and_send_track
Mar 15 14:35:20.380872 osafclmd [2763:src/clm/clmd/clms_imm.c:1220] TR Sending track callback failed for SA_CLM_CHANGE_COMPLETED
Mar 15 14:35:20.380875 osafclmd [2763:src/clm/clmd/clms_imm.c:1380] >> clms_prep_and_send_track
and later performed admin operation got failed as 'Another Admin operation already in progress'
Mar 15 14:51:21.878688 osafclmd [2763:src/clm/clmd/clms_imm.c:0939] >> clms_imm_admin_op_callback: Admin callback for nodename:safNode=PL-3,safCluster=myClmCluster, opId:2
Mar 15 14:51:21.878700 osafclmd [2763:src/clm/clmd/clms_util.c:0038] >> clms_node_get_by_name: name input safNode=PL-3,safCluster=myClmCluster length 36
Mar 15 14:51:21.878712 osafclmd [2763:src/clm/clmd/clms_util.c:0046] TR nodename after patricia tree get safNode=PL-3,safCluster=myClmCluster
Mar 15 14:51:21.878720 osafclmd [2763:src/clm/clmd/clms_util.c:0049] << clms_node_get_by_name
Mar 15 14:51:21.878726 osafclmd [2763:src/clm/clmd/clms_imm.c:0982] TR Another Admin operation already in progress: 4
Notes:
Diff:
Analysis:
There seems problem with atleast one CLM clinet which resides on PL-3. CLM is unable to send any message to this client.
1) CLMD creates this client as standby:
2) Mar 15 13:47:47.080883 osafclmd [2763:src/clm/clmd/clms_evt.c:0140] TR client_id: 63 lookup failed
Mar 15 13:47:47.080886 osafclmd [2763:src/clm/clmd/clms_evt.c:0250] >> clms_client_new: MDS dest 2030feebec01a
Mar 15 13:47:47.080888 osafclmd [2763:src/clm/clmd/clms_evt.c:0277] << clms_client_new: client_id 63
2) When user performs admin operation, CLMD tries to send track callback for complete step to this client but mds returns failure:
3) Mar 15 14:32:10.759655 osafclmd [2763:src/clm/clmd/clms_util.c:1095] TR Client ID 63 ,track_flags=3
Mar 15 14:32:10.759658 osafclmd [2763:src/clm/clmd/clms_imm.c:1380] >> clms_prep_and_send_track
Mar 15 14:32:10.759661 osafclmd [2763:src/clm/clmd/clms_util.c:0352] >> clms_nodedb_lookup
Mar 15 14:32:10.759664 osafclmd [2763:src/clm/clmd/clms_util.c:0354] TR patricia tree size 4
Mar 15 14:32:10.759667 osafclmd [2763:src/clm/clmd/clms_util.c:0149] TR Node found 131343
Mar 15 14:32:10.759670 osafclmd [2763:src/clm/clmd/clms_util.c:0149] TR Node found 131599
Mar 15 14:32:10.759673 osafclmd [2763:src/clm/clmd/clms_util.c:0149] TR Node found 131855
Mar 15 14:32:10.759676 osafclmd [2763:src/clm/clmd/clms_util.c:0149] TR Node found 132111
Mar 15 14:32:10.759687 osafclmd [2763:src/clm/clmd/clms_util.c:0375] TR num_nd_changes 4
Mar 15 14:32:10.759689 osafclmd [2763:src/clm/clmd/clms_util.c:0376] << clms_nodedb_lookup
Mar 15 14:32:10.759693 osafclmd [2763:src/clm/clmd/clms_mds.c:1494] >> clms_mds_msg_send
Mar 15 14:32:10.759728 osafclmd [2763:src/clm/clmd/clms_mds.c:1525] IN mds send returned: 2
Mar 15 14:32:10.759732 osafclmd [2763:src/clm/clmd/clms_mds.c:1527] << clms_mds_msg_send
Mar 15 14:32:10.759735 osafclmd [2763:src/clm/clmd/clms_imm.c:1439] TR callback msg send to clma failed
4) Before admin operation on PL-3, this node was restarted. There is no evidence of this client going down in clmd traces.
5) When unlock operation was performed, CLMD again could not send membeship status to this client and did not reply to IMM. Also admin op params are not reset.. Since admin operation params are not reset, no further admin operation are not allowed and getting timed out.
There is a way to reproduce this problem.
1) Bring two controllers up.
2) On standby controller, run 5 process of a CLM application that subscribes for track callback.
3) CLM lock of standby controller.
4) Now try to perform CLM unlock of standby controller and kill 2 application instances on standby controller.
5) During unlock CLMS sends memebership status updates to agents hosted on the node being unlocked. Since CLMS has not processed MDS_DOWN events, MDS will return send failure.
Attached is CLMD traces after reproducing in this way.
commit 66970f59421f9d4338ee6d13134afca9082c1e91
Author: Praveen praveen.malviya@oracle.com
Date: Fri Apr 21 14:31:19 2017 +0530
clms: return TIME_OUT for unlock op if CLMS update to CLM agent fails [#2381]
changeset: 8775:10bbd3156a40
tag: tip
user: Praveen Malviya praveen.malviya@oracle.com
date: Fri Apr 21 14:45:15 2017 +0530
summary: clms: return TIME_OUT for unlock op if CLMS update to CLM agent fails [#2381]
Related
Tickets:
#2381Pushed in released branch with revision:
commit a79f101873dffd145aa70d9cb4eb3c99b8ffd4ca
Author: Praveen praveen.malviya@oracle.com
Date: Fri Apr 21 14:31:19 2017 +0530