OS : Suse 64bit
Changeset : 8701 ( 5.2.RC1)
6-node setup (3 controllers and 3 payloads, with SC_ABSENCE enabled)
choose CLM unlocked spare controller for standby role in failover situation
Expected:
The CLM-locked node should not get the standby role, since it is in the locked state, and SC-1 should join as standby after it recovers from the failover.
Syslog:
Mar 17 17:56:59 suseR2-S2 osafimmnd[21809]: NO Implementer (applier) connected: 28 (@safSmf_applier1) <0, 2030f>
Mar 17 17:56:59 suseR2-S2 osafamfnd[21859]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO RDE role set to STANDBY
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO Peer up on node 0x2030f
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO Got peer info request from node 0x2030f with role ACTIVE
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO Got peer info response from node 0x2030f with role ACTIVE
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 24 (change:3, dest:13)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 24 (change:5, dest:13)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 24 (change:5, dest:13)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 25 (change:3, dest:566317113647120)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 25 (change:3, dest:565213543063568)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: IN AMF HA STANDBY request
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: IN Added IMMND node with dest 566317113647120
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: IN Added IMMND node with dest 565213543063568
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA saClmClusterNodeGet failed, rc=SA_AIS_ERR_UNAVAILABLE (31)
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA proc_mds_info: SMFND UP failed
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA saClmClusterNodeGet failed, rc=SA_AIS_ERR_UNAVAILABLE (31)
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA proc_mds_info: SMFND UP failed
From Traces:
SC-2 left the cluster when the CLM lock operation was performed, and SC-1 left the cluster later when a failover was performed:
SC-2:
Mar 17 17:54:24.123134 osafamfnd [6773:src/amf/amfnd/clm.cc:0196] >> clm_track_cb: '0' '4' '1'
Mar 17 17:54:24.123142 osafamfnd [6773:src/amf/amfnd/clm.cc:0217] TR Node has left the cluster 'safNode=SC-2,safCluster=myClmCluster', avnd_cb->first_time_up 0,notifItem->clusterNode.nodeId 131599, avnd_cb->node_info.nodeId 131343
----- -----
SC-1:
Mar 17 17:57:03.514477 osafamfnd [9266:src/amf/amfnd/clm.cc:0196] >> clm_track_cb: '0' '4' '1'
Mar 17 17:57:03.514484 osafamfnd [9266:src/amf/amfnd/clm.cc:0217] TR Node has left the cluster 'safNode=SC-1,safCluster=myClmCluster', avnd_cb->first_time_up 0,notifItem->clusterNode.nodeId 131343, avnd_cb->node_info.nodeId 131855
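The traces print node IDs in decimal, while the syslog prints them in hexadecimal (for example 0x2030f). A minimal conversion sketch for correlating the two follows; the helper name `node_id_hex` is hypothetical and not part of OpenSAF.

```python
def node_id_hex(node_id: int) -> str:
    """Format a decimal node ID from the traces the way syslog prints it."""
    return f"{node_id:#x}"


# Node IDs taken from the clm_track_cb traces above.
print(node_id_hex(131599))  # nodeId reported for safNode=SC-2 -> 0x2020f
print(node_id_hex(131343))  # nodeId reported for safNode=SC-1 -> 0x2010f
print(node_id_hex(131855))  # own nodeId of the amfnd that saw SC-1 leave -> 0x2030f
```

The last value matches the peer that the syslog reports with role ACTIVE.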
After the failover, SC-2 got the standby role and SC-3 became active:
SC-2:
Mar 17 17:56:59.941081 osafamfnd [21859:src/amf/amfnd/susm.cc:1043] NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Mar 17 17:56:59.941089 osafamfnd [21859:src/amf/amfnd/err.cc:1639] >> is_no_assignment_due_to_escalations
Mar 17 17:56:59.941097 osafamfnd [21859:src/amf/amfnd/err.cc:1651] << is_no_assignment_due_to_escalations: false
Mar 17 17:56:59.941104 osafamfnd [21859:src/amf/amfnd/di.cc:0829] >> avnd_di_susi_resp_send: Sending Resp su=safSu=SC-2,safSg=2N,safApp=OpenSAF, si=safSi=SC-2N,safApp=OpenSAF, curr_state=2, prv_state=0
Mar 17 17:56:59.941112 osafamfnd [21859:src/amf/amfnd/di.cc:0839] TR curr_assign_state '3
---- ----
SC-3:
Mar 17 17:57:03.656105 osafamfnd [9266:src/amf/amfnd/susm.cc:1043] NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-3,safSg=2N,safApp=OpenSAF'
Mar 17 17:57:03.656113 osafamfnd [9266:src/amf/amfnd/err.cc:1639] >> is_no_assignment_due_to_escalations
Mar 17 17:57:03.656120 osafamfnd [9266:src/amf/amfnd/err.cc:1651] << is_no_assignment_due_to_escalations: false
Notes:
1. Syslog attached
2. amfd and amfnd traces of active, standby and spare controller attached
Hi Ritu,
I have analysed this issue. The problem occurs because SMF calls saClmClusterNodeGet() when it gets the standby assignment. The API call fails because the node is not a CLM member. This problem was already identified while fixing #1781, and an enhancement ticket was raised for SMF: #1791 "smf: use CLM cluster tracking instead of reading per node up for SMFND". Since MW assignments are not affected on a CLM-locked node, AMF giving it a fresh standby role seems justified. The problem will be fixed when SMF ticket #1791 is implemented.
However, this AMF ticket can be used for one purpose. In a failover situation, AMF changes the standby controller to active and then chooses a spare controller for fresh standby assignments. What I am observing is that even when multiple spare controllers are available, AMF chooses the CLM-locked spare controller for the fresh standby role. If one is available, AMF must choose a CLM-unlocked spare controller for fresh standby assignments. This keeps alive the possibility of a controller role swap with the si-swap admin op.
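The preference described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual amfd code: the `SpareController` type, its `clm_member` field, and `choose_standby()` are hypothetical names, and falling back to a CLM-locked spare when no unlocked one exists is an assumption based on the remark that giving standby to a CLM-locked node is otherwise acceptable.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SpareController:
    """Hypothetical view of a spare controller as seen by the selection logic."""
    name: str
    clm_member: bool  # False when the node is CLM locked (not a cluster member)


def choose_standby(spares: Sequence[SpareController]) -> Optional[SpareController]:
    """Pick a spare for the fresh standby assignment, preferring CLM members."""
    # First pass: any CLM-unlocked spare wins, regardless of its position.
    for spare in spares:
        if spare.clm_member:
            return spare
    # Assumed fallback: only CLM-locked spares (or none at all) are available.
    return spares[0] if spares else None


# Scenario from this ticket: SC-2 is CLM locked, SC-1 has recovered and is unlocked.
spares = [
    SpareController("SC-2", clm_member=False),
    SpareController("SC-1", clm_member=True),
]
chosen = choose_standby(spares)
print(chosen.name if chosen else "no spare available")
```

With the inputs above the unlocked spare is selected even though the locked one appears first in the list.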
Please change the title of the ticket to "amf: choose CLM unlocked spare controller for standby role in failover situation."
Thanks,
Praveen
Diff:
https://sourceforge.net/p/opensaf/mailman/message/35738800/
changeset: 8712:a3ba6212ecf6
branch: opensaf-5.1.x
parent: 8707:4e47c66382f3
user: Praveen Malviya praveen.malviya@oracle.com
date: Thu Mar 23 11:34:31 2017 +0530
summary: amfd: choose CLM unlocked spare controller for standby role in failover situation[#2387]
changeset: 8713:3a718e40acec
branch: opensaf-5.0.x
parent: 8708:9073359c83b4
user: Praveen Malviya praveen.malviya@oracle.com
date: Thu Mar 23 11:35:07 2017 +0530
summary: amfd: choose CLM unlocked spare controller for standby role in failover situation[#2387]
changeset: 8714:ffb6233abe8b
tag: tip
parent: 8711:262d1f2132ca
user: Praveen Malviya praveen.malviya@oracle.com
date: Thu Mar 23 11:36:00 2017 +0530
summary: amfd: choose CLM unlocked spare controller for standby role in failover situation[#2387]