#2387 amf: choose CLM unlocked spare controller for standby role in failover situation

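Milestone: 5.0.2
Status: fixed
Owner: Praveen
Labels: None
Type: defect
Component: amf
Part: d
Version: 5.2.RC1
Severity: major
Updated: 2017-03-23
Created: 2017-03-17
Creator: Ritu Raj
Private: No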

Environment details

OS: SUSE 64-bit
Changeset: 8701 (5.2.RC1)
6-node setup (3 controllers and 3 payloads, with SC_ABSENCE enabled)

Summary

choose CLM unlocked spare controller for standby role in failover situation

Steps followed & Observed behaviour

  1. Initially: SC-1 (ACTIVE), SC-2 (QUIESCED), SC-3 (STANDBY).
  2. Performed a clm_lock operation on the SC-2 (QUIESCED) controller.
  3. After that, performed a failover on the active controller (SC-1) by killing one director.
  4. Observed that SC-3 got the Active role while SC-2 got the Standby role, which is not expected because node SC-2 is in the clm_locked state.
  5. Later, SC-1 joined as a QUIESCED controller (after recovery from the failover).

Expected:
A clm_locked node should not get the standby role while it is in the locked state, and SC-1 should join as Standby after recovery from the failover.

Syslog:
Mar 17 17:56:59 suseR2-S2 osafimmnd[21809]: NO Implementer (applier) connected: 28 (@safSmf_applier1) <0, 2030f>
Mar 17 17:56:59 suseR2-S2 osafamfnd[21859]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO RDE role set to STANDBY
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO Peer up on node 0x2030f
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO Got peer info request from node 0x2030f with role ACTIVE
Mar 17 17:56:59 suseR2-S2 osafrded[21779]: NO Got peer info response from node 0x2030f with role ACTIVE
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 24 (change:3, dest:13)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 24 (change:5, dest:13)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 24 (change:5, dest:13)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 25 (change:3, dest:566317113647120)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: NO MDS event from svc_id 25 (change:3, dest:565213543063568)
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: IN AMF HA STANDBY request
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: IN Added IMMND node with dest 566317113647120
Mar 17 17:56:59 suseR2-S2 osafimmd[21798]: IN Added IMMND node with dest 565213543063568
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA saClmClusterNodeGet failed, rc=SA_AIS_ERR_UNAVAILABLE (31)
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA proc_mds_info: SMFND UP failed
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA saClmClusterNodeGet failed, rc=SA_AIS_ERR_UNAVAILABLE (31)
Mar 17 17:56:59 suseR2-S2 osafsmfd[21878]: WA proc_mds_info: SMFND UP failed

From Traces:

SC-2 left the cluster when the CLM lock operation was performed, and later SC-1 left the cluster when the failover was performed:

SC-2:
 Mar 17 17:54:24.123134 osafamfnd [6773:src/amf/amfnd/clm.cc:0196] >> clm_track_cb: '0' '4' '1'
Mar 17 17:54:24.123142 osafamfnd [6773:src/amf/amfnd/clm.cc:0217] TR Node has left the cluster 'safNode=SC-2,safCluster=myClmCluster', avnd_cb->first_time_up 0,notifItem->clusterNode.nodeId 131599, avnd_cb->node_info.nodeId 131343
-----
-----
SC-1:
 Mar 17 17:57:03.514477 osafamfnd [9266:src/amf/amfnd/clm.cc:0196] >> clm_track_cb: '0' '4' '1'
Mar 17 17:57:03.514484 osafamfnd [9266:src/amf/amfnd/clm.cc:0217] TR Node has left the cluster 'safNode=SC-1,safCluster=myClmCluster', avnd_cb->first_time_up 0,notifItem->clusterNode.nodeId 131343, avnd_cb->node_info.nodeId 131855
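
The traces print node IDs in decimal (131343, 131599, 131855), while the syslog prints them in hex (0x2030f). The following is a minimal sketch for lining the two up. `NodeIdToHex` and `SlotOf` are hypothetical helper names, not OpenSAF functions, and the slot position (bits 8 to 15) is an assumption inferred only from the IDs that appear in this ticket.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: format a decimal node ID from the traces in the
// lowercase hex notation used by the syslog lines (e.g. 0x2030f).
std::string NodeIdToHex(uint32_t node_id) {
  std::ostringstream out;
  out << "0x" << std::hex << node_id;
  return out.str();
}

// Hypothetical helper. Assumption: the controller number sits in
// bits 8..15 of the node ID, as the IDs in this ticket suggest
// (131343 is reported for SC-1 and 131599 for SC-2).
uint32_t SlotOf(uint32_t node_id) {
  return (node_id >> 8) & 0xffu;
}
```

With the values above, 131343 maps to 0x2010f (SC-1), 131599 to 0x2020f (SC-2), and 131855 to 0x2030f, the node that the SC-2 syslog reports as ACTIVE after the failover. The own-node ID in the first trace (131343) suggests it was captured on SC-1, and the own-node ID in the second trace (131855) suggests it was captured on SC-3.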

After the failover, SC-2 got the standby role and SC-3 the active role:

SC-2:
 Mar 17 17:56:59.941081 osafamfnd [21859:src/amf/amfnd/susm.cc:1043] NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Mar 17 17:56:59.941089 osafamfnd [21859:src/amf/amfnd/err.cc:1639] >> is_no_assignment_due_to_escalations
Mar 17 17:56:59.941097 osafamfnd [21859:src/amf/amfnd/err.cc:1651] << is_no_assignment_due_to_escalations: false
Mar 17 17:56:59.941104 osafamfnd [21859:src/amf/amfnd/di.cc:0829] >> avnd_di_susi_resp_send: Sending Resp su=safSu=SC-2,safSg=2N,safApp=OpenSAF, si=safSi=SC-2N,safApp=OpenSAF, curr_state=2, prv_state=0
Mar 17 17:56:59.941112 osafamfnd [21859:src/amf/amfnd/di.cc:0839] TR curr_assign_state '3
----
----

SC-3:
Mar 17 17:57:03.656105 osafamfnd [9266:src/amf/amfnd/susm.cc:1043] NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-3,safSg=2N,safApp=OpenSAF'
Mar 17 17:57:03.656113 osafamfnd [9266:src/amf/amfnd/err.cc:1639] >> is_no_assignment_due_to_escalations
Mar 17 17:57:03.656120 osafamfnd [9266:src/amf/amfnd/err.cc:1651] << is_no_assignment_due_to_escalations: false

Notes:
1. Syslog attached
2. amfd and amfnd traces of active, standby and spare controller attached

3 Attachments

Related

Tickets: #2387
Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

  • Praveen

    Praveen - 2017-03-20
    • status: unassigned --> assigned
    • assigned_to: Praveen
    • Part: - --> d
    • Milestone: 5.2.RC2 --> 5.0.2
     
  • Praveen

    Praveen - 2017-03-20

    Hi Ritu,
    I have analysed this issue. The problem occurs because SMF calls saClmClusterNodeGet() when it gets the standby assignment. The API call fails because the node is a non-member node. This problem was already identified while fixing #1781, and an enhancement ticket was raised for SMF: #1791 "smf: use CLM cluster tracking instead of reading per node up for SMFND". Since MW assignments are not affected on a CLM locked node, AMF giving a fresh standby role seems justified. The problem will be fixed when SMF ticket #1791 is implemented.
    However, this AMF ticket can be used for one purpose. In a failover situation, AMF changes the standby controller to active and then chooses a spare controller for fresh standby assignments. What I am observing is that even when multiple spare controllers are available, AMF still chooses the CLM locked spare controller for the fresh standby role. If one is available, AMF must choose a CLM unlocked spare controller for fresh standby assignments. This keeps alive the possibility of a controller role swap with the si-swap admin op.
    Please change the title of the ticket to "amf: choose CLM unlocked spare controller for standby role in failover situation."
    Thanks,
    Praveen
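
The selection rule proposed in the comment above can be sketched as follows. This is an illustration only, not the amfd change that was committed: `SpareController`, `clm_member`, and `ChooseStandbyCandidate` are hypothetical names, and falling back to a CLM locked spare when no unlocked one exists is an assumption based on the "if available" wording.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of a spare controller. This is not
// the data structure amfd actually uses.
struct SpareController {
  std::string name;
  bool clm_member;  // false when the node is CLM locked (non-member)
};

// Returns the spare that should receive the fresh standby assignment:
// the first CLM unlocked spare if there is one, otherwise the first
// spare (assumed fallback), or nullptr when there are no spares.
const SpareController* ChooseStandbyCandidate(
    const std::vector<SpareController>& spares) {
  for (const SpareController& spare : spares) {
    if (spare.clm_member) {
      return &spare;
    }
  }
  return spares.empty() ? nullptr : &spares.front();
}
```

For example, given a CLM locked SC-2 and an unlocked spare (called SC-4 here purely for illustration), the function returns the unlocked spare, so a later si-swap between the two controllers stays possible.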

     
  • Ritu Raj

    Ritu Raj - 2017-03-20
    • summary: clm_locked spare controller got standby role after failover --> amf: choose CLM unlocked spare controller for standby role in failover situation
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -4,7 +4,7 @@
     6 nodes setup(3 controller and 3 payload,  with SC_ABSENCE enabled)
    
     ###Summary
    -clm_locked spare controller got standby role after failover
    +choose CLM unlocked spare controller for standby role in failover situation
    
     ###Steps followed & Observed behaviour
     1. Initially SC-1 (ACTIVE), SC-2 (QUIESCED) , SC-3 (STANDBY) role
    
     
  • Praveen

    Praveen - 2017-03-20
    • status: assigned --> accepted
     
  • Praveen

    Praveen - 2017-03-21
    • status: accepted --> review
     
  • Praveen

    Praveen - 2017-03-23
    • status: review --> fixed
     
  • Praveen

    Praveen - 2017-03-23

    https://sourceforge.net/p/opensaf/mailman/message/35738800/

    changeset: 8712:a3ba6212ecf6
    branch: opensaf-5.1.x
    parent: 8707:4e47c66382f3
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Thu Mar 23 11:34:31 2017 +0530
    summary: amfd: choose CLM unlocked spare controller for standby role in failover situation[#2387]

    changeset: 8713:3a718e40acec
    branch: opensaf-5.0.x
    parent: 8708:9073359c83b4
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Thu Mar 23 11:35:07 2017 +0530
    summary: amfd: choose CLM unlocked spare controller for standby role in failover situation[#2387]

    changeset: 8714:ffb6233abe8b
    tag: tip
    parent: 8711:262d1f2132ca
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Thu Mar 23 11:36:00 2017 +0530
    summary: amfd: choose CLM unlocked spare controller for standby role in failover situation[#2387]

     

