#2547 amfd: payload cannot join cluster

5.17.11
fixed
Gary Lee
None
defect
amf
d
major
False
2017-10-30
2017-08-09
Gary Lee
No

If a payload is stopped and restarted quickly, sometimes it will not be able to re-join the cluster.

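CLM and MDS events are delivered to the main thread via separate paths. In the trace below, an MDS DOWN event for PL-4 arrives out of order, after the CLM JOIN for the restarted node, so the subsequent node_up is rejected with "invalid node ID".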

Jul 27 11:45:15.259963 osafamfd [264:264:src/clm/agent/clma_api.c:0829] >> saClmDispatch
Jul 27 11:45:15.260082 osafamfd [264:264:src/amf/amfd/clm.cc:0222] >> clm_track_cb: '0' '4' '1'
Jul 27 11:45:15.260103 osafamfd [264:264:src/amf/amfd/clm.cc:0238] TR numberOfMembers:'4', numberOfItems:'1'
Jul 27 11:45:15.260121 osafamfd [264:264:src/amf/amfd/clm.cc:0244] TR i = 0, node:'safNode=PL-4,safCluster=myClmCluster', clusterChange:3
Jul 27 11:45:15.260133 osafamfd [264:264:src/amf/amfd/clm.cc:0299] TR  Node Left: rootCauseEntity safNode=PL-4,safCluster=myClmCluster for node 132111

Jul 27 11:45:15.279492 osafamfd [264:264:src/clm/agent/clma_api.c:0829] >> saClmDispatch
Jul 27 11:45:15.279574 osafamfd [264:264:src/amf/amfd/clm.cc:0222] >> clm_track_cb: '0' '4' '1'
Jul 27 11:45:15.279581 osafamfd [264:264:src/amf/amfd/clm.cc:0238] TR numberOfMembers:'5', numberOfItems:'1'
Jul 27 11:45:15.279589 osafamfd [264:264:src/amf/amfd/clm.cc:0244] TR i = 0, node:'safNode=PL-4,safCluster=myClmCluster', clusterChange:2
Jul 27 11:45:15.279609 osafamfd [264:264:src/amf/amfd/node.cc:0052] TR added node 132111
Jul 27 11:45:15.279620 osafamfd [264:264:src/amf/amfd/clm.cc:0380] TR Node Joined 'safNode=PL-4,safCluster=myClmCluster' '36'

Jul 27 11:45:15.287973 osafamfd [264:264:src/amf/amfd/main.cc:0770] >> process_event: evt->rcv_evt 21
Jul 27 11:45:15.287979 osafamfd [264:264:src/amf/amfd/ndfsm.cc:0771] >> avd_mds_avnd_down_evh: 2040f, 0x55c93b1dfda0
Jul 27 11:45:15.287986 osafamfd [264:264:src/amf/amfd/ndproc.cc:1219] >> avd_node_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
Jul 27 11:45:15.287991 osafamfd [264:264:src/amf/amfd/ndfsm.cc:1110] >> avd_node_mark_absent

Jul 27 11:45:15.785245 osafamfd [264:264:src/amf/amfd/ndfsm.cc:0296] >> avd_node_up_evh: from 2040f, safAmfNode=PL-4,safAmfCluster=myAmfCluster
Jul 27 11:45:15.785261 osafamfd [264:264:src/amf/amfd/ndfsm.cc:0363] TR invalid node ID (2040f)
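
The ordering problem in the trace can be reproduced with a small model. This is only a sketch under stated assumptions, not the amfd code: `AmfdBeforeFix`, its methods, and `restart_succeeds` are hypothetical names, and the only behaviour taken from this ticket is that MDS DOWN processing deleted the node from node_id_db and that node_up is rejected for an unknown node ID.

```cpp
#include <cstdint>
#include <set>

// Simplified, hypothetical model of amfd before the fix. The struct and
// method names are illustrative and do not mirror the real amfd sources.
struct AmfdBeforeFix {
  std::set<std::uint32_t> node_id_db;  // stand-in for amfd's node_id_db

  // CLM JOIN callback: the node becomes known to amfd.
  void clm_join(std::uint32_t node_id) { node_id_db.insert(node_id); }

  // MDS DOWN handler: before the fix, this also dropped the node ID.
  void mds_down(std::uint32_t node_id) { node_id_db.erase(node_id); }

  // node_up is honoured only for known node IDs; otherwise the message
  // is rejected ("invalid node ID" in the trace).
  bool node_up(std::uint32_t node_id) const {
    return node_id_db.count(node_id) != 0;
  }
};

// Replays a quick restart of one node. When mds_down_is_late is true, the
// stale MDS DOWN is processed after the CLM JOIN, as in the trace above.
// Returns true if the restarted node's node_up is accepted.
inline bool restart_succeeds(std::uint32_t node_id, bool mds_down_is_late) {
  AmfdBeforeFix amfd;
  amfd.clm_join(node_id);  // first incarnation of the node is a member
  if (mds_down_is_late) {
    amfd.clm_join(node_id);  // CLM JOIN of the restarted node comes first
    amfd.mds_down(node_id);  // stale MDS DOWN then removes the node ID
  } else {
    amfd.mds_down(node_id);  // expected order: MDS DOWN, then CLM JOIN
    amfd.clm_join(node_id);
  }
  return amfd.node_up(node_id);
}
```

In this model, `restart_succeeds(0x2040f, true)` returns false and `restart_succeeds(0x2040f, false)` returns true. The node ID 0x2040f is the hexadecimal form of 132111 seen in the trace.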

Related

Wiki: ChangeLog-5.17.11

Discussion

  • Gary Lee - 2017-08-14
    • status: accepted --> review

  • Gary Lee - 2017-08-29
    • status: review --> fixed

  • Gary Lee - 2017-08-29

    develop:

    commit f921600fa2affd69e898a8beb0848c75924cfae1
    Author: Gary Lee <gary.lee@dektech.com.au>
    Date: Tue Aug 29 13:42:50 2017 +1000

    amfd: postpone deletion of node from node_id_db [#2547]
    
    CLM and MDS callbacks are delivered to the main thread via different paths.
    If a node is restarted quickly, sometimes CLM JOIN is processed before the
    prior MDS down. This means the node will not be able to join the cluster
    as it is not in node_id_db (deleted in MDS down processing).
    
    This patch ensures addition to, and removal from node_id_db is only done
    from CLM callbacks to avoid race conditions such as above.
    

    release:

    commit ea6e888f075d2eb365f22b90755fa4ef248c934d
    Author: Gary Lee <gary.lee@dektech.com.au>
    Date: Tue Aug 29 13:42:50 2017 +1000

    amfd: postpone deletion of node from node_id_db [#2547]
    
    CLM and MDS callbacks are delivered to the main thread via different paths.
    If a node is restarted quickly, sometimes CLM JOIN is processed before the
    prior MDS down. This means the node will not be able to join the cluster
    as it is not in node_id_db (deleted in MDS down processing).
    
    This patch ensures addition to, and removal from node_id_db is only done
    from CLM callbacks to avoid race conditions such as above.
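
The idea described in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `AmfdAfterFix`, its methods, the boolean "node is up" flag, and `replay_ticket_trace` are all hypothetical. The only point carried over from the commit message is that node_id_db is added to and removed from in CLM callbacks only, so a late MDS DOWN can no longer delete the entry of a node that has already rejoined.

```cpp
#include <cstdint>
#include <map>

// Hypothetical model of the patched behaviour; names are illustrative only.
class AmfdAfterFix {
 public:
  // Only CLM callbacks add to or remove from node_id_db.
  void clm_join(std::uint32_t node_id) { node_id_db_[node_id] = false; }
  void clm_leave(std::uint32_t node_id) { node_id_db_.erase(node_id); }

  // MDS DOWN marks the node absent but leaves its node_id_db entry alone.
  void mds_down(std::uint32_t node_id) {
    auto it = node_id_db_.find(node_id);
    if (it != node_id_db_.end()) it->second = false;
  }

  // node_up is accepted for any node ID that CLM currently reports as joined.
  bool node_up(std::uint32_t node_id) {
    auto it = node_id_db_.find(node_id);
    if (it == node_id_db_.end()) return false;  // unknown node ID: rejected
    it->second = true;
    return true;
  }

 private:
  std::map<std::uint32_t, bool> node_id_db_;  // node ID -> AMF node is up
};

// Replays the event order from this ticket: CLM LEFT, CLM JOIN, a late
// MDS DOWN, then node_up from the restarted node.
inline bool replay_ticket_trace(std::uint32_t node_id) {
  AmfdAfterFix amfd;
  amfd.clm_join(node_id);   // first incarnation joins
  amfd.node_up(node_id);    // and comes up
  amfd.clm_leave(node_id);  // node is stopped
  amfd.clm_join(node_id);   // restarted node rejoins quickly
  amfd.mds_down(node_id);   // stale MDS DOWN arrives late
  return amfd.node_up(node_id);
}
```

In this model, `replay_ticket_trace(0x2040f)` returns true: the late MDS DOWN only resets the "up" flag, and the entry created by the CLM JOIN is still present when node_up arrives.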
    
     

    Last edit: Gary Lee 2017-08-29
