OpenSAF / Tickets / #2547 amfd: payload cannot join cluster

#2547 amfd: payload cannot join cluster

Milestone: 5.17.11

Status: fixed

Owner: Gary Lee

Labels: None

Type: defect

Component: amf

Part: d

Version:

Priority: major

Blocker: False

Updated: 2017-10-30

Created: 2017-08-09

Creator: Gary Lee

Private: No

If a payload is stopped and restarted quickly, it sometimes cannot rejoin the cluster.

CLM and MDS events are delivered to the main thread via separate pathways. In the trace below, an MDS DOWN event arrives out of order, after CLM JOIN, and the later node_up from PL-4 is rejected with "invalid node ID".

Jul 27 11:45:15.259963 osafamfd [264:264:src/clm/agent/clma_api.c:0829] >> saClmDispatch
Jul 27 11:45:15.260082 osafamfd [264:264:src/amf/amfd/clm.cc:0222] >> clm_track_cb: '0' '4' '1'
Jul 27 11:45:15.260103 osafamfd [264:264:src/amf/amfd/clm.cc:0238] TR numberOfMembers:'4', numberOfItems:'1'
Jul 27 11:45:15.260121 osafamfd [264:264:src/amf/amfd/clm.cc:0244] TR i = 0, node:'safNode=PL-4,safCluster=myClmCluster', clusterChange:3
Jul 27 11:45:15.260133 osafamfd [264:264:src/amf/amfd/clm.cc:0299] TR  Node Left: rootCauseEntity safNode=PL-4,safCluster=myClmCluster for node 132111

Jul 27 11:45:15.279492 osafamfd [264:264:src/clm/agent/clma_api.c:0829] >> saClmDispatch
Jul 27 11:45:15.279574 osafamfd [264:264:src/amf/amfd/clm.cc:0222] >> clm_track_cb: '0' '4' '1'
Jul 27 11:45:15.279581 osafamfd [264:264:src/amf/amfd/clm.cc:0238] TR numberOfMembers:'5', numberOfItems:'1'
Jul 27 11:45:15.279589 osafamfd [264:264:src/amf/amfd/clm.cc:0244] TR i = 0, node:'safNode=PL-4,safCluster=myClmCluster', clusterChange:2
Jul 27 11:45:15.279609 osafamfd [264:264:src/amf/amfd/node.cc:0052] TR added node 132111
Jul 27 11:45:15.279620 osafamfd [264:264:src/amf/amfd/clm.cc:0380] TR Node Joined 'safNode=PL-4,safCluster=myClmCluster' '36'

Jul 27 11:45:15.287973 osafamfd [264:264:src/amf/amfd/main.cc:0770] >> process_event: evt->rcv_evt 21
Jul 27 11:45:15.287979 osafamfd [264:264:src/amf/amfd/ndfsm.cc:0771] >> avd_mds_avnd_down_evh: 2040f, 0x55c93b1dfda0
Jul 27 11:45:15.287986 osafamfd [264:264:src/amf/amfd/ndproc.cc:1219] >> avd_node_failover: 'safAmfNode=PL-4,safAmfCluster=myAmfCluster'
Jul 27 11:45:15.287991 osafamfd [264:264:src/amf/amfd/ndfsm.cc:1110] >> avd_node_mark_absent

Jul 27 11:45:15.785245 osafamfd [264:264:src/amf/amfd/ndfsm.cc:0296] >> avd_node_up_evh: from 2040f, safAmfNode=PL-4,safAmfCluster=myAmfCluster
Jul 27 11:45:15.785261 osafamfd [264:264:src/amf/amfd/ndfsm.cc:0363] TR invalid node ID (2040f)
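
The ordering problem in the trace can be reproduced with a very small model. This is only a sketch under stated assumptions: NodeIdDb, on_clm_join, on_mds_down and on_node_up are hypothetical names, not amfd functions, and the only behaviour modelled is what the ticket states (CLM JOIN adds the node, MDS DOWN deletes it from node_id_db, node_up is rejected when the node ID is unknown).

```cpp
#include <cstdint>
#include <set>

namespace race_sketch {

// Hypothetical stand-in for amfd's node_id_db: the set of known node IDs.
using NodeIdDb = std::set<std::uint32_t>;

// Simplified pre-fix handlers (illustrative names, not the real amfd API).
void on_clm_join(NodeIdDb& db, std::uint32_t node_id) { db.insert(node_id); }
void on_mds_down(NodeIdDb& db, std::uint32_t node_id) { db.erase(node_id); }

// node_up is accepted only if the sender's node ID is in the database.
bool on_node_up(const NodeIdDb& db, std::uint32_t node_id) {
  return db.count(node_id) != 0;
}

constexpr std::uint32_t kPl4 = 0x2040f;  // node 132111 in the trace

// Order seen in the trace: CLM JOIN, then the late MDS DOWN, then node_up.
bool replay_observed_order() {
  NodeIdDb db{kPl4};
  on_clm_join(db, kPl4);        // 11:45:15.279 "added node 132111"
  on_mds_down(db, kPl4);        // 11:45:15.287 removes the fresh entry
  return on_node_up(db, kPl4);  // 11:45:15.785 rejected
}

// Order amfd implicitly relied on: MDS DOWN first, then CLM JOIN.
bool replay_expected_order() {
  NodeIdDb db{kPl4};
  on_mds_down(db, kPl4);
  on_clm_join(db, kPl4);
  return on_node_up(db, kPl4);
}

}  // namespace race_sketch
```

In this model, race_sketch::replay_observed_order() returns false, which corresponds to the "invalid node ID (2040f)" line, while race_sketch::replay_expected_order() returns true.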

Gary Lee - 2017-08-14

status: accepted --> review

Gary Lee - 2017-08-29

status: review --> fixed

develop:

commit f921600fa2affd69e898a8beb0848c75924cfae1
Author: Gary Lee <gary.lee@dektech.com.au>
Date: Tue Aug 29 13:42:50 2017 +1000

amfd: postpone deletion of node from node_id_db [#2547]

CLM and MDS callbacks are delivered to the main thread via different paths.
If a node is restarted quickly, sometimes CLM JOIN is processed before the
prior MDS down. This means the node will not be able to join the cluster
as it is not in node_id_db (deleted in MDS down processing).

This patch ensures addition to, and removal from node_id_db is only done
from CLM callbacks to avoid race conditions such as above.

release:

commit ea6e888f075d2eb365f22b90755fa4ef248c934d
Author: Gary Lee <gary.lee@dektech.com.au>
Date: Tue Aug 29 13:42:50 2017 +1000

amfd: postpone deletion of node from node_id_db [#2547]

CLM and MDS callbacks are delivered to the main thread via different paths.
If a node is restarted quickly, sometimes CLM JOIN is processed before the
prior MDS down. This means the node will not be able to join the cluster
as it is not in node_id_db (deleted in MDS down processing).

This patch ensures addition to, and removal from node_id_db is only done
from CLM callbacks to avoid race conditions such as above.
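
The idea in the commit message can be illustrated with a minimal model. It is a sketch, not the actual patch: AmfdModel and the on_* handlers are hypothetical names, and the "absent" set is an assumed placeholder for whatever state MDS down processing still updates. The one property taken from the commit message is that node_id_db is changed only from CLM callbacks.

```cpp
#include <cstdint>
#include <set>

namespace fixed_sketch {

// Hypothetical model of the relevant amfd state.
struct AmfdModel {
  std::set<std::uint32_t> node_id_db;  // stand-in for amfd's node_id_db
  std::set<std::uint32_t> absent;      // assumed: nodes marked absent
};

// Only CLM callbacks add to or remove from node_id_db.
void on_clm_join(AmfdModel& m, std::uint32_t id) { m.node_id_db.insert(id); }
void on_clm_left(AmfdModel& m, std::uint32_t id) { m.node_id_db.erase(id); }

// MDS down still marks the node absent but no longer touches node_id_db.
void on_mds_down(AmfdModel& m, std::uint32_t id) { m.absent.insert(id); }

// node_up is accepted only for node IDs present in node_id_db.
bool on_node_up(AmfdModel& m, std::uint32_t id) {
  if (m.node_id_db.count(id) == 0) return false;  // "invalid node ID"
  m.absent.erase(id);
  return true;
}

constexpr std::uint32_t kPl4 = 0x2040f;  // node 132111 in the trace

// Same event order as the trace: CLM LEFT, CLM JOIN, late MDS DOWN, node_up.
bool replay_ticket_sequence() {
  AmfdModel m;
  m.node_id_db.insert(kPl4);
  on_clm_left(m, kPl4);
  on_clm_join(m, kPl4);
  on_mds_down(m, kPl4);  // arrives late, but node_id_db is unaffected
  return on_node_up(m, kPl4);
}

}  // namespace fixed_sketch
```

With a single owner for node_id_db updates, the relative order of the CLM and MDS events no longer decides whether the entry exists, so fixed_sketch::replay_ticket_sequence() returns true in this model.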

Last edit: Gary Lee 2017-08-29
