The current handling of SMFD for SMFND MDS_UP/DOWN events is problematic, as shown below.
case NCSMDS_UP: if (mds_info->svc_id == NCSMDS_SVC_ID_SMFD) { return; } if (mds_info->svc_id == NCSMDS_SVC_ID_SMFND) { if (smfnd_up(mds_info->node_id, mds_info->dest, mds_info->rem_svc_pvt_ver) == SA_AIS_OK) cb->no_of_smfnd++; else LOG_WA("%s: SMFND UP failed", __FUNCTION__); } break; case NCSMDS_DOWN: if (mds_info->svc_id == NCSMDS_SVC_ID_SMFD) { return; } if (mds_info->svc_id == NCSMDS_SVC_ID_SMFND) { smfnd_down(mds_info->node_id); cb->no_of_smfnd--; } break;
Possible failure when handling SMFND UP event is not accounted for when handling the DOWN event; specifically the 'no_of_smfnd' is not incremented for failed UP events, but is always decremented for DOWN events. This could eventually result in an incorrect view of the actual number of SMFND nodes in the cluster.
An observed issue from this incorrect handling is with campaign callback invocations, in which SMFD uses the 'no_of_smfnd' as the number of expected consolidated callback responses. One or more callback responses can be missed if the 'no_of_smfnd' is miscalculated, potentially leading to wrong campaign execution.
The issue can be reproduced with below steps and attached files.
The attached 'campaign_failed_callback.xml' has a callback label for 'smf_demo_agent' (smf_demo.c) which will wait for 10s and respond with ERR_FAILED_OPERATION, so the campaign is expected to fail. But in case of the miscalculation of 'no_of_smfnd', the campaign will unexpectedly pass.
===osafsmfd traces===
<143>1 2018-03-27T11:21:49.17754+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2252"] 843:smf/smfd/SmfUpgradeStep.cc:2554 >> checkAndInvokeCallback <143>1 2018-03-27T11:21:49.17755+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2253"] 843:smf/smfd/SmfCallback.cc:76 >> execute <143>1 2018-03-27T11:21:49.17756+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2254"] 843:smf/smfd/SmfCallback.cc:108 >> send_callback_msg: callback invoked atAction 0 <143>1 2018-03-27T11:21:49.177571+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2255"] 843:smf/smfd/SmfCallback.cc:165 T2 cbk label c_str() SMF_DEMO, size 8 <143>1 2018-03-27T11:21:49.177582+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2256"] 843:smf/smfd/SmfCallback.cc:206 T2 stringToPass CALLBACK_FAILED <143>1 2018-03-27T11:21:49.177592+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2257"] 843:smf/smfd/SmfCallback.cc:212 T2 dn safSmfStep=0001,safSmfProc=Callback_Return_Failure,safSmfCampaign=failed_callback, size 81 <143>1 2018-03-27T11:21:49.177602+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2258"] 843:smf/smfd/SmfCallback.cc:259 TR before mds send <143>1 2018-03-27T11:21:49.177771+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2259"] 843:smf/smfd/SmfCallback.cc:268 T2 mds send successful, rc=1 ..... <143>1 2018-03-27T11:21:49.178932+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2277"] 342:smf/smfd/smfd_evt.c:175 >> proc_callback_rsp <143>1 2018-03-27T11:21:49.178935+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2278"] 342:smf/smfd/smfd_evt.c:188 T2 Received evt_type: 1, inv_id: 1, err: 1 <143>1 2018-03-27T11:21:49.178938+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2279"] 342:smf/smfd/smfd_evt.c:193 T2 found the node with inv_id: 1 <143>1 2018-03-27T11:21:49.178941+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2280"] 342:smf/smfd/smfd_evt.c:205 TR last response received, cleaning up the node <143>1 2018-03-27T11:21:49.17895+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2281"] 342:smf/smfd/smfd_evt.c:243 << proc_callback_rsp <143>1 2018-03-27T11:21:49.179221+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2282"] 843:smf/smfd/SmfCallback.cc:276 T2 before poll, fds[0].fd = 40 <143>1 2018-03-27T11:21:49.179242+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2283"] 843:smf/smfd/SmfCallback.cc:303 T2 Got response for the inv_id 1, err 1 <143>1 2018-03-27T11:21:49.179254+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2284"] 843:smf/smfd/SmfCallback.cc:347 << send_callback_msg <143>1 2018-03-27T11:21:49.179265+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2285"] 843:smf/smfd/SmfCallback.cc:79 << execute: rc = SA_AIS_OK (1) <143>1 2018-03-27T11:21:49.179276+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2286"] 843:smf/smfd/SmfUpgradeStep.cc:0 << checkAndInvokeCallback
===osafsmfnd traces on PL with smf_demo_agent===
<143>1 2018-03-27T11:21:59.181472+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="28"] 203:smf/smfnd/smfnd_evt.c:536 >> proc_cbk_req_rsp <143>1 2018-03-27T11:21:59.181528+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="29"] 203:smf/smfnd/smfnd_evt.c:505 >> smfnd_cbk_resp_proc <143>1 2018-03-27T11:21:59.181553+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="30"] 203:smf/smfnd/smfnd_evt.c:398 >> smfnd_cbk_resp_err_proc: SMFND: Received ERR resp for the inv_id: 1. <143>1 2018-03-27T11:21:59.181755+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="31"] 203:smf/smfnd/smfnd_evt.c:430 << smfnd_cbk_resp_err_proc <143>1 2018-03-27T11:21:59.181809+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="32"] 203:smf/smfnd/smfnd_evt.c:518 << smfnd_cbk_resp_proc <143>1 2018-03-27T11:21:59.181833+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="33"] 203:smf/smfnd/smfnd_evt.c:552 << proc_cbk_req_rsp
commit 1c302a300e449e8a8527671fbd6c7f4e2b41e95d (origin/develop)
Author: Nguyen Luu nguyen.tk.luu@dektech.com.au
Date: Wed Mar 28 11:34:52 2018 +0700