
#2821 smfd: Incorrect handling of SMFND NCSMDS_UP/DOWN events

5.18.04
fixed
nobody
None
defect
smf
d
major
False
2018-06-08
2018-03-27
No

SMFD's current handling of SMFND NCSMDS_UP/DOWN events is problematic, as shown below.

    case NCSMDS_UP:

        if (mds_info->svc_id == NCSMDS_SVC_ID_SMFD) {
            return;
        }

        if (mds_info->svc_id == NCSMDS_SVC_ID_SMFND) {
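            /* NOTE: no_of_smfnd is incremented only when smfnd_up() succeeds */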
            if (smfnd_up(mds_info->node_id, mds_info->dest,
                     mds_info->rem_svc_pvt_ver) == SA_AIS_OK)
                cb->no_of_smfnd++;
            else
                LOG_WA("%s: SMFND UP failed", __FUNCTION__);
        }
        break;

    case NCSMDS_DOWN:

        if (mds_info->svc_id == NCSMDS_SVC_ID_SMFD) {
            return;
        }

        if (mds_info->svc_id == NCSMDS_SVC_ID_SMFND) {
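            /* NOTE: no_of_smfnd is always decremented here, even if the
             * corresponding UP event handling had failed */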
            smfnd_down(mds_info->node_id);
            cb->no_of_smfnd--;
        }
        break;

A possible failure while handling the SMFND UP event is not accounted for when handling the DOWN event: 'no_of_smfnd' is not incremented when smfnd_up() fails, but it is always decremented for a DOWN event. This can eventually leave SMFD with an incorrect view of the actual number of SMFND nodes in the cluster.
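
One possible way to keep the counter consistent is to decrement it only when the DOWN event refers to a node whose UP event was actually accepted. The following is a minimal sketch only, not necessarily the committed fix, and it assumes a hypothetical change where smfnd_down() reports whether the node had been registered by a successful UP:

    case NCSMDS_DOWN:

        if (mds_info->svc_id == NCSMDS_SVC_ID_SMFND) {
            /* Hypothetical: smfnd_down() returns true only if the node was
             * registered by an earlier, successful UP event. */
            if (smfnd_down(mds_info->node_id))
                cb->no_of_smfnd--;
            else
                LOG_WA("%s: SMFND DOWN for unregistered node", __FUNCTION__);
        }
        break;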

One observed consequence of this incorrect handling concerns campaign callback invocations, where SMFD uses 'no_of_smfnd' as the number of expected consolidated callback responses. If 'no_of_smfnd' is too low, one or more callback responses can be missed, potentially leading to incorrect campaign execution.
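
To make the failure mode concrete, the standalone snippet below (illustrative only, not the actual SmfCallback.cc/smfd_evt.c logic) simulates three SMFND instances answering a consolidated callback while SMFD waits for an undercounted number of responses; the late failure response is never consumed, so the consolidated result stays OK:

    #include <stdio.h>

    /* Illustrative response codes, not taken from saAis.h. */
    enum { RSP_OK = 0, RSP_FAILED = 1 };

    int main(void)
    {
        /* Three real SMFND nodes respond; the last one reports a failure. */
        int responses[] = { RSP_OK, RSP_OK, RSP_FAILED };
        int real_nodes = 3;
        int expected = 2;   /* no_of_smfnd undercounted by the UP/DOWN bug */
        int result = RSP_OK;

        for (int received = 0; received < expected && received < real_nodes;
             ++received) {
            if (responses[received] != RSP_OK)
                result = responses[received];
        }

        /* The failing third response is never consumed, so the campaign is
         * reported as passed even though one node rejected the callback. */
        printf("consolidated result: %s\n", result == RSP_OK ? "OK" : "FAILED");
        return 0;
    }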

The issue can be reproduced with the steps below and the attached files.
The attached 'campaign_failed_callback.xml' contains a callback label handled by 'smf_demo_agent' (smf_demo.c), which waits for 10 s and responds with ERR_FAILED_OPERATION, so the campaign is expected to fail. With a miscalculated 'no_of_smfnd', however, the campaign unexpectedly passes.

  • Start up the cluster (2SC+nPL).
  • Lock a PL ClmNode (e.g. clm-adm -o lock safNode=PL-3,safCluster=myClmCluster).
  • Restart SMFND on the locked PL node (e.g. amf-adm restart safComp=SMFND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF) -> SMFD will fail to handle the SMFND UP event.
  • Unlock the previously locked PL ClmNode (e.g. clm-adm -o unlock safNode=PL-3,safCluster=myClmCluster).
  • Restart SMFND on the same PL again -> no_of_smfnd will be miscalculated.
  • Build and run the 'smf_demo_agent' on a PL.
  • Run the attached campaign -> the campaign will pass (though it is expected to fail).

===osafsmfd traces===

<143>1 2018-03-27T11:21:49.17754+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2252"] 843:smf/smfd/SmfUpgradeStep.cc:2554 >> checkAndInvokeCallback 
<143>1 2018-03-27T11:21:49.17755+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2253"] 843:smf/smfd/SmfCallback.cc:76 >> execute 
<143>1 2018-03-27T11:21:49.17756+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2254"] 843:smf/smfd/SmfCallback.cc:108 >> send_callback_msg: callback invoked atAction 0
<143>1 2018-03-27T11:21:49.177571+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2255"] 843:smf/smfd/SmfCallback.cc:165 T2 cbk label c_str() SMF_DEMO, size 8
<143>1 2018-03-27T11:21:49.177582+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2256"] 843:smf/smfd/SmfCallback.cc:206 T2 stringToPass CALLBACK_FAILED
<143>1 2018-03-27T11:21:49.177592+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2257"] 843:smf/smfd/SmfCallback.cc:212 T2 dn safSmfStep=0001,safSmfProc=Callback_Return_Failure,safSmfCampaign=failed_callback, size 81
<143>1 2018-03-27T11:21:49.177602+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2258"] 843:smf/smfd/SmfCallback.cc:259 TR before mds send 
<143>1 2018-03-27T11:21:49.177771+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2259"] 843:smf/smfd/SmfCallback.cc:268 T2 mds send successful, rc=1
.....
<143>1 2018-03-27T11:21:49.178932+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2277"] 342:smf/smfd/smfd_evt.c:175 >> proc_callback_rsp 
<143>1 2018-03-27T11:21:49.178935+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2278"] 342:smf/smfd/smfd_evt.c:188 T2 Received evt_type: 1, inv_id: 1, err: 1
<143>1 2018-03-27T11:21:49.178938+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2279"] 342:smf/smfd/smfd_evt.c:193 T2 found the node with inv_id: 1
<143>1 2018-03-27T11:21:49.178941+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2280"] 342:smf/smfd/smfd_evt.c:205 TR last response received, cleaning up the node
<143>1 2018-03-27T11:21:49.17895+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2281"] 342:smf/smfd/smfd_evt.c:243 << proc_callback_rsp 
<143>1 2018-03-27T11:21:49.179221+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2282"] 843:smf/smfd/SmfCallback.cc:276 T2 before poll, fds[0].fd = 40
<143>1 2018-03-27T11:21:49.179242+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2283"] 843:smf/smfd/SmfCallback.cc:303 T2 Got response for the inv_id 1, err 1
<143>1 2018-03-27T11:21:49.179254+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2284"] 843:smf/smfd/SmfCallback.cc:347 << send_callback_msg 
<143>1 2018-03-27T11:21:49.179265+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2285"] 843:smf/smfd/SmfCallback.cc:79 << execute: rc = SA_AIS_OK (1)
<143>1 2018-03-27T11:21:49.179276+07:00 SC-1 osafsmfd 342 osafsmfd [meta sequenceId="2286"] 843:smf/smfd/SmfUpgradeStep.cc:0 << checkAndInvokeCallback

===osafsmfnd traces on PL with smf_demo_agent===

<143>1 2018-03-27T11:21:59.181472+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="28"] 203:smf/smfnd/smfnd_evt.c:536 >> proc_cbk_req_rsp 
<143>1 2018-03-27T11:21:59.181528+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="29"] 203:smf/smfnd/smfnd_evt.c:505 >> smfnd_cbk_resp_proc 
<143>1 2018-03-27T11:21:59.181553+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="30"] 203:smf/smfnd/smfnd_evt.c:398 >> smfnd_cbk_resp_err_proc: SMFND: Received ERR resp for the inv_id: 1.
<143>1 2018-03-27T11:21:59.181755+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="31"] 203:smf/smfnd/smfnd_evt.c:430 << smfnd_cbk_resp_err_proc 
<143>1 2018-03-27T11:21:59.181809+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="32"] 203:smf/smfnd/smfnd_evt.c:518 << smfnd_cbk_resp_proc 
<143>1 2018-03-27T11:21:59.181833+07:00 PL-4 osafsmfnd 203 osafsmfnd [meta sequenceId="33"] 203:smf/smfnd/smfnd_evt.c:552 << proc_cbk_req_rsp 

Related

Wiki: ChangeLog-5.18.04

Discussion

  • Nguyen TK Luu - 2018-03-27
    • status: unassigned --> assigned
    • assigned_to: Nguyen TK Luu
     
  • Nguyen TK Luu - 2018-03-28
    • status: assigned --> accepted
     
  • Nguyen TK Luu - 2018-03-28
    • status: accepted --> review
     
  • Nguyen TK Luu - 2018-04-10
    • status: review --> fixed
     
  • Nguyen TK Luu - 2018-04-10

    commit 1c302a300e449e8a8527671fbd6c7f4e2b41e95d (origin/develop)
    Author: Nguyen Luu <nguyen.tk.luu@dektech.com.au>
    Date: Wed Mar 28 11:34:52 2018 +0700

    smfd: Fix incorrect handling of SMFND NCSMDS_UP/DOWN events [#2821]
    
    Current handling of SMFND DOWN event does not take into account failed
    SMFND UP event, which could eventually result in an inexact view of the
    actual number of SMFND nodes in the cluster if, for example, a node
    happened to be DOWN and UP twice, and the first UP event somehow failed.
    
     
  • Nguyen TK Luu - 2018-06-08
    • assigned_to: Nguyen TK Luu --> nobody
     
