Menu

#2061 smfd faulted on Active controller due to csiSetcallbackTimeout during si-swap operation

future
assigned
None
defect
smf
d
5.1.FC
major
False
2019-07-23
2016-09-22
Ritu Raj
No

Environment details

OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 3 nodes ( 2 controllers and 1 payloads with headless feature disabled & 1PBE with 100K objects)

Summary

smfd faulted on Active(Previous active) controller due to csiSetcallbackTimeout during si-swap operation

Steps followed & Observed behaviour

  1. Initiate si-swap operation from Active Controller, simultaneously killed osafsmfnd on STANDBY and osafckptnd on Payload(PL-3)
  2. Observed that, during role change smfd faulted on Active

From the traces, it is observed that:
** In the file "osaf/services/saf/smfsv/smfd/smfd_smfnd.c"
there is no TRY_AGAIN mechanism for below api

/* Find Clm info about the node */
        rc = saClmInitialize(&clmHandle, NULL, &clmVersion);
        if (rc != SA_AIS_OK) {
                LOG_ER("saClmInitialize failed, rc=%s", saf_error(rc));
                if (newNode) free(smfnd);
                pthread_mutex_unlock(&smfnd_list_lock);
                return NCSCC_RC_FAILURE;
        }

        /* Get Clm info about the node */
        SaClmClusterNodeT clmInfo;
        rc = saClmClusterNodeGet(clmHandle, i_node_id,
                                 10000000000LL, &clmInfo);
        if (rc != SA_AIS_OK) {
                LOG_ER("saClmClusterNodeGet failed, rc=%s", saf_error(rc));
                if (newNode) free(smfnd);
                rc = saClmFinalize(clmHandle);
                if (rc != SA_AIS_OK) {
                        LOG_ER("saClmFinalize failed, rc=%s", saf_error(rc));
                }
                pthread_mutex_unlock(&smfnd_list_lock);
                return NCSCC_RC_FAILURE;
        }

**Syslog :
Sep 22 14:15:05 fos1 osafimmnd[6164]: NO Implementer disconnected 17 <0, 2030f(down)> (MsgQueueService131855)
Sep 22 14:15:08 fos1 osafamfnd[6253]: NO 'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Sep 22 14:15:08 fos1 osafamfnd[6253]: ER safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Sep 22 14:15:08 fos1 osafamfnd[6253]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 22 14:15:08 fos1 opensaf_reboot: Rebooting local node; timeout=60
Sep 22 14:15:09 fos1 osafsmfd[6272]: ER saClmInitialize failed, rc=SA_AIS_ERR_TRY_AGAIN (6)
Sep 22 14:15:09 fos1 osafsmfd[6272]: WA proc_mds_info: SMFND UP failed

**Notes:
1. Syslog of controller's attached
2. smfd tarces attached

2 Attachments

Discussion

  • Ritu Raj

    Ritu Raj - 2016-09-22
    • summary: smfd faulted on Active controller due csiSetcallbackTimeout during si-swap operation --> smfd faulted on Active controller due to csiSetcallbackTimeout during si-swap operation
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -4,7 +4,7 @@
     Setup : 3 nodes ( 2 controllers and 1 payloads with headless feature disabled & 1PBE with 100K objects)
    
     # Summary
    -smfd faulted on Active(Previous active) controller due csiSetcallbackTimeout during si-swap operation 
    +smfd faulted on Active(Previous active) controller due to csiSetcallbackTimeout during si-swap operation 
    
     # Steps followed & Observed behaviour
     1. Initiate si-swap operation from Active Controller, simultaneously killed osafsmfnd on STANDBY and osafckptnd on Payload(PL-3) 
    @@ -14,6 +14,7 @@
     ** In the file "osaf/services/saf/smfsv/smfd/smfd_smfnd.c"
      there is no TRY_AGAIN mechanism for below api
    
    +~~~
     /* Find Clm info about the node */
             rc = saClmInitialize(&clmHandle, NULL, &clmVersion);
             if (rc != SA_AIS_OK) {
    @@ -37,6 +38,7 @@
                     pthread_mutex_unlock(&smfnd_list_lock);
                     return NCSCC_RC_FAILURE;
             }
    +~~~
    
     **Syslog :
    
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 4.7.2 --> future
     
  • Krishna Pawar

    Krishna Pawar - 2018-09-14
    • status: unassigned --> assigned
    • assigned_to: Krishna Pawar
    • Blocker: --> False
    • Milestone: future --> 5.18.09
     
  • Krishna Pawar

    Krishna Pawar - 2018-09-14

    I am able to reproduce the issue. Syslog below:

    Sep 14 18:53:31 krishna-VirtualBox osafamfd[6029]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
    Sep 14 18:53:31 krishna-VirtualBox osafamfnd[6045]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer locally disconnected. Marking it as doomed 34 <485, 2010f> (safSmfService)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 27 <336, 2010f> (safMsgGrpService)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 34 <485, 2010f> (safSmfService)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 35 <323, 2010f> (@safSmf_applier1)
    Sep 14 18:53:31 krishna-VirtualBox osafsmfd[6065]: NO MDS amf_quiesced_state_handler: smfd_mds_change_role()
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer locally disconnected. Marking it as doomed 33 <489, 2010f> (safLogService)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer locally disconnected. Marking it as doomed 32 <488, 2010f> (@safLogService_appl)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 33 <489, 2010f> (safLogService)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 32 <488, 2010f> (@safLogService_appl)
    Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 30 <328, 2010f> (safEvtService)
    Sep 14 18:53:31 krishna-VirtualBox osafsmfd[6065]: NO MDS smfd_mds_change_role: Setting; arg.info.vdest_chg_role.i_vdest = 0xf, ncsvda_api() rc = 1
    Sep 14 18:53:32 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 29 <14, 2010f> (safClmService)
    Sep 14 18:53:32 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 28 <343, 2010f> (safCheckPointService)
    Sep 14 18:53:32 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 31 <334, 2010f> (safLckService)
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: ER Alarm lost for safComp=CPND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: safSu=PL-3,safSg=NoRed,safApp=OpenSAF OperState ENABLED => DISABLED
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: WA State change notification lost for 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
    Sep 14 18:53:53 krishna-VirtualBox osafimmd[5966]: NO MDS event from svc_id 25 (change:4, dest:566314102483891)
    Sep 14 18:53:53 krishna-VirtualBox osafimmnd[5978]: NO Global discard node received for nodeId:2030f pid:3936
    Sep 14 18:53:53 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: NO Node 'PL-3' left the cluster
    Sep 14 18:53:53 krishna-VirtualBox osafclmd[6018]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 14 18:53:53 krishna-VirtualBox osafclmd[6018]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 14 18:53:53 krishna-VirtualBox osafclmd[6018]: safNode=PL-3,safCluster=myClmCluster LEFT, init view=3, cluster view=4
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: ER Alarm lost for safComp=CPND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: safSu=PL-3,safSg=NoRed,safApp=OpenSAF PresenceState INSTANTIATED => UNINSTANTIATED
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: WA State change notification lost for 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: safSu=PL-3,safSg=NoRed,safApp=OpenSAF ReadinessState IN_SERVICE => OUT_OF_SERVICE
    Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: ER Alarm lost for safSi=NoRed3,safApp=OpenSAF
    Sep 14 18:54:01 krishna-VirtualBox kernel: [22703.023603] Disabling bearer <eth:eth0>
    Sep 14 18:54:01 krishna-VirtualBox osafamfnd[6045]: ER safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
    Sep 14 18:54:01 krishna-VirtualBox osafamfnd[6045]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
    Sep 14 18:54:01 krishna-VirtualBox opensaf_reboot: Rebooting local node; timeout=60</eth:eth0>

     
  • Krishna Pawar

    Krishna Pawar - 2018-09-14

    The analysis is :
    csi set cbk for quisced comes at smfd at 18:53:31, smfd doesn't send response to Amf from amf_quiesced_state_handler because it can only sent it after it receive quisced ack from mds.
    <141>1 2018-09-14T18:53:31.67077+05:30 osafsmfd [meta sequenceId="1309"] 6065:smf/smfd/smfd_amf.cc:153 NO MDS amf_quiesced_state_handler: smfd_mds_change_role()

    Check the code in amf_csi_set_callback():
    if (new_haState == SA_AMF_HA_QUIESCED)
    / AMF response will be done later when MDS quiesced ack has been received
    /
    goto done;

    So, smfd didn't sent response to amf yet. It is waiting for Mds quisced ack from mds.

    But, smfnd down and up event come to Smfd before mds_quisced ack comes to Smfd.
    And Smfd stuck in saClmInitialize() api call and never comes out of clma_startup() beccause there is no trace of leaving clma_startup(). This is the root cause of amf csi set cbk timeout as Smfd never sent the response to amf as it is stuck there in saClmInitialize().

    <143>1 2018-09-14T18:53:31.906809+05:30 osafsmfd [meta sequenceId="1319"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2020f, svc version 2 role 1
    <143>1 2018-09-14T18:53:31.906875+05:30 osafsmfd [meta sequenceId="1320"] 6065:smf/smfd/smfd_smfnd.c:210 TR SMFND DOWN for node id 2020f
    <143>1 2018-09-14T18:53:31.906883+05:30 osafsmfd [meta sequenceId="1321"] 6065:smf/smfd/smfd_smfnd.c:228 TR SMFND state updated for node [safNode=SC-2,safCluster=myClmCluster], id [2020f]
    <143>1 2018-09-14T18:53:31.954053+05:30 osafsmfd [meta sequenceId="1322"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 3 for nodeid 2020f, svc version 2 role 1
    <143>1 2018-09-14T18:53:31.954085+05:30 osafsmfd [meta sequenceId="1323"] 6065:smf/smfd/smfd_smfnd.c:124 TR SMFND UP for node id 2020f, version 2
    <143>1 2018-09-14T18:53:31.954092+05:30 osafsmfd [meta sequenceId="1324"] 6065:clm/agent/clma_api.cc:538 >> saClmInitialize
    <143>1 2018-09-14T18:53:31. [meta sequenceId="1325"] 6065:clm/agent/clma_api.cc:630 >> clmainitialize
    <143>1 2018-09-14T18:53:31.954099+05:30 osafsmfd [meta sequenceId="1326"] 6065:clm/agent/clma_util.cc:99 >> clma_startup: clma_use_count: 0
    <143>1 2018-09-14T18:53:31.954117+05:30 osafsmfd [meta sequenceId="1327"] 6065:clm/agent/clma_mds.cc:1116 >> clma_mds_init
    <143>1 2018-09-14T18:53:31.95585+05:30 osafsmfd [meta sequenceId="1328"] 6065:clm/agent/clma_mds.cc:1164 << clma_mds_init
    <143>1 2018-09-14T18:53:53.204569+05:30 osafsmfd [meta sequenceId="1329"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2030f, svc version 2 role 1
    <143>1 2018-09-14T18:54:01.774991+05:30 osafsmfd [meta sequenceId="1330"] 6070:smf/smfd/smfd_mds.c:294 TR SMFD SVC event 6 for nodeid 2020f, svc version 1 role 2
    <143>1 2018-09-14T18:54:01.775047+05:30 osafsmfd [meta sequenceId="1331"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2020f, svc version 2 role 1

     
  • Krishna Pawar

    Krishna Pawar - 2018-09-14

    It is stuck at:
    saClmInitialize->clmainitialize->clma_startup->clma_create() {

    Here--> osaf_poll_one_fd(m_GET_FD_FROM_SEL_OBJ(clma_cb.clms_sync_sel), 30000);
    }

    Since Clmd is in quisced mode, so until other Clmd doesn't become Act, it will not come out.
    Clmd will not become Active until Smfd doesn't respond to quisced.

     
  • Krishna Pawar

    Krishna Pawar - 2018-09-14

    This is chicken-egg issue.

     
  • Gary Lee

    Gary Lee - 2018-09-29
    • Milestone: 5.18.09 --> 5.18.12
     
  • MeenakshiTK

    MeenakshiTK - 2018-11-16
    • status: assigned --> unassigned
    • assigned_to: Krishna Pawar --> nobody
     
  • MeenakshiTK

    MeenakshiTK - 2018-11-16
    • status: unassigned --> assigned
    • assigned_to: MeenakshiTK
     
  • Gary Lee

    Gary Lee - 2019-01-09
    • Milestone: 5.19.01 --> 5.19.03
     
  • Gary Lee

    Gary Lee - 2019-03-26
    • Milestone: 5.19.03 --> 5.19.06
     
  • Gary Lee

    Gary Lee - 2019-07-23
    • Milestone: 5.19.07 --> future
     

Log in to post a comment.