OS : Suse 64bit
Changeset : 7997 (5.1.FC)
Setup : 3 nodes (2 controllers and 1 payload, headless feature disabled; 1 PBE with 100K objects)
smfd faulted on the active (previously active) controller due to csiSetcallbackTimeout during an si-swap operation
From the traces, it is observed that:
** In the file "osaf/services/saf/smfsv/smfd/smfd_smfnd.c"
there is no TRY_AGAIN retry mechanism for the APIs below:
/* Find Clm info about the node */
rc = saClmInitialize(&clmHandle, NULL, &clmVersion);
if (rc != SA_AIS_OK) {
        LOG_ER("saClmInitialize failed, rc=%s", saf_error(rc));
        if (newNode)
                free(smfnd);
        pthread_mutex_unlock(&smfnd_list_lock);
        return NCSCC_RC_FAILURE;
}

/* Get Clm info about the node */
SaClmClusterNodeT clmInfo;
rc = saClmClusterNodeGet(clmHandle, i_node_id, 10000000000LL, &clmInfo);
if (rc != SA_AIS_OK) {
        LOG_ER("saClmClusterNodeGet failed, rc=%s", saf_error(rc));
        if (newNode)
                free(smfnd);
        rc = saClmFinalize(clmHandle);
        if (rc != SA_AIS_OK) {
                LOG_ER("saClmFinalize failed, rc=%s", saf_error(rc));
        }
        pthread_mutex_unlock(&smfnd_list_lock);
        return NCSCC_RC_FAILURE;
}
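A minimal sketch of such a retry mechanism, assuming an illustrative cap of 10 attempts and a 100 ms back-off (the helper name and the values are hypothetical, not existing OpenSAF code):

#include <unistd.h>   /* usleep */
#include <saAis.h>
#include <saClm.h>

/* Hypothetical helper: retry saClmInitialize() a bounded number of times
 * when CLM answers SA_AIS_ERR_TRY_AGAIN, instead of failing on the first
 * attempt. */
static SaAisErrorT clm_initialize_with_retry(SaClmHandleT *clmHandle,
                                             SaVersionT *clmVersion)
{
        SaAisErrorT rc = SA_AIS_ERR_TRY_AGAIN;
        int attempts;

        for (attempts = 0; attempts < 10 && rc == SA_AIS_ERR_TRY_AGAIN;
             attempts++) {
                rc = saClmInitialize(clmHandle, NULL, clmVersion);
                if (rc == SA_AIS_ERR_TRY_AGAIN)
                        usleep(100 * 1000); /* 100 ms between attempts */
        }
        return rc;
}

Note that in this scenario a retry loop alone cannot break the deadlock analyzed below: as long as smfd has not answered the quiesced CSI-set callback, clmd cannot go active and every attempt keeps returning TRY_AGAIN, so the retries must be bounded and must not block the thread that delivers the MDS quiesced ack.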
**Syslog:
Sep 22 14:15:05 fos1 osafimmnd[6164]: NO Implementer disconnected 17 <0, 2030f(down)> (MsgQueueService131855)
Sep 22 14:15:08 fos1 osafamfnd[6253]: NO 'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Sep 22 14:15:08 fos1 osafamfnd[6253]: ER safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Sep 22 14:15:08 fos1 osafamfnd[6253]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 22 14:15:08 fos1 opensaf_reboot: Rebooting local node; timeout=60
Sep 22 14:15:09 fos1 osafsmfd[6272]: ER saClmInitialize failed, rc=SA_AIS_ERR_TRY_AGAIN (6)
Sep 22 14:15:09 fos1 osafsmfd[6272]: WA proc_mds_info: SMFND UP failed
**Notes:
1. Syslogs of the controllers are attached
2. smfd traces are attached
Diff:
I am able to reproduce the issue. Syslog below:
Sep 14 18:53:31 krishna-VirtualBox osafamfd[6029]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
Sep 14 18:53:31 krishna-VirtualBox osafamfnd[6045]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer locally disconnected. Marking it as doomed 34 <485, 2010f> (safSmfService)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 27 <336, 2010f> (safMsgGrpService)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 34 <485, 2010f> (safSmfService)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 35 <323, 2010f> (@safSmf_applier1)
Sep 14 18:53:31 krishna-VirtualBox osafsmfd[6065]: NO MDS amf_quiesced_state_handler: smfd_mds_change_role()
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer locally disconnected. Marking it as doomed 33 <489, 2010f> (safLogService)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer locally disconnected. Marking it as doomed 32 <488, 2010f> (@safLogService_appl)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 33 <489, 2010f> (safLogService)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 32 <488, 2010f> (@safLogService_appl)
Sep 14 18:53:31 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 30 <328, 2010f> (safEvtService)
Sep 14 18:53:31 krishna-VirtualBox osafsmfd[6065]: NO MDS smfd_mds_change_role: Setting; arg.info.vdest_chg_role.i_vdest = 0xf, ncsvda_api() rc = 1
Sep 14 18:53:32 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 29 <14, 2010f> (safClmService)
Sep 14 18:53:32 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 28 <343, 2010f> (safCheckPointService)
Sep 14 18:53:32 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 31 <334, 2010f> (safLckService)
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: ER Alarm lost for safComp=CPND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: safSu=PL-3,safSg=NoRed,safApp=OpenSAF OperState ENABLED => DISABLED
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: WA State change notification lost for 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
Sep 14 18:53:53 krishna-VirtualBox osafimmd[5966]: NO MDS event from svc_id 25 (change:4, dest:566314102483891)
Sep 14 18:53:53 krishna-VirtualBox osafimmnd[5978]: NO Global discard node received for nodeId:2030f pid:3936
Sep 14 18:53:53 krishna-VirtualBox osafimmnd[5978]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: NO Node 'PL-3' left the cluster
Sep 14 18:53:53 krishna-VirtualBox osafclmd[6018]: NO Node 131855 went down. Not sending track callback for agents on that node
Sep 14 18:53:53 krishna-VirtualBox osafclmd[6018]: NO Node 131855 went down. Not sending track callback for agents on that node
Sep 14 18:53:53 krishna-VirtualBox osafclmd[6018]: safNode=PL-3,safCluster=myClmCluster LEFT, init view=3, cluster view=4
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: ER Alarm lost for safComp=CPND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: safSu=PL-3,safSg=NoRed,safApp=OpenSAF PresenceState INSTANTIATED => UNINSTANTIATED
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: WA State change notification lost for 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF'
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: safSu=PL-3,safSg=NoRed,safApp=OpenSAF ReadinessState IN_SERVICE => OUT_OF_SERVICE
Sep 14 18:53:53 krishna-VirtualBox osafamfd[6029]: ER Alarm lost for safSi=NoRed3,safApp=OpenSAF
Sep 14 18:54:01 krishna-VirtualBox kernel: [22703.023603] Disabling bearer <eth:eth0>
Sep 14 18:54:01 krishna-VirtualBox osafamfnd[6045]: ER safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Sep 14 18:54:01 krishna-VirtualBox osafamfnd[6045]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 14 18:54:01 krishna-VirtualBox opensaf_reboot: Rebooting local node; timeout=60
The analysis is:
The CSI-set callback for QUIESCED arrives at smfd at 18:53:31. smfd does not send a response to AMF from amf_quiesced_state_handler() because it can only send it after it receives the quiesced ack from MDS.
<141>1 2018-09-14T18:53:31.67077+05:30 osafsmfd [meta sequenceId="1309"] 6065:smf/smfd/smfd_amf.cc:153 NO MDS amf_quiesced_state_handler: smfd_mds_change_role()
Check the code in amf_csi_set_callback():
if (new_haState == SA_AMF_HA_QUIESCED)
        /* AMF response will be done later when MDS quiesced ack has
         * been received */
        goto done;
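For reference, a minimal sketch of the other half of this deferred-response pattern, assuming illustrative control-block field names (cb->amf_hdl, cb->amf_invocation, cb->ha_state are placeholders, not necessarily the exact smfd identifiers): when the MDS quiesced ack arrives, the stored invocation is answered and the HA state change completes.

/* Sketch: on the MDS quiesced ack, answer the CSI-set callback that was
 * deferred in amf_csi_set_callback() so AMF can complete the si-swap.
 * Field names are illustrative. */
static void on_mds_quiesced_ack(smfd_cb_t *cb)
{
        saAmfResponse(cb->amf_hdl, cb->amf_invocation, SA_AIS_OK);
        cb->ha_state = SA_AMF_HA_QUIESCED;
}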
So, smfd has not yet sent the response to AMF; it is waiting for the MDS quiesced ack.
But the SMFND down and up events reach smfd before the MDS quiesced ack does.
And smfd gets stuck in the saClmInitialize() API call and never comes out of clma_startup(): there is no trace of it leaving clma_startup(). This is the root cause of the AMF CSI-set callback timeout, as smfd never sent the response to AMF while stuck in saClmInitialize().
<143>1 2018-09-14T18:53:31.906809+05:30 osafsmfd [meta sequenceId="1319"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2020f, svc version 2 role 1
<143>1 2018-09-14T18:53:31.906875+05:30 osafsmfd [meta sequenceId="1320"] 6065:smf/smfd/smfd_smfnd.c:210 TR SMFND DOWN for node id 2020f
<143>1 2018-09-14T18:53:31.906883+05:30 osafsmfd [meta sequenceId="1321"] 6065:smf/smfd/smfd_smfnd.c:228 TR SMFND state updated for node [safNode=SC-2,safCluster=myClmCluster], id [2020f]
<143>1 2018-09-14T18:53:31.954053+05:30 osafsmfd [meta sequenceId="1322"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 3 for nodeid 2020f, svc version 2 role 1
<143>1 2018-09-14T18:53:31.954085+05:30 osafsmfd [meta sequenceId="1323"] 6065:smf/smfd/smfd_smfnd.c:124 TR SMFND UP for node id 2020f, version 2
<143>1 2018-09-14T18:53:31.954092+05:30 osafsmfd [meta sequenceId="1324"] 6065:clm/agent/clma_api.cc:538 >> saClmInitialize
<143>1 2018-09-14T18:53:31. [meta sequenceId="1325"] 6065:clm/agent/clma_api.cc:630 >> clmainitialize
<143>1 2018-09-14T18:53:31.954099+05:30 osafsmfd [meta sequenceId="1326"] 6065:clm/agent/clma_util.cc:99 >> clma_startup: clma_use_count: 0
<143>1 2018-09-14T18:53:31.954117+05:30 osafsmfd [meta sequenceId="1327"] 6065:clm/agent/clma_mds.cc:1116 >> clma_mds_init
<143>1 2018-09-14T18:53:31.95585+05:30 osafsmfd [meta sequenceId="1328"] 6065:clm/agent/clma_mds.cc:1164 << clma_mds_init
<143>1 2018-09-14T18:53:53.204569+05:30 osafsmfd [meta sequenceId="1329"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2030f, svc version 2 role 1
<143>1 2018-09-14T18:54:01.774991+05:30 osafsmfd [meta sequenceId="1330"] 6070:smf/smfd/smfd_mds.c:294 TR SMFD SVC event 6 for nodeid 2020f, svc version 1 role 2
<143>1 2018-09-14T18:54:01.775047+05:30 osafsmfd [meta sequenceId="1331"] 6070:smf/smfd/smfd_mds.c:294 TR SMFND SVC event 4 for nodeid 2020f, svc version 2 role 1
It is stuck at:
saClmInitialize() -> clmainitialize() -> clma_startup() -> clma_create() {
        /* Here --> */
        osaf_poll_one_fd(m_GET_FD_FROM_SEL_OBJ(clma_cb.clms_sync_sel), 30000);
}
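For context, osaf_poll_one_fd() blocks the calling thread until the fd becomes readable or the timeout (here 30000 ms) expires. A minimal sketch of what that wait amounts to, assuming plain poll() semantics (sync_fd stands for m_GET_FD_FROM_SEL_OBJ(clma_cb.clms_sync_sel)):

#include <poll.h>

/* Sketch of the wait in clma_create(): the thread is pinned here until
 * CLMS answers the initialize sync or 30 s elapse; the real OpenSAF
 * helper adds error/EINTR handling on top of poll(). */
static void wait_for_clms_sync(int sync_fd)
{
        struct pollfd pfd = { .fd = sync_fd, .events = POLLIN };
        poll(&pfd, 1, 30000); /* timeout in milliseconds */
}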
Since clmd is in quiesced mode, the poll will not return until the other clmd becomes active.
But clmd will not become active until smfd responds to the quiesced assignment.
This is a chicken-and-egg deadlock.
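One hypothetical way to break the cycle, sketched only under the assumption that the CLM lookup can be deferred (pending_clm_lookup_add() is an invented placeholder, not an existing OpenSAF function, and this is not the committed fix): keep the MDS event path non-blocking so the thread stays free to process the MDS quiesced ack and answer AMF.

#include <stdint.h>
#include <saClm.h>

#define NCSCC_RC_SUCCESS 1 /* OpenSAF return code, per ncsgl_defs.h */

/* Invented placeholder: queue the node id so the CLM node-info lookup can
 * be performed later, with TRY_AGAIN handling, outside the MDS event path. */
static void pending_clm_lookup_add(SaClmNodeIdT node_id);

/* Hypothetical restructuring of the SMFND UP handling: record the node and
 * defer the CLM lookup instead of blocking on saClmInitialize() here. */
static uint32_t smfnd_up_deferred(SaClmNodeIdT i_node_id)
{
        pending_clm_lookup_add(i_node_id);
        return NCSCC_RC_SUCCESS;
}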