Configuration:
- Cluster with 10 SCs, allow sc absence.
- Split cluster to four partitions: [[SC-1, SC-2], [SC-3, SC-4, SC-5, SC-6], [SC-7, SC-8, SC-9], [SC-10]]
- Role of SCs after network splits: [[SC-1(ACT), SC-2(STB)], [SC-3(ACT), SC-4, SC-5(STB), SC-6], [SC-7(STB), SC-8(ATC), SC-9], [SC-10(ATC)]]
- Network merges
Observation:
- All active and standby SCs rebooted due to split brain detected except SCs in partition 1. The SCs in partition 1 don't reboot because the active and standby SCs in other partitions rebooted too fast.
- Ntf agent in SC1 fails to send notification to ntf server and it will not recover.
<143>1 2021-09-04T07:11:56.225221+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136348"] 439:amf/amfd/imm.cc:419 >> execute
<143>1 2021-09-04T07:11:56.225223+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136349"] 439:amf/amfd/ntf.cc:804 >> exec: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:11:56.225227+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136350"] 439:amf/amfd/ntf.cc:491 >> avd_try_send_notification: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:11:56.225231+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136351"] 439:ntf/agent/ntfa_api.c:2016 >> saNtfNotificationSend
<143>1 2021-09-04T07:11:56.225235+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136352"] 439:ntf/agent/ntfa_api.c:62 TR NTFS server is unavailable
<143>1 2021-09-04T07:11:56.225238+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136353"] 439:ntf/agent/ntfa_api.c:2260 << saNtfNotificationSend
<143>1 2021-09-04T07:11:56.225241+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136354"] 439:amf/amfd/ntf.cc:513 TR Notification Send unsuccesful TRY_AGAIN or TIMEOUT rc:6
<143>1 2021-09-04T07:11:56.225243+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136355"] 439:amf/amfd/ntf.cc:532 << avd_try_send_notification
<143>1 2021-09-04T07:11:56.225246+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136356"] 439:amf/amfd/ntf.cc:811 TR TRY-AGAIN
<143>1 2021-09-04T07:11:56.225249+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136357"] 439:amf/amfd/ntf.cc:822 << exec
<143>1 2021-09-04T07:11:56.225252+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136358"] 439:amf/amfd/imm.cc:427 << execute: 2
...
<143>1 2021-09-04T07:26:00.185418+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191040"] 439:amf/amfd/imm.cc:419 >> execute
<143>1 2021-09-04T07:26:00.185465+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191041"] 439:amf/amfd/ntf.cc:804 >> exec: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:26:00.185475+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191042"] 439:amf/amfd/ntf.cc:491 >> avd_try_send_notification: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:26:00.185485+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191043"] 439:ntf/agent/ntfa_api.c:2016 >> saNtfNotificationSend
<143>1 2021-09-04T07:26:00.185497+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191044"] 439:ntf/agent/ntfa_api.c:59 TR NTFS server is down
<143>1 2021-09-04T07:26:00.185505+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191045"] 439:ntf/agent/ntfa_api.c:2260 << saNtfNotificationSend
<143>1 2021-09-04T07:26:00.185513+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191046"] 439:amf/amfd/ntf.cc:513 TR Notification Send unsuccesful TRY_AGAIN or TIMEOUT rc:6
<143>1 2021-09-04T07:26:00.18552+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191047"] 439:amf/amfd/ntf.cc:532 << avd_try_send_notification
<143>1 2021-09-04T07:26:00.185528+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191048"] 439:amf/amfd/ntf.cc:811 TR TRY-AGAIN
<143>1 2021-09-04T07:26:00.185536+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191049"] 439:amf/amfd/ntf.cc:822 << exec
<143>1 2021-09-04T07:26:00.185543+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191050"] 439:amf/amfd/imm.cc:427 << execute: 2
Reason:
- Before the active SC-3 rebooted, SC-1 still had enough time to connect to the ntf server in SC-3. When SC-3 rebooted, the ntf agent in SC-1 received a NO_ACTIVE message of NTFS service. Actually, the ntf server in SC-1 is still in active state.
- The following mds log is generated by the opensaf code which applied the patch dump_mds_5.21.06.patch.
<142>1 2021-09-04T07:10:34.384094+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307776"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.3841+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307777"] MDTM: Received SVC event
<142>1 2021-09-04T07:10:34.384102+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307778"] MDTM: No entry in ref tbl so dropping the recd event
<142>1 2021-09-04T07:10:34.384105+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307779"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.384111+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307780"] MDTM: Received SVC event
<143>1 2021-09-04T07:10:34.384113+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307781"] >> mds_mcm_svc_up
<143>1 2021-09-04T07:10:34.384115+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307782"] MCM:API: LOCAL SVC INFO : svc_id = NTFA(29) | PWE id = 1 | VDEST id = 65535 |
<143>1 2021-09-04T07:10:34.384117+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307783"] MCM:API: REMOTE SVC INFO : svc_id = NTFS(28) | PWE id = 1 | VDEST id = 14 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2021-09-04T07:10:34.38412+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307784"] >> mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.384122+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307785"] << mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.384127+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307786"] >> mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.38413+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307787"] << mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.384132+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307788"] >> mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.384135+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307789"] << mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.384138+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307790"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.384141+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307791"] MDS:DB: Subscription Result not present
<143>1 2021-09-04T07:10:34.384144+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307792"] << mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.384146+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307793"] >> mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.384149+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307794"] << mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.384152+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307795"] >> mds_subtn_res_tbl_add
<143>1 2021-09-04T07:10:34.384155+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307796"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.384267+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307831"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2010f0000019d
<143>1 2021-09-04T07:10:34.384269+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307832"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.384272+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307833"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.384287+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307838"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.38429+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307839"] MDS:DB: adest_details: <rem_node[0x2030f]:dest_pid[398]>
<143>1 2021-09-04T07:10:34.384298+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307840"] << get_subtn_adest_details
<143>1 2021-09-04T07:10:34.384301+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307841"] MDS:DB: sub_adest_details: <rem_node[0x2030f]:dest_pid[398]>
<143>1 2021-09-04T07:10:34.384304+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307842"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.384462+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307877"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2030f0000018e
<143>1 2021-09-04T07:10:34.384465+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307878"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.384468+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307879"] sub_svc_id 28 adest 2030f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.384471+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307880"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.38449+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307885"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.384493+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307886"] << mds_subtn_res_tbl_add
<143>1 2021-09-04T07:10:34.384496+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307887"] << mds_mcm_svc_up
<142>1 2021-09-04T07:10:34.396969+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308753"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.396975+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308754"] MDTM: Received SVC event
<143>1 2021-09-04T07:10:34.396979+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308755"] >> mds_mcm_svc_down
<143>1 2021-09-04T07:10:34.396983+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308756"] MCM:API: LOCAL SVC INFO : svc_id = NTFA(29) | PWE id = 1 | VDEST id = 65535 |
<143>1 2021-09-04T07:10:34.396986+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308757"] MCM:API: REMOTE SVC INFO : svc_id = NTFS(28) | PWE id = 1 | VDEST id = 14 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2021-09-04T07:10:34.396989+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308758"] >> mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.396992+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308759"] << mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.396995+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308760"] >> mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.396998+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308761"] << mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.397001+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308762"] >> mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.397003+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308763"] << mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.397006+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308764"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.39701+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308765"] << mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.397014+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308766"] >> mds_subtn_res_tbl_get_by_adest
<143>1 2021-09-04T07:10:34.397016+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308767"] << mds_subtn_res_tbl_get_by_adest
<143>1 2021-09-04T07:10:34.397019+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308768"] >> mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.397022+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308769"] << mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.397025+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308770"] >> mds_subtn_res_tbl_del
<143>1 2021-09-04T07:10:34.397029+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308771"] >> dump_subt_res_tree...
<143>1 2021-09-04T07:10:34.397157+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308805"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2030f0000018e
<143>1 2021-09-04T07:10:34.397159+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308806"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.397162+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308807"] sub_svc_id 28 adest 2030f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.397165+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308808"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.397186+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308813"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.397189+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308814"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.397314+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308848"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2030f0000018e
<143>1 2021-09-04T07:10:34.397321+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308849"] sub_svc_id 28 adest 2020f0000018e vdest_id 14
<143>1 2021-09-04T07:10:34.397325+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308850"] sub_svc_id 28 adest 2010f0000019d vdest_id 14
...
<143>1 2021-09-04T07:10:34.39735+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308855"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.397353+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308856"] << mds_subtn_res_tbl_del
<143>1 2021-09-04T07:10:34.397356+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308857"] >> mds_subtn_res_tbl_getnext_any
<143>1 2021-09-04T07:10:34.397365+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308858"] << mds_subtn_res_tbl_getnext_any
<143>1 2021-09-04T07:10:34.39737+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308859"] >> mds_subtn_res_tbl_remove_active
<143>1 2021-09-04T07:10:34.397373+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308860"] MCM_DB:RemoveActive:TimerStart:AwaitActiveTmri:Hdl=0xfc300003:SvcHdl=562945658454045:sbscr-svcid=NTFS(28),vdest=14
<143>1 2021-09-04T07:10:34.397377+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308861"] << mds_subtn_res_tbl_remove_active
<143>1 2021-09-04T07:10:34.397379+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308862"] >> mds_mcm_user_event_callback
<143>1 2021-09-04T07:10:34.397382+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308863"] >> mds_svc_tbl_get
<143>1 2021-09-04T07:10:34.397385+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308864"] << mds_svc_tbl_get
<143>1 2021-09-04T07:10:34.397387+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308865"] >> mds_subtn_tbl_get
<143>1 2021-09-04T07:10:34.397395+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308866"] << mds_subtn_tbl_get
<143>1 2021-09-04T07:10:34.397423+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308867"] << mds_mcm_user_event_callback
<142>1 2021-09-04T07:10:34.397425+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308868"] MCM:API: svc_down : svc_id = NTFA(29) on DEST id = 65535 got NO_ACTIVE for svc_id = NTFS(28) on Vdest id = 14 Adest = <rem_node[0x2030f]:dest_pid[398]>, rem_svc_pvt_ver=1
<143>1 2021-09-04T07:10:34.397429+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308869"] << mds_mcm_svc_down
When there is a conflict between active MxN vdest, we should keep current active. Only change the active entry when current active vdest is down.
commit 203ff838c39910902eb16965e964a580b0cffe3b (HEAD -> develop, origin/develop, ticket-3281)
Author: hieu.h.hoang hieu.h.hoang@dektech.com.au
Date: Mon Oct 18 08:59:03 2021 +0700