
#3281 mds: Sending wrong NO_ACTIVE after split brain detection

5.22.01
fixed
None
defect
mds
-
major
False
2022-01-23
2021-09-06
No

Configuration:
- Cluster with 10 SCs, with SC absence allowed.
- Split the cluster into four partitions: [[SC-1, SC-2], [SC-3, SC-4, SC-5, SC-6], [SC-7, SC-8, SC-9], [SC-10]]
- Roles of the SCs after the network splits: [[SC-1(ACT), SC-2(STB)], [SC-3(ACT), SC-4, SC-5(STB), SC-6], [SC-7(STB), SC-8(ACT), SC-9], [SC-10(ACT)]]
- Network merges

Observation:
- All active and standby SCs rebooted after detecting the split brain, except the SCs in partition 1. The SCs in partition 1 do not reboot because the active and standby SCs in the other partitions rebooted too fast for partition 1 to detect the split brain.
- The ntf agent on SC-1 fails to send notifications to the ntf server and never recovers.

<143>1 2021-09-04T07:11:56.225221+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136348"] 439:amf/amfd/imm.cc:419 >> execute 
<143>1 2021-09-04T07:11:56.225223+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136349"] 439:amf/amfd/ntf.cc:804 >> exec: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:11:56.225227+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136350"] 439:amf/amfd/ntf.cc:491 >> avd_try_send_notification: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:11:56.225231+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136351"] 439:ntf/agent/ntfa_api.c:2016 >> saNtfNotificationSend 
<143>1 2021-09-04T07:11:56.225235+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136352"] 439:ntf/agent/ntfa_api.c:62 TR NTFS server is unavailable
<143>1 2021-09-04T07:11:56.225238+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136353"] 439:ntf/agent/ntfa_api.c:2260 << saNtfNotificationSend 
<143>1 2021-09-04T07:11:56.225241+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136354"] 439:amf/amfd/ntf.cc:513 TR Notification Send unsuccesful TRY_AGAIN or TIMEOUT rc:6
<143>1 2021-09-04T07:11:56.225243+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136355"] 439:amf/amfd/ntf.cc:532 << avd_try_send_notification 
<143>1 2021-09-04T07:11:56.225246+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136356"] 439:amf/amfd/ntf.cc:811 TR TRY-AGAIN
<143>1 2021-09-04T07:11:56.225249+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136357"] 439:amf/amfd/ntf.cc:822 << exec 
<143>1 2021-09-04T07:11:56.225252+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="136358"] 439:amf/amfd/imm.cc:427 << execute: 2
...
<143>1 2021-09-04T07:26:00.185418+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191040"] 439:amf/amfd/imm.cc:419 >> execute 
<143>1 2021-09-04T07:26:00.185465+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191041"] 439:amf/amfd/ntf.cc:804 >> exec: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:26:00.185475+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191042"] 439:amf/amfd/ntf.cc:491 >> avd_try_send_notification: Ntf Type:3000, sent status:0
<143>1 2021-09-04T07:26:00.185485+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191043"] 439:ntf/agent/ntfa_api.c:2016 >> saNtfNotificationSend 
<143>1 2021-09-04T07:26:00.185497+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191044"] 439:ntf/agent/ntfa_api.c:59 TR NTFS server is down
<143>1 2021-09-04T07:26:00.185505+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191045"] 439:ntf/agent/ntfa_api.c:2260 << saNtfNotificationSend 
<143>1 2021-09-04T07:26:00.185513+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191046"] 439:amf/amfd/ntf.cc:513 TR Notification Send unsuccesful TRY_AGAIN or TIMEOUT rc:6
<143>1 2021-09-04T07:26:00.18552+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191047"] 439:amf/amfd/ntf.cc:532 << avd_try_send_notification 
<143>1 2021-09-04T07:26:00.185528+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191048"] 439:amf/amfd/ntf.cc:811 TR TRY-AGAIN
<143>1 2021-09-04T07:26:00.185536+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191049"] 439:amf/amfd/ntf.cc:822 << exec 
<143>1 2021-09-04T07:26:00.185543+02:00 SC-1 osafamfd 439 osafamfd [meta sequenceId="191050"] 439:amf/amfd/imm.cc:427 << execute: 2

Reason:
- Before the active SC-3 rebooted, SC-1 still had enough time to connect to the ntf server on SC-3. When SC-3 rebooted, the ntf agent on SC-1 received a NO_ACTIVE event for the NTFS service, even though the ntf server on SC-1 was still active.
- The following mds log was generated by OpenSAF code with the patch dump_mds_5.21.06.patch applied.

  • The ntf agent on SC-1 detects that the ntf server on SC-3 is up. MDS updates the active destination of the NTFS service to SC-3.
<142>1 2021-09-04T07:10:34.384094+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307776"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.3841+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307777"] MDTM: Received SVC event
<142>1 2021-09-04T07:10:34.384102+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307778"] MDTM: No entry in ref tbl so dropping the recd event
<142>1 2021-09-04T07:10:34.384105+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307779"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.384111+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307780"] MDTM: Received SVC event
<143>1 2021-09-04T07:10:34.384113+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307781"] >> mds_mcm_svc_up
<143>1 2021-09-04T07:10:34.384115+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307782"] MCM:API: LOCAL SVC INFO  : svc_id = NTFA(29) | PWE id = 1 | VDEST id = 65535 |
<143>1 2021-09-04T07:10:34.384117+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307783"] MCM:API: REMOTE SVC INFO : svc_id = NTFS(28) | PWE id = 1 | VDEST id = 14 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2021-09-04T07:10:34.38412+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307784"] >> mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.384122+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307785"] << mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.384127+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307786"] >> mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.38413+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307787"] << mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.384132+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307788"] >> mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.384135+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307789"] << mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.384138+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307790"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.384141+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307791"] MDS:DB: Subscription Result not present
<143>1 2021-09-04T07:10:34.384144+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307792"] << mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.384146+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307793"] >> mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.384149+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307794"] << mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.384152+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307795"] >> mds_subtn_res_tbl_add
<143>1 2021-09-04T07:10:34.384155+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307796"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.384267+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307831"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2010f0000019d
<143>1 2021-09-04T07:10:34.384269+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307832"] sub_svc_id 28 adest 2020f0000018e vdest_id 14 
<143>1 2021-09-04T07:10:34.384272+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307833"] sub_svc_id 28 adest 2010f0000019d vdest_id 14 
...
<143>1 2021-09-04T07:10:34.384287+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307838"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.38429+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307839"] MDS:DB: adest_details: <rem_node[0x2030f]:dest_pid[398]> 
<143>1 2021-09-04T07:10:34.384298+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307840"] << get_subtn_adest_details
<143>1 2021-09-04T07:10:34.384301+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307841"] MDS:DB: sub_adest_details: <rem_node[0x2030f]:dest_pid[398]>
<143>1 2021-09-04T07:10:34.384304+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307842"] >> dump_subt_res_tree
... 
<143>1 2021-09-04T07:10:34.384462+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307877"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2030f0000018e
<143>1 2021-09-04T07:10:34.384465+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307878"] sub_svc_id 28 adest 2020f0000018e vdest_id 14 
<143>1 2021-09-04T07:10:34.384468+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307879"] sub_svc_id 28 adest 2030f0000018e vdest_id 14 
<143>1 2021-09-04T07:10:34.384471+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307880"] sub_svc_id 28 adest 2010f0000019d vdest_id 14 
...
<143>1 2021-09-04T07:10:34.38449+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307885"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.384493+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307886"] << mds_subtn_res_tbl_add
<143>1 2021-09-04T07:10:34.384496+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="307887"] << mds_mcm_svc_up
  • The ntf agent on SC-1 detects that the ntf server on SC-3 is down.
<142>1 2021-09-04T07:10:34.396969+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308753"] MDTM: Processing pollin events
<142>1 2021-09-04T07:10:34.396975+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308754"] MDTM: Received SVC event
<143>1 2021-09-04T07:10:34.396979+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308755"] >> mds_mcm_svc_down
<143>1 2021-09-04T07:10:34.396983+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308756"] MCM:API: LOCAL SVC INFO  : svc_id = NTFA(29) | PWE id = 1 | VDEST id = 65535 |
<143>1 2021-09-04T07:10:34.396986+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308757"] MCM:API: REMOTE SVC INFO : svc_id = NTFS(28) | PWE id = 1 | VDEST id = 14 | POLICY = 2 | SCOPE = 4 | ROLE = 1 | MY_PCON = 0 |
<143>1 2021-09-04T07:10:34.396989+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308758"] >> mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.396992+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308759"] << mds_svc_tbl_query
<143>1 2021-09-04T07:10:34.396995+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308760"] >> mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.396998+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308761"] << mds_subtn_tbl_get_details
<143>1 2021-09-04T07:10:34.397001+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308762"] >> mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.397003+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308763"] << mds_mcm_validate_scope
<143>1 2021-09-04T07:10:34.397006+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308764"] >> mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.39701+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308765"] << mds_get_subtn_res_tbl_by_adest
<143>1 2021-09-04T07:10:34.397014+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308766"] >> mds_subtn_res_tbl_get_by_adest
<143>1 2021-09-04T07:10:34.397016+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308767"] << mds_subtn_res_tbl_get_by_adest
<143>1 2021-09-04T07:10:34.397019+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308768"] >> mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.397022+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308769"] << mds_subtn_res_tbl_get
<143>1 2021-09-04T07:10:34.397025+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308770"] >> mds_subtn_res_tbl_del
<143>1 2021-09-04T07:10:34.397029+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308771"] >> dump_subt_res_tree...
<143>1 2021-09-04T07:10:34.397157+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308805"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2030f0000018e
<143>1 2021-09-04T07:10:34.397159+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308806"] sub_svc_id 28 adest 2020f0000018e vdest_id 14 
<143>1 2021-09-04T07:10:34.397162+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308807"] sub_svc_id 28 adest 2030f0000018e vdest_id 14 
<143>1 2021-09-04T07:10:34.397165+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308808"] sub_svc_id 28 adest 2010f0000019d vdest_id 14 
...
<143>1 2021-09-04T07:10:34.397186+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308813"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.397189+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308814"] >> dump_subt_res_tree
...
<143>1 2021-09-04T07:10:34.397314+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308848"] sub_svc_id 28 adest 0 vdest_id 14 next_active_adest 2030f0000018e
<143>1 2021-09-04T07:10:34.397321+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308849"] sub_svc_id 28 adest 2020f0000018e vdest_id 14 
<143>1 2021-09-04T07:10:34.397325+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308850"] sub_svc_id 28 adest 2010f0000019d vdest_id 14 
...
<143>1 2021-09-04T07:10:34.39735+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308855"] << dump_subt_res_tree
<143>1 2021-09-04T07:10:34.397353+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308856"] << mds_subtn_res_tbl_del
<143>1 2021-09-04T07:10:34.397356+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308857"] >> mds_subtn_res_tbl_getnext_any
<143>1 2021-09-04T07:10:34.397365+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308858"] << mds_subtn_res_tbl_getnext_any
  • MDS removes the active destination of the NTFS service although the NTFS service on SC-1 is still active (a simplified model of this flow is sketched after these logs).
<143>1 2021-09-04T07:10:34.39737+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308859"] >> mds_subtn_res_tbl_remove_active
<143>1 2021-09-04T07:10:34.397373+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308860"] MCM_DB:RemoveActive:TimerStart:AwaitActiveTmri:Hdl=0xfc300003:SvcHdl=562945658454045:sbscr-svcid=NTFS(28),vdest=14
<143>1 2021-09-04T07:10:34.397377+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308861"] << mds_subtn_res_tbl_remove_active
<143>1 2021-09-04T07:10:34.397379+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308862"] >> mds_mcm_user_event_callback
<143>1 2021-09-04T07:10:34.397382+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308863"] >> mds_svc_tbl_get
<143>1 2021-09-04T07:10:34.397385+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308864"] << mds_svc_tbl_get
<143>1 2021-09-04T07:10:34.397387+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308865"] >> mds_subtn_tbl_get
<143>1 2021-09-04T07:10:34.397395+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308866"] << mds_subtn_tbl_get
<143>1 2021-09-04T07:10:34.397423+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308867"] << mds_mcm_user_event_callback
<142>1 2021-09-04T07:10:34.397425+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308868"] MCM:API: svc_down : svc_id = NTFA(29) on DEST id = 65535 got NO_ACTIVE for svc_id = NTFS(28) on Vdest id = 14 Adest = <rem_node[0x2030f]:dest_pid[398]>, rem_svc_pvt_ver=1
<143>1 2021-09-04T07:10:34.397429+02:00 SC-1 osafamfd 439 mds.log [meta sequenceId="308869"] << mds_mcm_svc_down
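
To summarise the flow above, here is a small, self-contained model of the pre-fix behaviour. It is an illustration only: the type and function names (vdest_entry, on_remote_active_up, on_remote_down) are hypothetical stand-ins for the subscription-result-table handling visible in the log (mds_subtn_res_tbl_add, mds_subtn_res_tbl_del, mds_subtn_res_tbl_remove_active), not the real MDS code.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for one MxN vdest entry in the MDS
 * subscription result table ("adest 0 ... next_active_adest" in the dumps). */
typedef struct {
        uint64_t active_adest;  /* current active destination (0 = none) */
        uint64_t members[8];    /* known adests providing this vdest     */
        int member_count;
} vdest_entry;

/* Pre-fix behaviour: a remote instance coming up with ROLE=active always
 * overwrites the current active adest, even if that one is still alive. */
static void on_remote_active_up(vdest_entry *v, uint64_t adest)
{
        v->members[v->member_count++] = adest;
        v->active_adest = adest;        /* SC-3 replaces SC-1 here */
}

static void on_remote_down(vdest_entry *v, uint64_t adest)
{
        /* member entry removal omitted for brevity */
        if (v->active_adest == adest) {
                v->active_adest = 0;    /* await-active timer starts */
                printf("NCSMDS_NO_ACTIVE delivered to the agent\n");
        }
}

int main(void)
{
        /* SC-1's own NTFS is the active destination to begin with. */
        vdest_entry ntfs = { .active_adest = 0x2010f0000019d,
                             .members = { 0x2010f0000019d },
                             .member_count = 1 };
        on_remote_active_up(&ntfs, 0x2030f0000018e); /* ntfs on SC-3 comes up */
        on_remote_down(&ntfs, 0x2030f0000018e);      /* SC-3 reboots          */
        return 0;
}

Running it prints NCSMDS_NO_ACTIVE once, mirroring what the ntf agent on SC-1 received even though its local ntf server (adest 2010f0000019d) was still active.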

Related

Wiki: ChangeLog-5.22.01

Discussion

  • Hieu Hong Hoang

    Hieu Hong Hoang - 2021-09-06
    • assigned_to: Hieu Hong Hoang
     
  • Gary Lee

    Gary Lee - 2021-09-14
    • Milestone: 5.21.09 --> 5.21.12
     
  • Hieu Hong Hoang

    Hieu Hong Hoang - 2021-10-15
    • summary: mds: Wrong sending NO_ACTIVE after split brain detection --> mds: Sending wrong NO_ACTIVE after split brain detection
    • status: assigned --> accepted
     
  • Hieu Hong Hoang

    Hieu Hong Hoang - 2021-10-18

    When there is a conflict between active MxN vdests, we should keep the current active one. Only change the active entry when the current active vdest is down.

     
  • Hieu Hong Hoang

    Hieu Hong Hoang - 2021-10-18
    • status: accepted --> review
     
  • Hieu Hong Hoang

    Hieu Hong Hoang - 2021-11-01

    commit 203ff838c39910902eb16965e964a580b0cffe3b (HEAD -> develop, origin/develop, ticket-3281)
    Author: hieu.h.hoang <hieu.h.hoang@dektech.com.au>
    Date: Mon Oct 18 08:59:03 2021 +0700

    mds: Resolve active MxN VDEST conflict in split brain [#3281]
    
    If a split brain happens and the network then merges back, one active SC may stay alive because the other active SCs reboot too fast. Although the surviving SC does not detect the split brain, it still detects that some services are up on the rebooting active SCs.
    In the ticket description, the state of the cluster before the network merges is [[SC-1(ACT), SC-2(STB)], [SC-3(ACT), SC-4, SC-5(STB), SC-6], [SC-7(STB), SC-8(ACT), SC-9], [SC-10(ACT)]]. On SC-1, the ntfa received a wrong NCSMDS_NO_ACTIVE in the following scenario:
    - Detect that the ntfs on the active SC-3 is up.
    - Because the ntfs on SC-1 is active, no SVC up event is generated.
    - Update the active vdest to the new ntfs (on SC-3).
    - Detect that the ntfs on the active SC-3 is down.
    - The active vdest is removed. An NCSMDS_NO_ACTIVE is generated, although the ntfs on SC-1 is still active.
    Solution: do not replace the current active vdest with a new active vdest. When the current active vdest goes down, replace it with another active vdest if one exists.
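
    A minimal sketch of this rule, reusing the hypothetical vdest_entry model from the sketch in the Reason section (illustration only, not the actual patch):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
            uint64_t active_adest;  /* current active destination (0 = none) */
            uint64_t members[8];    /* known adests providing this vdest     */
            int member_count;
    } vdest_entry;

    /* Keep the current active: a newly detected active instance is recorded
     * as a member but does not displace an existing active destination. */
    static void on_remote_active_up(vdest_entry *v, uint64_t adest)
    {
            v->members[v->member_count++] = adest;
            if (v->active_adest == 0)
                    v->active_adest = adest;
    }

    /* Only when the current active itself goes down is it replaced, preferring
     * another known active instance; NO_ACTIVE is raised only if none is left. */
    static void on_remote_down(vdest_entry *v, uint64_t adest,
                               uint64_t other_active)
    {
            /* member entry removal omitted for brevity */
            if (v->active_adest != adest)
                    return;                 /* current active is unaffected */
            if (other_active != 0)
                    v->active_adest = other_active;
            else {
                    v->active_adest = 0;
                    printf("NCSMDS_NO_ACTIVE delivered to the agent\n");
            }
    }

    int main(void)
    {
            vdest_entry ntfs = { .active_adest = 0x2010f0000019d, /* SC-1 */
                                 .members = { 0x2010f0000019d },
                                 .member_count = 1 };
            on_remote_active_up(&ntfs, 0x2030f0000018e); /* ntfs on SC-3 up */
            on_remote_down(&ntfs, 0x2030f0000018e, 0);   /* SC-3 reboots    */
            /* active_adest is still 0x2010f0000019d; no NO_ACTIVE is sent. */
            return 0;
    }

    With this rule, the SC-3 up/down sequence from the logs leaves SC-1's own ntf server as the active destination, and no NO_ACTIVE reaches the agent.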
    
     
  • Hieu Hong Hoang

    Hieu Hong Hoang - 2021-11-01
    • status: review --> fixed
     
