#3309 amf: the payload node unexpectedly left cluster right after failover

5.22.06
fixed
None
defect
amf
d
major
False
2022-06-01
2022-02-24
No

After the active SC rebooted, the standby SC failed over to active. The new active SC then reported that a PL had left the cluster, although that PL was still in the cluster. The cause is that the connection between the standby SC and that PL had been dropped earlier, while the PL stayed connected to the active SC. As a result, the standby SC recorded the PL as absent, even though the connection was re-established afterwards. The standby SC only changes the PL state when it receives a checkpoint from the active SC. However, the active SC does not send that checkpoint, because it never lost its own connection to the PL. During failover, the standby SC reports every node it has recorded as absent as having left the cluster.

                                     absent nodes:PL-3           absent nodes:PL-3
SC-1(Act)----SC-2(Stb)    SC-1(Act)----SC-2(Stb)       SC-1(Act)----SC-2(Stb)
    \        /                \                            \        /
     \      /                  \                            \      /
       PL-3                      PL-3                         PL-3

                     absent nodes:PL-3,SC-1
          SC-1(Down)   SC-2(Stb)            SC-1(Stb)----SC-2(Act)
                       /                        \        /
                      /                          \      /
                 PL-3                              PL-3
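
The bookkeeping problem described above can be illustrated with a small toy model. This is only a sketch of the behaviour as the description states it, not OpenSAF code: the class `StandbyView` and all of its method names are hypothetical, and the real amfd state handling is more involved.

```python
class StandbyView:
    """Toy model of the standby SC's absent-node record (hypothetical names)."""

    def __init__(self):
        self.absent_nodes = set()

    def on_lost_contact(self, node):
        # The standby records a node as absent when its own link to it drops.
        self.absent_nodes.add(node)

    def on_established_contact(self, node):
        # Per the description, reconnecting alone does not clear the record.
        pass

    def on_checkpoint_node_up(self, node):
        # Only a checkpoint from the active SC changes the recorded state.
        self.absent_nodes.discard(node)

    def failover(self):
        # On failover, every recorded absent node is reported as having left.
        return sorted(self.absent_nodes)


standby = StandbyView()
standby.on_lost_contact("PL-3")         # link SC-2 <-> PL-3 drops
standby.on_established_contact("PL-3")  # link comes back, record is stale
standby.on_lost_contact("SC-1")         # active SC reboots
reported = standby.failover()
print(reported)  # PL-3 is reported although it never left the cluster
```

Because the active SC never saw PL-3 leave, it has no reason to send a checkpoint for it, so `on_checkpoint_node_up` is never called in this sequence and the stale entry survives until failover.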

Log analysis:

  • SC-2 (standby SC) lost contact with PL-3
    2022-02-23 09:03:24.114 SC-2 osafdtmd[320]: NO Lost contact with 'PL-3'

  • SC-2 (standby SC) re-established contact with PL-3
    2022-02-23 09:03:24.513 SC-2 osafdtmd[320]: NO Established contact with 'PL-3'

  • SC-2 finished the failover:
    2022-02-23 09:03:25.582 SC-2 osafamfd[422]: NO FAILOVER StandBy --> Active DONE!

  • SC-2 reported that PL-3 had left the cluster:
    2022-02-23 09:03:25.679 SC-2 osafamfd[422]: NO Node 'PL-3' left the cluster

  • State of nodes:
    safAmfNode=PL-3,safAmfCluster=myAmfCluster
    saAmfNodeAdminState=UNLOCKED(1)
    saAmfNodeOperState=DISABLED(2)
    safAmfNode=PL-4,safAmfCluster=myAmfCluster
    saAmfNodeAdminState=UNLOCKED(1)
    saAmfNodeOperState=ENABLED(1)
    safAmfNode=PL-5,safAmfCluster=myAmfCluster
    saAmfNodeAdminState=UNLOCKED(1)
    saAmfNodeOperState=ENABLED(1)
    safAmfNode=SC-1,safAmfCluster=myAmfCluster
    saAmfNodeAdminState=UNLOCKED(1)
    saAmfNodeOperState=ENABLED(1)
    safAmfNode=SC-2,safAmfCluster=myAmfCluster
    saAmfNodeAdminState=UNLOCKED(1)
    saAmfNodeOperState=ENABLED(1)
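
The pattern in the log analysis above (a node reported as having left although the last link event for it was a reconnect) can be searched for mechanically. The following is a minimal sketch that assumes the syslog message texts are exactly those quoted in this ticket; the function name `suspicious_departures` is hypothetical and the script is not part of OpenSAF.

```python
import re

# Log lines copied from this ticket.
LOG = """\
2022-02-23 09:03:24.114 SC-2 osafdtmd[320]: NO Lost contact with 'PL-3'
2022-02-23 09:03:24.513 SC-2 osafdtmd[320]: NO Established contact with 'PL-3'
2022-02-23 09:03:25.582 SC-2 osafamfd[422]: NO FAILOVER StandBy --> Active DONE!
2022-02-23 09:03:25.679 SC-2 osafamfd[422]: NO Node 'PL-3' left the cluster
"""

EVENTS = [
    (re.compile(r"Lost contact with '([^']+)'"), "lost"),
    (re.compile(r"Established contact with '([^']+)'"), "established"),
    (re.compile(r"Node '([^']+)' left the cluster"), "left"),
]


def suspicious_departures(log_text):
    """Return nodes reported as left while their last link event was a reconnect."""
    last_link_event = {}
    suspects = []
    for line in log_text.splitlines():
        for pattern, kind in EVENTS:
            match = pattern.search(line)
            if not match:
                continue
            node = match.group(1)
            if kind == "left":
                if last_link_event.get(node) == "established":
                    suspects.append(node)
            else:
                last_link_event[node] = kind
            break
    return suspects


print(suspicious_departures(LOG))
```

A node whose last link event was a genuine loss of contact is not flagged, so ordinary node departures do not show up in the result.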

Steps to reproduce:

  1. Drop the connection between the standby SC-2 and PL-3
  2. Reconnect SC-2 to PL-3
  3. Run "immdump" on any node (immd on the standby SC-2 then removes PL-3 from its list of detached nodes)
  4. Reboot the active SC-1
  5. Run "amf-state node" on any node
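
To check the result of step 5 without reading the output by eye, the node states can be parsed. The sketch below assumes the output has the same three-lines-per-node layout as the "State of nodes" listing in this ticket; the helper name `disabled_nodes` is hypothetical, and the sample text is an excerpt of that listing rather than a fresh capture.

```python
# Excerpt of the "State of nodes" listing from this ticket.
SAMPLE = """\
safAmfNode=PL-3,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=DISABLED(2)
safAmfNode=PL-4,safAmfCluster=myAmfCluster
saAmfNodeAdminState=UNLOCKED(1)
saAmfNodeOperState=ENABLED(1)
"""


def disabled_nodes(text):
    """Return the names of nodes whose saAmfNodeOperState is DISABLED."""
    current = None
    disabled = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("safAmfNode="):
            # "safAmfNode=PL-3,safAmfCluster=..." -> "PL-3"
            current = line.split(",", 1)[0].split("=", 1)[1]
        elif line.startswith("saAmfNodeOperState=") and current is not None:
            state = line.split("=", 1)[1]
            if state.startswith("DISABLED"):
                disabled.append(current)
    return disabled


print(disabled_nodes(SAMPLE))
```

If the defect reproduces, PL-3 appears in the returned list even though it is still connected to the cluster.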

Related

Wiki: ChangeLog-5.22.06

Discussion

  • Hieu Hong Hoang

    Hieu Hong Hoang - 2022-02-28
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,13 +1,13 @@
    -After the active SC rebooted, the standby SC executed failover to active. The new active SC notified a PL left cluster but that PL was still in cluster. The reason is the connection between the standby SC and that PL was dropped in the past, but that PL still connected with the active SC. It led the standby SC considered that PL is down regardless the connection was established after that. The standby SC only removes a down PL when it receives a check point from the active SC. However, the active SC will not send that check point because it still connect with the PL. During failover, the standby SC will notify all recorded down nodes left cluster.
    +After the active SC rebooted, the standby SC executed failover to active. The new active SC notified a PL left cluster but that PL was still in cluster. The reason is the connection between the standby SC and that PL was dropped in the past, but that PL still connected with the active SC. It led the standby SC considered that PL absented regardless the connection was established after that. The standby SC only change the PL  state when it receives a check point from the active SC. However, the active SC will not send that check point because it still connect with the PL. During failover, the standby SC will notify all recorded absent nodes left cluster.
     <pre>
    
    -                                    down list:PL-3             down list:PL-3
    -SC-1(Act)----SC-2(Stb)   SC-1(Act)----SC-2(Stb)        SC-1(Act)----SC-2(Stb)
    -    \        /               \                          \        / 
    -     \      /                 \                          \      /
    -       PL-3                     PL-3                       PL-3
    +                                     absent nodes:PL-3           absent nodes:PL-3
    +SC-1(Act)----SC-2(Stb)    SC-1(Act)----SC-2(Stb)       SC-1(Act)----SC-2(Stb)
    +    \        /                \                            \        /
    +     \      /                  \                            \      /
    +       PL-3                      PL-3                         PL-3
    
    
    -                     down list:PL-3,SC-1
    -          SC-1(Down)   SC-2(Stb)            SC-1(Stb)----SC-2(Atc)
    +                     absent nodes:PL-3,SC-1
    +          SC-1(Down)   SC-2(Stb)            SC-1(Stb)----SC-2(Act)
                            /                        \        /
                           /                          \      /
                      PL-3                              PL-3
    
    • Component: clm --> amf
    • Part: - --> d
     
  • Thang Duc Nguyen

    • summary: clm: the payload node unexpectedly left cluster right after failover --> amf: the payload node unexpectedly left cluster right after failover
    • assigned_to: Hieu Hong Hoang --> Thang Duc Nguyen
     
  • Thang Duc Nguyen

    • status: accepted --> review
     
  • Thang Duc Nguyen

    • status: review --> fixed
     
  • Thang Duc Nguyen

    commit f7e9ed4cee2d95490a3d5c05676dc6c512d08b9a (HEAD -> develop, origin/develop, ticket-3309)
    Author: thang.d.nguyen <thang.d.nguyen@dektech.com.au>
    Date: Fri Mar 4 14:57:19 2022 +0700

    amf: reboot to recovery PL in split-brain [#3309]
    
    The connection between the standby SC and that PL was dropped,
    but that PL still connected with the active SC. It led the
    standby SC considered that PL absented regardless the connection
    was established after that. During failover, the standby SC will
    notify all recorded absent nodes left cluster. It causes PL left
    cluster from AMF view but still connect to active.
    
    This scenario is a kind of split-brain use case and amfd should
    order PL reboot to recovery the issue.
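
The recovery idea in the commit message can be sketched as a decision made at failover time. This is only an illustration under stated assumptions: the commit message does not say how amfd detects the condition, so comparing the recorded absent nodes against the currently connected nodes is an assumption, and `failover_actions` and the action strings are hypothetical names, not OpenSAF identifiers.

```python
def failover_actions(absent_nodes, connected_nodes):
    """Decide what to do with each recorded absent node at failover.

    Assumption: a node that is recorded as absent but is still connected
    is treated as the split-brain case and is ordered to reboot.
    """
    actions = {}
    for node in sorted(absent_nodes):
        if node in connected_nodes:
            actions[node] = "order_reboot"   # stale record, node is alive
        else:
            actions[node] = "report_left"    # node is really gone
    return actions


actions = failover_actions({"PL-3", "SC-1"}, {"PL-3"})
print(actions)
```

In the scenario from this ticket, PL-3 would be rebooted so that it rejoins with a consistent state, while SC-1 is reported as having left as before.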
    
     
