
#2920 amfd: cyclic SC reboot after split network

5.18.09
fixed
Gary Lee
None
defect
amf
d
major
False
2018-09-05
2018-08-30
Gary Lee
No

After a split network event, both SCs can reboot endlessly, due to this assertion:

2018-08-29 18:05:34.689 SC-2 osafamfd[263]: src/amf/amfd/sg_2n_fsm.cc:596: avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
2018-08-29 18:05:34.695 SC-2 osafamfnd[273]: ER AMFD has unexpectedly crashed. Rebooting node

To reproduce, enable SC absence and split the network into two partitions:

Partition 1 contains SC-1, PL-3
Partition 2 contains SC-2, PL-4, PL-5

Before the split, PL-3 is active for a 2N SG. PL-4 is standby.

2018-08-30 19:06:53.913 PL-3 osafamfnd[204]: NO Assigning 'safSi=A,safApp=AmfDemo' ACTIVE to 'safSu=1,safSg=1,safApp=AmfDemo'
2018-08-30 19:06:53.944 PL-3 osafamfnd[204]: NO Assigned 'safSi=A,safApp=AmfDemo' ACTIVE to 'safSu=1,safSg=1,safApp=AmfDemo'

2018-08-30 19:06:54.094 PL-4 osafamfnd[204]: NO Assigning 'safSi=A,safApp=AmfDemo' STANDBY to 'safSu=2,safSg=1,safApp=AmfDemo'
2018-08-30 19:06:54.128 PL-4 osafamfnd[204]: NO Assigned 'safSi=A,safApp=AmfDemo' STANDBY to 'safSu=2,safSg=1,safApp=AmfDemo'

During the split, SC-2 may assign PL-4 to be active.

2018-08-30 19:07:04.299 PL-4 osafamfnd[204]: NO Assigning 'safSi=A,safApp=AmfDemo' ACTIVE to 'safSu=2,safSg=1,safApp=AmfDemo'

After the network merges, SC-1 and SC-2 may both reboot after they detect split brain.

2018-08-30 19:07:05.003 SC-1 osafrded[178]: NO Got peer info response from node 0x2020f with role ACTIVE
2018-08-30 19:07:05.003 SC-1 osafrded[178]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131343, SupervisionTime = 60

2018-08-30 19:07:04.999 SC-2 osafrded[180]: NO Got peer info response from node 0x2010f with role ACTIVE
2018-08-30 19:07:05.001 SC-2 osafrded[180]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131599, SupervisionTime = 60
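
The rded logs above print the peer node ID in hex ("node 0x2020f") but the local one in decimal ("OwnNodeId = 131343"). As a reading aid only, a minimal sketch for cross-referencing the two; the helper name `NodeIdToHex` is hypothetical and not part of OpenSAF:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not an OpenSAF function): formats a decimal node ID,
// as printed in "OwnNodeId = ...", the way the "peer info response" log line
// prints it ("node 0x...").
std::string NodeIdToHex(std::uint32_t node_id) {
  std::ostringstream out;
  out << "0x" << std::hex << node_id;
  return out.str();
}
```

With this, 131343 (SC-1) maps to 0x2010f and 131599 (SC-2) maps to 0x2020f, so each SC is seeing the other one report role ACTIVE.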

Then PL-3 and PL-4 both sync their active assignments to AMFD. The duplicated active assignments for the same 2N SG trigger the assertion in AMFD.

2018-08-30 19:08:43.974 SC-1 osafamfd[267]: NO Perform absent failover for failed SU:safSu=1,safSg=1,safApp=AmfDemo
2018-08-30 19:08:43.975 SC-1 osafamfd[267]: src/amf/amfd/sg_2n_fsm.cc:596: avd_sg_2n_act_susi: Assertion 'a_susi_1->su == a_susi_2->su' failed.
2018-08-30 19:08:43.981 SC-1 osafamfnd[282]: ER AMFD has unexpectedly crashed. Rebooting node
2018-08-30 19:08:43.982 SC-1 osafamfnd[282]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: AMFD has unexpectedly crashed. Rebooting node, OwnNodeId = 131343, SupervisionTime = 60

The user must then manually recover the cluster by doing a cluster reboot, or rebooting one of PL-3 / PL-4.

[#2918] addresses issues such as this. For now, we can aid recovery of the cluster by rebooting one or both of the PLs instead of asserting.
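
The idea of replacing the assertion with a reboot decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: `HaState`, `ReportedAssignment` and `NodesToReboot` are hypothetical stand-ins rather than amfd types, all reported assignments are assumed to belong to the same 2N SG, and the sketch picks every node hosting a conflicting active SU, whereas the real fix may choose only one of them.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of an SU-SI assignment that a payload
// node reports to AMFD during sync. These are not the real amfd types.
enum class HaState { Active, Standby };

struct ReportedAssignment {
  std::string node;  // e.g. "PL-3"
  std::string su;    // e.g. "safSu=1,safSg=1,safApp=AmfDemo"
  std::string si;    // e.g. "safSi=A,safApp=AmfDemo"
  HaState state;
};

// Assumes every entry in `reported` belongs to the same 2N SG.
// A 2N SG may have at most one SU with active assignments. If more than
// one SU reports ACTIVE, return the nodes hosting those SUs so the caller
// can reboot them instead of asserting. Returns an empty set otherwise.
std::set<std::string> NodesToReboot(
    const std::vector<ReportedAssignment>& reported) {
  std::set<std::string> active_sus;
  std::set<std::string> active_nodes;
  for (const auto& a : reported) {
    if (a.state == HaState::Active) {
      active_sus.insert(a.su);
      active_nodes.insert(a.node);
    }
  }
  if (active_sus.size() <= 1) {
    return {};  // No conflict: zero or one active SU.
  }
  return active_nodes;
}
```

For the scenario in this ticket, PL-3 (safSu=1) and PL-4 (safSu=2) both report ACTIVE for safSi=A, so the sketch would return both nodes. Before the split (PL-3 active, PL-4 standby) it returns nothing.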

Related

Tickets: #2918
Tickets: #2920
Wiki: ChangeLog-5.18.09

Discussion

  • Gary Lee - 2018-08-31
    • status: unassigned --> review
    • assigned_to: Gary Lee
    • Part: - --> d
     
  • Gary Lee - 2018-09-05

    develop:

    commit f238ceb4dfe5ccc81ae9d921a26d944189bd20ab
    Author: Gary Lee <gary.lee@dektech.com.au>
    Date:   Wed Sep 5 00:43:46 2018 +0000
    
    amfd: reboot nodes that report conflicting 2N active assignments [#2920]
    

    release:

    commit 24e532611e400e85b6f03256d1111e2bdbb0d277
    Author: Gary Lee <gary.lee@dektech.com.au>
    Date:   Wed Sep 5 00:43:46 2018 +0000
    
    amfd: reboot nodes that report conflicting 2N active assignments [#2920]
    
     

    Related

    Tickets: #2920

  • Gary Lee - 2018-09-05
    • status: review --> fixed
     
