
#3263 rde: Cluster is unrecoverable after all nodes split-brain in roaming SC

5.21.09
fixed
None
defect
rde
d
major
False
2021-09-14
2021-05-14
No

In a Roaming SC deployment, if a split-brain separates all nodes from one another so that each partition contains one SC, every SC becomes active. At rejoin, each SC should detect itself as a duplicate active of one of the other SCs, and ideally they should all reboot.
However, sometimes the last active SC is not detected as a duplicate because all the other SCs have already rebooted, so it finds no other active SC that duplicates it. As a result, because the last SC has not been healthy throughout the split, it causes many errors when the other nodes try to rejoin after their reboot.
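
The race described above can be illustrated with a small simulation. This is only a sketch of the reported symptom, not rde code: the `SC` class, the `rejoin` function, and the node names are hypothetical, and the model assumes that each SC checks for other active SCs one after another and reboots as soon as it finds one.

```python
from dataclasses import dataclass


@dataclass
class SC:
    """Hypothetical model of a system controller after the split."""
    name: str
    active: bool = True      # every SC became active during the split
    rebooted: bool = False


def rejoin(scs, check_order):
    """Let each SC, in the given order, look for another active SC.

    An SC that finds another active SC treats itself as a duplicate
    active and reboots (assumption: a rebooted SC is no longer active).
    Returns the names of the SCs that never rebooted.
    """
    for name in check_order:
        me = scs[name]
        others_active = [s for s in scs.values() if s is not me and s.active]
        if others_active:
            me.active = False
            me.rebooted = True
    return [s.name for s in scs.values() if not s.rebooted]


scs = {n: SC(n) for n in ("SC-1", "SC-2", "SC-3")}
survivors = rejoin(scs, ["SC-1", "SC-2", "SC-3"])
print(survivors)
```

In this model the last SC to check finds no active peer left, so it keeps running in its stale active state, which corresponds to the unhealthy last SC in the report.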

Related

Wiki: ChangeLog-5.21.06
Wiki: ChangeLog-5.21.09

Discussion

  • Minh Hon Chau

    Minh Hon Chau - 2021-05-14
    • status: unassigned --> accepted
     
  • Minh Hon Chau

    Minh Hon Chau - 2021-05-14
    • summary: rde: Cluster is unrecoverable after all node split-brain in roaming SC --> rde: Cluster is unrecoverable after all nodes split-brain in roaming SC
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,2 +1,2 @@
    -In Roaming SC deployment, if split-brain occurs that separate all node apart, in which each partition has one SC, we have all SC becoming active. At rejoin, all SC detects themself as duplicated active to one of other SC, they should all reboot, ideally.
    +In Roaming SC deployment, if split-brain occurs that separates all nodes apart, in which each partition has one SC, we have all SCs becoming active. At rejoin, all SCs detect themself as duplicated active to one of other SCs, they should all reboot, ideally.
     However, sometimes the last active SC is not detected as duplicated because all the other SCs already reboot. The last SC does not find any others as active duplicated to itself. As of this result, since the last SC is not healthy throughout the split time, it's causing many errors for other nodes to rejoin again after reboot.
    
     
  • Minh Hon Chau

    Minh Hon Chau - 2021-05-26
    • status: accepted --> fixed
     
  • Minh Hon Chau

    Minh Hon Chau - 2021-05-26

    commit 68fde36133a5fd47b667c6971c967a7cf8629b03
    Author: Minh Chau minh.chau@dektech.com.au
    Date: Wed May 26 21:05:12 2021 +1000

    rde: Use broadcast for peer info message [#3263]
    

    commit ca0cb78a03a2eb3cfa3519b4c5d9af0905f325a5
    Author: Minh Chau minh.chau@dektech.com.au
    Date: Wed May 26 21:05:12 2021 +1000

    rde: Add timeout waiting for peer info [#3263]
    
     
  • Gary Lee

    Gary Lee - 2021-09-14

    commit bbe47278c2499bc738bf0c2dc8cc4ebbbb9a026d
    Author: Minh Chau minh.chau@dektech.com.au
    Date: Tue Jul 13 18:00:41 2021 +1000

    rde: Add timeout of waiting for peer info [#3263]
    
    This ticket revisits the waiting for peer info and
    fixes the problem of disordered peer_up and peer info
    in the commit d1593b03b3c9bec292b14dde65264c261760bf46
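
The general idea of waiting for peer info with a timeout can be sketched as follows. This is not the code from the commits above: `wait_for_peer_info`, the `inbox` queue, and the message fields are hypothetical, and what rde actually does when the timeout expires is not stated in this ticket, so the sketch only returns `None` and leaves that decision to the caller.

```python
import queue


def wait_for_peer_info(inbox, timeout_s):
    """Return the next peer info message, or None if none arrives in time.

    `inbox` is a hypothetical queue that a receiver would fill with
    peer info messages; it stands in for the real transport.
    """
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        return None


inbox = queue.Queue()
inbox.put({"node": "SC-2", "role": "active"})

first = wait_for_peer_info(inbox, timeout_s=0.1)   # a message is waiting
second = wait_for_peer_info(inbox, timeout_s=0.1)  # nothing left: times out
print(first, second)
```

Bounding the wait means a node is not blocked indefinitely when a peer never reports its state.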
    
     
  • Gary Lee

    Gary Lee - 2021-09-14
    • status: fixed --> assigned
     
  • Gary Lee

    Gary Lee - 2021-09-14
    • status: assigned --> fixed
    • Milestone: 5.21.06 --> 5.21.09
     
