#2451 clm: Make the cluster reset admin op safe

future
unassigned
None
enhancement
clm
-
major
False
2021-09-14
2017-05-03
No

The cluster reset admin operation that was implemented in ticket [#2053] is not safe: if a node reboots very fast it can come up again and join the old cluster before other nodes have rebooted. See mail discussion:

https://sourceforge.net/p/opensaf/mailman/message/35398725/

This can be solved by implementing a two-phase cluster reset or by introducing a cluster generation number which is increased at each cluster reset (maybe for both ordered and spontaneous cluster resets). A node will not be allowed to join the cluster with a different cluster generation without first rebooting.
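
A minimal sketch of the cluster generation idea, not taken from any OpenSAF patch. It assumes each node persists the generation it received with the reset order and adopts it at boot, and that a join is allowed only when every running member has the same generation. The names `Node`, `receive_reset_order` and `may_join` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    stored_generation: int                      # assumed to survive a reboot
    running_generation: Optional[int] = None    # None while the node is down

    def receive_reset_order(self, new_generation: int) -> None:
        # Persist the new generation before going down for the reset.
        self.stored_generation = new_generation

    def shut_down(self) -> None:
        self.running_generation = None

    def boot(self) -> None:
        # A freshly booted node runs with the generation it has stored.
        self.running_generation = self.stored_generation


def may_join(joiner: Node, members: List[Node]) -> bool:
    """Allow the join only if all running members share the joiner's generation."""
    running = [m for m in members if m.running_generation is not None]
    return all(m.running_generation == joiner.running_generation for m in running)


def order_cluster_reset(nodes: List[Node]) -> int:
    """Increase the generation and hand it to every node (hypothetical helper)."""
    new_generation = max(n.stored_generation for n in nodes) + 1
    for n in nodes:
        n.receive_reset_order(new_generation)
    return new_generation
```

With this check, a node that reboots quickly comes up with the new generation while the remaining members still run the old one, so its join is refused until they have rebooted as well.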

Related

Tickets: #2053
Tickets: #2542

Discussion

  • Rafael Odzakow

    Rafael Odzakow - 2017-06-29

    For the node that is not allowed to join the CLM cluster will this solution also block IMM (and other services) from starting up?

     
  • Anders Widell

    Anders Widell - 2017-06-29

    Ideally yes, though then we are talking about a full clustering solution.

     
  • Anders Widell

    Anders Widell - 2017-07-01
    • Milestone: 5.17.08 --> 5.17.10
     
  • Hans Nordebäck

    Hans Nordebäck - 2017-09-15
    • status: unassigned --> review
    • assigned_to: Hans Nordebäck
     
  • Zoran Milinkovic

    I attached one idea (prototype) for a safe cluster restart.
    The attached file contains small changes in IMM and CLM.

    The idea is that when a cluster restart is invoked by the CLM admin operation, CLM first disables sync in IMM (a change in IMM) and then continues with rebooting the nodes.

    If a rebooted node comes up too fast, before the last IMM veteran node goes down, IMM sync will not be possible, and the node will hang in the NID phase waiting for the sync.
    When the last IMM veteran node goes down, IMMD will start electing a new coordinator. Since there is no veteran node left in the cluster, the new IMM coordinator will start loading data from PBE or an XML file.

    The side effect of the attached file is that some nodes which joined before the last veteran went down can be rebooted again, mostly due to the QUIESCED role in RDE, or because they are payloads running without SC absence allowed.
    There is nothing wrong with rebooting those nodes again. They are still in the OpenSAF startup phase, and no application is up and running yet. So rebooting those nodes is safe.

    The attached file is only a proposal and needs to be split into two tickets, one for IMM (the disable sync feature) and this ticket for CLM.

    For the IMM part, I would like to make the disable sync function a one-way function: once sync is disabled, it cannot be enabled again until the cluster restart is done.
    In the attached file, the disable sync feature can be switched on and off.
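
The sequence described above can be modelled roughly as follows. This is a toy state model, not the attached prototype and not IMM code. It assumes the one-way variant of disable sync, and it assumes the restart counts as done once the new coordinator has loaded, at which point waiting nodes can sync. The class and method names are hypothetical.

```python
class ImmRestartSketch:
    """Toy model of 'disable sync, then reboot' during a cluster restart."""

    def __init__(self, veterans):
        self.veterans = set(veterans)   # nodes that currently hold IMM data
        self.sync_enabled = True
        self.waiting = []               # rebooted nodes blocked waiting for sync
        self.coordinator = None
        self.loaded_from_storage = False

    def disable_sync(self):
        # One-way: there is deliberately no enable_sync() in this sketch.
        self.sync_enabled = False

    def veteran_goes_down(self, name):
        self.veterans.discard(name)
        # Last veteran gone: elect a coordinator among the waiting nodes.
        if not self.veterans and self.waiting:
            self._load_as_coordinator(self.waiting.pop(0))

    def node_comes_up(self, name):
        if not self.veterans:
            # Nobody to sync from, so this node loads from PBE or XML.
            self._load_as_coordinator(name)
            return "loaded"
        if self.sync_enabled:
            self.veterans.add(name)
            return "synced"
        # Came up too fast: old veterans still exist but sync is disabled.
        self.waiting.append(name)
        return "waiting"

    def _load_as_coordinator(self, name):
        self.coordinator = name
        self.loaded_from_storage = True   # stands in for loading from PBE/XML
        # Assumption: the restart is done here, so sync works again.
        self.sync_enabled = True
        self.veterans = {name, *self.waiting}
        self.waiting.clear()
```

A node that comes back while old veterans are still running gets `"waiting"`, and it only becomes coordinator and loads from storage after the last veteran has gone down.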

     
  • Hans Nordebäck

    Hans Nordebäck - 2017-09-26

    As an alternative to the patches already sent out, I'll send out another patch that "emulates pxe" for review during this week.

     
  • Anders Widell

    Anders Widell - 2017-10-30
    • Milestone: 5.17.10 --> 5.18.01
     
  • Anders Widell

    Anders Widell - 2018-02-02
    • Milestone: 5.18.01 --> 5.18.04
     
  • Anders Widell

    Anders Widell - 2018-04-20
    • Milestone: 5.18.04 --> 5.18.06
     
  • Gary Lee

    Gary Lee - 2018-06-29
    • Milestone: 5.18.06 --> 5.18.08
     
  • Gary Lee

    Gary Lee - 2018-09-29
    • Milestone: 5.18.09 --> 5.18.12
     
  • Gary Lee

    Gary Lee - 2019-01-09
    • Milestone: 5.19.01 --> 5.19.03
     
  • Gary Lee

    Gary Lee - 2019-03-26
    • Milestone: 5.19.03 --> 5.19.06
     
  • Gary Lee

    Gary Lee - 2019-07-23
    • Milestone: 5.19.07 --> 5.19.10
     
  • Gary Lee

    Gary Lee - 2019-10-21
    • Milestone: 5.19.10 --> 5.20.01
     
  • Gary Lee

    Gary Lee - 2020-02-15
    • Milestone: 5.20.02 --> 5.20.05
     
  • Gary Lee

    Gary Lee - 2020-05-30
    • Milestone: 5.20.05 --> 5.20.08
     
  • Gary Lee

    Gary Lee - 2020-08-31
    • Milestone: 5.20.08 --> 5.20.11
     
  • Gary Lee

    Gary Lee - 2020-12-01
    • Milestone: 5.20.11 --> 5.21.03
     
  • Gary Lee

    Gary Lee - 2021-03-01
    • Milestone: 5.21.03 --> 5.21.06
     
  • Gary Lee

    Gary Lee - 2021-06-01
    • Milestone: 5.21.06 --> 5.21.10
     
  • Gary Lee

    Gary Lee - 2021-09-14
    • status: review --> unassigned
    • Milestone: 5.21.09 --> future
     