
#2094 Standby controller is rebooted when stopping OpenSAF in a STONITH-enabled cluster

Milestone: 5.2.RC1
Status: duplicate
Owner: nobody
Labels: None
Type: defect
Component: fm
Part: -
Version: 5.1 FC
Severity: major
Updated: 2017-03-01
Created: 2016-10-05
Private: No

OS: Ubuntu 64-bit
Changeset: 7997 (5.1.FC)
Setup: 2-node cluster (both controllers), remote fencing enabled

Steps:
1. Bring up OpenSAF on both nodes
2. Enable STONITH
3. Stop OpenSAF on the standby controller

Observed: the active controller triggers a reboot of the standby.
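
In command form, the reproduction amounts to the following (a hedged sketch: the hostnames are this cluster's, and how STONITH/remote fencing is enabled is deployment-specific, so step 2 is only a placeholder):

    ssh SC-1 '/etc/init.d/opensafd start'   # step 1: start OpenSAF on both controllers
    ssh SC-2 '/etc/init.d/opensafd start'
    # step 2: enable STONITH (remote fencing) per the deployment's configuration
    ssh SC-2 '/etc/init.d/opensafd stop'    # step 3: stop OpenSAF on the standby
    # observed: SC-1 (active) fences SC-2 via the libvirt STONITH agent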

SC-1 Syslog

Oct 5 13:01:23 SC-1 osafimmd[5535]: NO MDS event from svc_id 25 (change:4, dest:565215202263055)
Oct 5 13:01:23 SC-1 osafimmnd[5545]: NO Global discard node received for nodeId:2020f pid:3579
Oct 5 13:01:23 SC-1 osafimmnd[5545]: NO Implementer disconnected 14 <0, 2020f(down)> (@safAmfService2020f)
Oct 5 13:01:24 SC-1 osafamfd[5592]: NO Node 'SC-2' left the cluster
Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Node Down event for node id 2020f:
Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Current role: ACTIVE
Oct 5 13:01:24 SC-1 osaffmd[5526]: Rebooting OpenSAF NodeId = 131599 EE Name = SC-2, Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Oct 5 13:01:25 SC-1 external/libvirt[5893]: [5906]: notice: Domain SC-2 was stopped

Oct 5 13:01:27 SC-1 kernel: [ 5355.132093] tipc: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Oct 5 13:01:27 SC-1 kernel: [ 5355.132123] tipc: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 5 13:01:27 SC-1 kernel: [ 5355.132126] tipc: Lost contact with <1.1.2>
Oct 5 13:01:27 SC-1 external/libvirt[5893]: [5915]: notice: Domain SC-2 was started
Oct 5 13:01:42 SC-1 kernel: [ 5370.557180] tipc: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO MDS event from svc_id 25 (change:3, dest:565217457979407)
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO New IMMND process is on STANDBY Controller at 2020f
Oct 5 13:01:42 SC-1 osafimmd[5535]: WA IMMND on controller (not currently coord) requests sync
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO Node 2020f request sync sync-pid:1176 epoch:0
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Announce sync, epoch:4
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO Successfully announced sync. New ruling epoch:4
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Oct 5 13:01:43 SC-1 osafimmloadd: NO Sync starting
Oct 5 13:01:43 SC-1 osafimmloadd: IN Synced 346 objects in total
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 18430
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Epoch set to 4 in ImmModel
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO ACT: New Epoch for IMMND process at node 2010f old epoch: 3 new epoch:4
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO ACT: New Epoch for IMMND process at node 2020f old epoch: 0 new epoch:4
Oct 5 13:01:43 SC-1 osafimmloadd: NO Sync ending normally
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Received node_up from 2020f: msg_id 1
Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Node 'SC-2' joined the cluster
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer connected: 16 (MsgQueueService131599) <467, 2010f>
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer locally disconnected. Marking it as doomed 16 <467, 2010f> (MsgQueueService131599)
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer disconnected 16 <467, 2010f> (MsgQueueService131599)
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Peer up on node 0x2020f
Oct 5 13:01:44 SC-1 osaffmd[5526]: NO clm init OK
Oct 5 13:01:44 SC-1 osafimmd[5535]: NO MDS event from svc_id 24 (change:5, dest:13)
Oct 5 13:01:44 SC-1 osaffmd[5526]: NO Peer clm node name: SC-2
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Got peer info request from node 0x2020f with role STANDBY
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Got peer info response from node 0x2020f with role STANDBY

Related

Tickets: #2160

Discussion

  • Hans Nordebäck

    Hans Nordebäck - 2016-10-06

    This is the same behaviour as running without STONITH or PLM. Without STONITH, OpenSAF tries to reboot the standby controller at opensafd stop, but it needs either PLM or STONITH for that reboot to succeed. Perhaps there is a need to stop OpenSAF without triggering remote fencing? Is this an upgrade case? Perhaps we should create an enhancement ticket for this?

     
  • Mathi Naickan

    Mathi Naickan - 2016-10-06

    This seems to be a case of differentiating a hung node from a node on which the middleware has been stopped.

    Is there any standard means to detect a "hung" node? If there is such a mechanism, then upon receiving "NODE_DOWN", i.e. the event below:
    "Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Node Down event for node id 2020f:"
    FM could use a command (say, a libvirt command) to detect whether the node is hung or running healthily. If it is running healthily, a reboot via STONITH could be avoided.

    OpenSAF did support the use case of "/etc/init.d/opensafd stop" without an OS reboot. Should we continue to support that?
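
    A minimal sketch of this idea in shell, assuming the standby runs as a libvirt domain named "SC-2" as in the logs above. Caveat: virsh reports the VM's state, not the health of the OS inside it, so on its own this cannot distinguish a hung OS from a healthy one:

        state=$(virsh domstate SC-2 2>/dev/null)
        if [ "$state" = "running" ]; then
            echo "domain is up; the node-down was likely a graceful opensafd stop"
        else
            echo "domain is down or unreachable; proceed with STONITH"
        fi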

     
  • Anders Widell

    Anders Widell - 2016-10-06

    I think the procedure for stopping OpenSAF in a controlled way is to first lock the node using CLM. The CLM lock admin operation will remove the node from cluster membership. Then it should be safe to stop OpenSAF on that node without getting fenced; i.e., we should not fence a node that we lost contact with if that node was not a member of the cluster.
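
    A minimal sketch of that procedure, assuming the default OpenSAF sample DN "safNode=SC-2,safCluster=myClmCluster" and admin operation ID 2 (SA_CLM_ADMIN_LOCK in the CLM specification):

        immadm -o 2 safNode=SC-2,safCluster=myClmCluster   # CLM lock: SC-2 leaves cluster membership
        ssh SC-2 '/etc/init.d/opensafd stop'               # now safe: SC-2 is no longer a member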

     
  • Chani Srivastava

    Is STONITH applicable only to controllers? No reboot was observed when stopping OpenSAF on a payload node.

     
  • Hans Nordebäck

    Hans Nordebäck - 2016-10-13

    Split brain can only happen between the system controllers, so fencing applies only to them.

     
  • Chani Srivastava

    Can you provide documentation on how to stop OpenSAF in a controlled manner, so that I can close the ticket?

     
  • Hans Nordebäck

    Hans Nordebäck - 2016-11-02

    Ticket [#2160] will add support for differentiating a hung node from a stopped node; no additional documentation will be needed.

     


  • Srikanth R

    Srikanth R - 2016-11-07

    There are two scenarios in which "opensafd stop" is invoked on an OpenSAF controller:

    SCENARIO 1: The /etc/init.d/opensafd script is invoked manually at the command prompt while the system is up and running.
    SCENARIO 2: Software on the controller (other than opensafd) invokes "reboot", as part of which opensafd stop is invoked at run level 3 or higher.

    With the patch submitted for #2160:

    a) In scenario 1, the node will be rebooted if the administrator does not invoke the CLM admin operation. This is fine.

    b) In scenario 2, not all run-level services will be stopped gracefully, because the node is rebooted abruptly after opensafd stop when the administrator has not invoked the CLM admin operation (see the ordering sketch below). So, with the #2160 fix, will opensafd as HA software not support a graceful reboot of the standby controller?
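
    To make the ordering problem in scenario 2 concrete: on a sysvinit system the reboot run level stops services in K* order, so with a hypothetical layout like the one below (script names and numbers are illustrative assumptions), fencing the node during the opensafd stop script prevents everything after it from running:

        ls /etc/rc6.d/
        # K01opensafd  K05drbd  ...
        # K01opensafd runs first; if the active controller fences this node
        # while K01opensafd is still running, K05drbd never executes.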

     
  • Hans Nordebäck

    Hans Nordebäck - 2016-11-08

    In a), doing /etc/init.d/opensafd stop does not reboot the node; it stops OpenSAF on that node and saClmNodeIsMember is set to false. The active controller will then not perform remote fencing of that node.
    In b), a "graceful" reboot after opensafd stop should work fine without any involvement of the remote fencing functionality.

     
    • Srikanth R

      Srikanth R - 2016-11-08

      For scenario 2:
      -> Management software, e.g. an SWN, other than opensafd issues a reboot on the standby controller. From the OpenSAF perspective the standby controller may be a healthy member of the cluster, but from the SWN perspective the node needs repair, so a reboot is invoked.

      -> When the reboot command is invoked by the SWN, all services in the configured run level are stopped in order.

      -> Once the opensafd stop script is invoked on the standby controller, the active controller sees a healthy member go down, and remote fencing is performed.

      -> As part of remote fencing the node is hard-rebooted, which gives the remaining run-level services no chance to stop gracefully.

      -> If the SWN has a database service (e.g. drbd) that is to be stopped after opensafd stop, that service's stop script will never be invoked because remote fencing has already happened. This may leave the other management software, e.g. the SWN, in a bad state.

      Suggestion:

      1) Either OpenSAF should document that the admin needs to perform a CLM admin lock of the standby controller before repairing it (see the sketch below), OR
      2) FM should detect the difference between an opensafd stop and hung OpenSAF processes. As part of opensafd stop, the peer fmd on the standby controller can inform the fmd on the active controller that opensafd on the standby is going down gracefully.
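
      A sketch of suggestion 1 as an operator procedure (the DN matches the default OpenSAF sample configuration, and operation IDs 2/1 are SA_CLM_ADMIN_LOCK/SA_CLM_ADMIN_UNLOCK from the CLM specification; both are assumptions about this deployment):

          immadm -o 2 safNode=SC-2,safCluster=myClmCluster   # lock: SC-2 leaves cluster membership
          ssh SC-2 'reboot'                                  # SWN repair action; no fencing, SC-2 is not a member
          # ... after the repair completes:
          immadm -o 1 safNode=SC-2,safCluster=myClmCluster   # unlock: SC-2 rejoins the cluster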

       
  • Hans Nordebäck

    Hans Nordebäck - 2016-11-11

    Agree, "Suggestion: 1" document that admin needs to perform clm admin lock of standby is a good suggestion. The node will then not be a member of the cluster and not affected by remote fencing

     
  • Anders Widell

    Anders Widell - 2017-02-28
    • Milestone: 5.2.FC --> 5.2.RC1
     
  • Hans Nordebäck

    Hans Nordebäck - 2017-03-01

    I suggest closing this ticket as a duplicate of ticket #2160.

     
  • Chani Srivastava

    • status: unassigned --> duplicate
     
  • Chani Srivastava

    Closing as duplicate of #2160

     
