
#599 CLM: CLMD gets ERR_EXIST on implementerSet causes cluster reload

Milestone: never
Status: duplicate
Owner: nobody
Labels: None
Type: defect
Component: clm
Part: -
Version: 4.4.0M0
Severity: critical
Updated: 2014-01-09
Created: 2013-10-21
Private: No

While testing SC failover, I encountered a case where the CLMD on the old
standby, now the new active, becomes active before the IMM has had time to
discard the old implementer for CLM. The new active CLMD tries to set the
implementer and gets ERR_EXIST. This causes CLMD to crash, which escalates to
an SC restart, which here escalates to a cluster reload.

Oct 21 08:31:31 SC-1 user.notice opensafd: OpenSAF(4.4.M0) services successfully started
Oct 21 08:32:20 SC-1 local0.notice osaffmd[377]: NO Role: STANDBY, Node Down for node id: 2020f
Oct 21 08:32:20 SC-1 local0.crit osaffmd[377]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Oct 21 08:32:20 SC-1 user.notice kernel: TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, requested by peer while probing
Oct 21 08:32:20 SC-1 user.notice kernel: TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 21 08:32:20 SC-1 user.notice kernel: TIPC: Lost contact with <1.1.2>
Oct 21 08:32:20 SC-1 user.notice opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Oct 21 08:32:20 SC-1 local0.notice osafrded[368]: NO rde_rde_set_role: role set to 1
Oct 21 08:32:20 SC-1 local0.notice osaflogd[406]: NO ACTIVE request
Oct 21 08:32:20 SC-1 local0.notice osafntfd[419]: NO ACTIVE request
Oct 21 08:32:20 SC-1 local0.notice osafclmd[429]: NO ACTIVE request
Oct 21 08:32:20 SC-1 local0.notice osafamfd[448]: NO FAILOVER StandBy --> Active
Oct 21 08:32:20 SC-1 local0.err osafclmd[429]: ER saImmOiImplementerSet failed rc:14, exiting
Oct 21 08:32:21 SC-1 syslog.info syslogd started: BusyBox v1.19.4
Oct 21 08:32:22 SC-1 user.notice opensafd: Starting OpenSAF Services

Discussion

  • Sirisha Alla

    Sirisha Alla - 2013-10-21
     
  • Anders Bjornerstedt

    Possibly so, but:
    (a) The new ticket system "Service backlog" reports suck. They exclude
    assigned tickets, which is really stupid since tickets that are assigned
    are still really backlogged. Hence I missed detecting #528.

    (b) Ticket #528 should be severity critical. It causes a cluster restart.

    (c) The problem here is not start order, it is timing. There is no
    guarantee that the IMM has had time to clear all implementers from the
    cluster at a failover before services start becoming active on the new
    active. In particular, the first services to start really have to tolerate
    getting ERR_EXIST on implementer set, at least a few times, in the context
    of the failover failure case. This really holds for ALL non-restartable
    OpenSAF services residing on an SC.
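
    The tolerance described above could take the shape of a bounded retry
    loop. The sketch below is illustrative only, not the actual CLMD code:
    the SAF error constants are reproduced from the IMM API, but the
    implementer-set function is a stub that returns ERR_EXIST for the first
    few calls to mimic the failover race, so the sketch compiles standalone.

    ```c
    /* Sketch: tolerate SA_AIS_ERR_EXIST from an implementer-set call during
     * failover by retrying with a backoff, instead of exiting as CLMD does
     * in this ticket. Real code would call saImmOiImplementerSet() from
     * <saImmOi.h>; fake_implementer_set() below is a hypothetical stub. */
    #include <stdio.h>
    #include <unistd.h>

    typedef enum {
        SA_AIS_OK = 1,
        SA_AIS_ERR_TRY_AGAIN = 6,
        SA_AIS_ERR_EXIST = 14
    } SaAisErrorT;

    /* Stub: the old implementer is still registered for the first two calls,
     * mimicking an IMM that has not yet discarded it after failover. */
    static int calls = 0;
    static SaAisErrorT fake_implementer_set(const char *name) {
        (void)name;
        return (++calls < 3) ? SA_AIS_ERR_EXIST : SA_AIS_OK;
    }

    /* Retry ERR_EXIST (and TRY_AGAIN) a bounded number of times, giving the
     * IMM time to clear the old implementer, before giving up for real. */
    static SaAisErrorT implementer_set_with_retry(const char *name,
                                                  int max_tries) {
        SaAisErrorT rc = SA_AIS_ERR_EXIST;
        for (int i = 0; i < max_tries; i++) {
            rc = fake_implementer_set(name);
            if (rc != SA_AIS_ERR_EXIST && rc != SA_AIS_ERR_TRY_AGAIN)
                break;
            usleep(100 * 1000); /* back off 100 ms before retrying */
        }
        return rc;
    }

    int main(void) {
        SaAisErrorT rc = implementer_set_with_retry("safClmService", 10);
        printf("implementerSet result: %d after %d tries\n", rc, calls);
        return rc == SA_AIS_OK ? 0 : 1;
    }
    ```

    With the stub above, the call succeeds on the third try rather than
    taking the daemon down; only if the bound is exhausted would exiting
    (and the resulting escalation) still occur.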

     
  • Anders Bjornerstedt

    Duplicate of #528

     
  • Anders Bjornerstedt

    • status: unassigned --> duplicate
     
  • Anders Widell

    Anders Widell - 2014-01-09
    • Milestone: 4.4.FC --> never
     
