While testing SC failover, I encountered a case where the CLMD at the old
standby (the new active) becomes active before the IMM has had time to
discard the old implementer for CLM. The new active CLMD tries to set the
implementer and gets ERR_EXIST. This causes CLMD to crash, which escalates
to an SC restart, which here escalates to a cluster reload.
Oct 21 08:31:31 SC-1 user.notice opensafd: OpenSAF(4.4.M0) services successfully started
Oct 21 08:32:20 SC-1 local0.notice osaffmd[377]: NO Role: STANDBY, Node Down for node id: 2020f
Oct 21 08:32:20 SC-1 local0.crit osaffmd[377]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Oct 21 08:32:20 SC-1 user.notice kernel: TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, requested by peer while probing
Oct 21 08:32:20 SC-1 user.notice kernel: TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 21 08:32:20 SC-1 user.notice kernel: TIPC: Lost contact with <1.1.2>
Oct 21 08:32:20 SC-1 user.notice opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Oct 21 08:32:20 SC-1 local0.notice osafrded[368]: NO rde_rde_set_role: role set to 1
Oct 21 08:32:20 SC-1 local0.notice osaflogd[406]: NO ACTIVE request
Oct 21 08:32:20 SC-1 local0.notice osafntfd[419]: NO ACTIVE request
Oct 21 08:32:20 SC-1 local0.notice osafclmd[429]: NO ACTIVE request
Oct 21 08:32:20 SC-1 local0.notice osafamfd[448]: NO FAILOVER StandBy --> Active
Oct 21 08:32:20 SC-1 local0.err osafclmd[429]: ER saImmOiImplementerSet failed rc:14, exiting
Oct 21 08:32:21 SC-1 syslog.info syslogd started: BusyBox v1.19.4
Oct 21 08:32:22 SC-1 user.notice opensafd: Starting OpenSAF Services
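For reference, rc 14 is SA_AIS_ERR_EXIST in the SaAisErrorT enumeration:
the IMM still holds the implementer registered by the old active. The exit
in the log comes from an error path of roughly this shape (a hypothetical
sketch, not the actual CLMD source; the function name and the implementer
name "safClmService" are assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <saImmOi.h>

    /* Hypothetical sketch of the fatal path, not the actual CLMD source.
     * At failover the IMM may not yet have discarded the old active's
     * implementer, so saImmOiImplementerSet() returns SA_AIS_ERR_EXIST (14). */
    static void clms_become_implementer(SaImmOiHandleT imm_oi_hdl)
    {
            SaAisErrorT rc = saImmOiImplementerSet(
                imm_oi_hdl, (SaImmOiImplementerNameT)"safClmService");
            if (rc != SA_AIS_OK) {
                    /* Treating ERR_EXIST as fatal produces the "exiting" in
                     * the syslog above, which then escalates to SC restart
                     * and, in this case, cluster reload. */
                    fprintf(stderr,
                            "ER saImmOiImplementerSet failed rc:%u, exiting\n",
                            (unsigned)rc);
                    exit(EXIT_FAILURE);
            }
    }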
I think this is the same as http://sourceforge.net/p/opensaf/tickets/528/
Possibly so, but:
(a) The "Service backlog" reports in the new ticket system are poor. They
exclude assigned tickets, which is really counterproductive since tickets
that are assigned are still backlogged. Hence I missed detecting #528.
(b) Ticket #528 should be severity critical. It causes a cluster restart.
(c) The problem here is not start order, it is timing. There is no guarantee
that the IMM has had time to clear all implementers from the cluster at a
failover before services start becoming active on the new active node. In
particular, the first services to start have to tolerate getting ERR_EXIST
on implementer set, at least for a few retries, in the context of a failover
(see the retry sketch below). This holds for ALL non-restartable OpenSAF
services residing on an SC.
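A minimal sketch of such tolerance, assuming the service keeps its own OI
handle and implementer name (the retry count and sleep interval are
illustrative values, not existing OpenSAF code): treat SA_AIS_ERR_EXIST like
SA_AIS_ERR_TRY_AGAIN for a bounded number of attempts, giving the IMM time
to discard the stale implementer:

    #include <time.h>
    #include <saImmOi.h>

    /* Hedged sketch: retry implementer-set at failover, tolerating the
     * window in which the IMM has not yet discarded the old implementer.
     * The bounds below (20 tries, 500 ms apart) are illustrative only. */
    static SaAisErrorT implementer_set_with_retry(SaImmOiHandleT imm_oi_hdl,
                                                  SaImmOiImplementerNameT name)
    {
            const struct timespec delay = {0, 500000000}; /* 500 ms */
            SaAisErrorT rc = SA_AIS_ERR_EXIST;

            for (int tries = 0; tries < 20; tries++) {
                    rc = saImmOiImplementerSet(imm_oi_hdl, name);
                    /* ERR_EXIST here usually means the old active's
                     * implementer is still registered; ERR_TRY_AGAIN is the
                     * normal transient case. Both are worth retrying during
                     * failover. */
                    if (rc != SA_AIS_ERR_EXIST && rc != SA_AIS_ERR_TRY_AGAIN)
                            break;
                    nanosleep(&delay, NULL);
            }
            return rc; /* caller decides whether a final failure is fatal */
    }

With a bounded retry like this, the new active only gives up if the stale
implementer is still registered after the whole window, which should be
ample time for the IMM to finish the discard at failover.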
Duplicate of #528