
#708 Standby Controller failed to come up when failover happened in the middle of switchover

4.3.3
fixed
None
defect
amf
-
4.4.M0
major
2015-01-18
2014-01-03
No

The issue is observed on changeset 4733 plus the #220 patches corresponding to cs 4741 and cs 4742. The test setup is a 4-node SLES 64-bit VM cluster, single-PBE enabled and loaded with 25k objects.

The following steps were performed when the issue was observed:

1) Trigger a middleware switchover. Make sure the coordinator IMMND is on the Active controller.
2) Reboot the standby controller once the Active moves to quiesced.

Syslog on the controllers (SLOT2 (SC-2) was active at the beginning of the test):

Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Peer FM down on node_id: 131343
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: STANDBY, Node Down for node id: 2010f
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Failover occurred in the middle of switchover
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532113] TIPC: Resetting link <1.1.2:eth0-1.1.1:eth0>, peer not responding
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532122] TIPC: Lost link <1.1.2:eth0-1.1.1:eth0> on network plane A
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532130] TIPC: Lost contact with <1.1.1>
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Global discard node received for nodeId:2010f pid:2412
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 8 <0, 2010f(down)> (safAmfService)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 26 <0, 2010f(down)> (@safLogService)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 27 <0, 2010f(down)> (@OpenSafImmReplicatorB)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 9 <0, 2010f(down)> (MsgQueueService131343)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmd[2343]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jan 3 15:38:56 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafrded[2324]: NO rde_rde_set_role: role set to 1
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO FAILOVER StandBy --> Active
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: ACTIVE, Node Down for node id: 2010f
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 6 <18, 2020f> (@safAmfService2020f)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer connected: 28 (safAmfService) <18, 2020f>
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO Node 'SC-1' left the cluster
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO FAILOVER StandBy --> Active DONE!

SLOT1 (2010f) did not come up after the reboot. The node was rebooted again; the following logs are from that attempt.

Jan 3 16:11:44 SLES-64BIT-SLOT1 osafrded[2849]: NO rde@2020f has active state => Standby role
Jan 3 16:11:44 SLES-64BIT-SLOT1 osaffmd[2859]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE-> IMM_NODE_ISOLATED
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: Ruling epoch noted as:23
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO IMMND coord at 2020f
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: SaImmRepositoryInitModeT changed and noted as 'SA_IMM_KEEP_REPOSITORY'
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2020f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO IMMND coord at 2020f
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2030f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2040f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2261
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO RepositoryInitModeT is SA_IMM_KEEP_REPOSITORY
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Epoch set to 23 in ImmModel
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2010f old epoch: 0 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM SERVER READY
Jan 3 16:12:00 SLES-64BIT-SLOT1 osaflogd[2892]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osaflogd[2892]: NO log root directory is: /var/log/opensaf/saflog
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier) connected: 34 (@safLogService) <3, 2010f>
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafntfd[2905]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmd[2919]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmna[2929]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmna[2929]: NO safNode=SC-1,safCluster=myClmCluster Joined cluster, nodeid=2010f
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafamfd[2938]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier) connected: 35 (@safAmfService2010f) <10, 2010f>
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier) connected: 36 (@OpenSafImmReplicatorB) <4, 2010f>
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafntfimcnd[2913]: NO Started
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Started
Jan 3 16:12:03 SLES-64BIT-SLOT1 osafamfd[2938]: NO Cold sync complete!
Jan 3 16:12:06 SLES-64BIT-SLOT1 osafimmnd[2879]: NO PBE-OI established on other SC. Dumping incrementally to file imm.db
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: renewing lease of 192.168.56.101
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: leased 192.168.56.101 for 3600 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: no renewal time supplied, assuming 1800 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: no rebind time supplied, assuming 3150 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: adding IP address 192.168.56.101/24
Jan 3 16:18:05 SLES-64BIT-SLOT1 ifup: eth0 device: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
Jan 3 16:18:05 SLES-64BIT-SLOT1 SuSEfirewall2: SuSEfirewall2 not active
Jan 3 16:38:05 SLES-64BIT-SLOT1 -- MARK --
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Timed-out for response from AMFND
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Going for recovery
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfd[2938]: exiting on signal 15
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: ER AMF director unexpectedly crashed
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60

The following messages are observed in the AMFD traces on SLOT2:

Jan 3 16:43:54.288337 osafamfd [2428:ndfsm.cc:0058] TR invalid node ID (2010f)
Jan 3 16:43:54.288354 osafamfd [2428:ndfsm.cc:0162] << avd_node_up_evh
Jan 3 16:43:54.288369 osafamfd [2428:mbcsv_api.c:0773] >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
Jan 3 16:43:54.288386 osafamfd [2428:mbcsv_api.c:0803] TR svc_id:10, pwe_hdl:65537
Jan 3 16:43:54.288401 osafamfd [2428:mbcsv_util.c:0343] >> mbcsv_send_ckpt_data_to_all_peers
Jan 3 16:43:54.288456 osafamfd [2428:mbcsv_util.c:0387] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Jan 3 16:43:54.288486 osafamfd [2428:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 10, pwe_hdl: 65537
Jan 3 16:43:54.288503 osafamfd [2428:mbcsv_util.c:0399] TR calling encode callback
Jan 3 16:43:54.288520 osafamfd [2428:mbcsv_util.c:0438] TR send the encoded message to any other peer with same s/w version
Jan 3 16:43:54.288766 osafamfd [2428:mbcsv_util.c:0441] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Jan 3 16:43:54.288783 osafamfd [2428:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 10, pwe_hdl: 65537
Jan 3 16:43:54.288798 osafamfd [2428:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:1
Jan 3 16:43:54.288813 osafamfd [2428:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
Jan 3 16:43:54.289876 osafamfd [2428:mbcsv_mds.c:0244] << mbcsv_mds_send_msg: success
Jan 3 16:43:54.289923 osafamfd [2428:mbcsv_util.c:0492] << mbcsv_send_ckpt_data_to_all_peers
Jan 3 16:43:54.289941 osafamfd [2428:mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
Jan 3 16:43:54.289959 osafamfd [2428:imm.cc:0235] >> peek
Jan 3 16:43:54.289973 osafamfd [2428:imm.cc:0244] << peek
Jan 3 16:43:55.383747 osafamfd [2428:ndmsg.cc:0345] >> avd_n2d_msg_rcv
Jan 3 16:43:55.383915 osafamfd [2428:ndfsm.cc:0054] >> avd_node_up_evh: from 2010f
Jan 3 16:43:55.383945 osafamfd [2428:ndfsm.cc:0058] TR invalid node ID (2010f)


Related

Tickets: #1312
Tickets: #708
Wiki: ChangeLog-4.3.3
Wiki: ChangeLog-4.4.1

Discussion

  • Nagendra Kumar - 2014-06-16
    • status: unassigned --> assigned
    • assigned_to: Nagendra Kumar
    • Milestone: future --> 4.3.3
     
  • Nagendra Kumar - 2014-08-05

    The logs attached are not from "Jan 3".
    Analyzed based on the logs attached in #707, as a similar problem has been observed there in some instances:
    Jan 3 16:12:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Started
    Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Timed-out for response from AMFND
    Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: ER AMF director unexpectedly crashed

    Analysis:
    It happens because SC-1 makes safSi=SC-2N,safApp=OpenSAF active during the switchover and then SC-1 reboots. SC-2 gets the active role in avd_role_failover and calls avd_node_failover. avd_node_failover sees that safSu=SC-2,safSg=2N,safApp=OpenSAF is already active and therefore performs no action in node_fail().
    Hence avd_clm_track_start is never called; it was supposed to be called from avd_sg_2n_susi_sucss_sg_reln when safSu=SC-2,safSg=2N,safApp=OpenSAF goes from Standby to Active.
    Since avd_clm_track_start is not called on the newly active controller, CLM does not deliver the track callback clm_track_cb, and hence the newly active AMFD reports that the node is not valid.

    Fix: During failover, AMFD needs to check whether the middleware 2N SU hosted on this node is already active and, if so, call avd_clm_track_start(); otherwise not.
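
    In code form, the described check amounts to the fragment below. It mirrors the patch posted later in this thread; i_su is the middleware SU being processed in AMFD's node-down/failover path:

        /* node_fail() is a no-op here because the 2N SU on the surviving
           controller is already active (the SG stays stable), so the CLM
           track subscription has to be started explicitly. */
        if ((i_su->sg_of_su->sg_redundancy_model == SA_AMF_2N_REDUNDANCY_MODEL) &&
            (i_su->sg_of_su->sg_fsm_state == AVD_SG_FSM_STABLE))
                (void) avd_clm_track_start();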

     
  • Nagendra Kumar - 2014-08-05
    • status: assigned --> accepted
     
  • Nagendra Kumar - 2014-08-05

    There is one more problem: safSg=2N,safApp=OpenSAF remains in fsm state 2 (i.e. not stable), and when SC-1 joins after the fix explained above, no SI is assigned to safSu=SC-1,safSg=2N,safApp=OpenSAF.
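
    The follow-up commit below ("... mark sg stable after node failover") indicates how this is handled: the middleware 2N SG is put back into the stable state once the failed controller's assignments are removed, so that a rejoining SC-1 gets its standby assignment. A hedged sketch of that intent, using only names that appear in the diff further down (the direct field assignment is an illustrative simplification, not the pushed code):

        /* Illustrative only: after the failed SC's SU-SI assignments are
           deleted, the 2N SG must end up stable again; otherwise the
           rejoining SC-1 is never given the standby assignment. */
        if (i_su->sg_of_su->sg_fsm_state != AVD_SG_FSM_STABLE)
                i_su->sg_of_su->sg_fsm_state = AVD_SG_FSM_STABLE;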

     

    Last edit: Nagendra Kumar 2014-08-05
  • Nagendra Kumar - 2014-08-06
    • status: accepted --> review
     
  • Nagendra Kumar - 2014-08-12

    changeset: 5560:0e6771c09786
    tag: tip
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue Aug 12 20:32:48 2014 +0530
    summary: amfd: start clm track and mark sg stable after node failover [#708]

    [staging:0e6771]

     

    Related

    Tickets: #708
    Commit: [0e6771]

  • Nagendra Kumar - 2014-08-20
    • status: review --> fixed
     
  • Nagendra Kumar - 2014-08-20

    changeset: 5622:91cc3199615f
    branch: opensaf-4.3.x
    parent: 5620:3069c1092cc8
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Wed Aug 20 16:06:03 2014 +0530
    summary: amfd: start clm track and mark sg stable after node failover [#708]

    changeset: 5623:b0fd1a2c4b8a
    branch: opensaf-4.4.x
    tag: tip
    parent: 5621:34a06cd5d60a
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Wed Aug 20 16:06:15 2014 +0530
    summary: amfd: start clm track and mark sg stable after node failover [#708]

    [staging:91cc31]
    [staging:b0fd1a]

     

    Related

    Tickets: #708
    Commit: [91cc31]
    Commit: [b0fd1a]

  • Sirisha Alla - 2014-09-08

    This issue is still seen with changeset 5697.

    Sep 5 17:52:34 SLES-64BIT-SLOT1 osafimmnd[16196]: NO Ccb 160 ABORTED (om_ccb_exec)
    Sep 5 17:53:18 SLES-64BIT-SLOT1 osafamfnd[16273]: ER AMF director heart beat timeout, generating core for amfd
    Sep 5 17:53:19 SLES-64BIT-SLOT1 osafamfnd[16273]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131343, SupervisionTime = 60
    Sep 5 17:53:19 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
    Sep 5 17:53:22 SLES-64BIT-SLOT1 kernel: [22239.902876] md: stopping all md devices.
    Sep 5 17:53:22 SLES-64BIT-SLOT1 kernel: [22240.902798] sd 0:0:0:0: [sda] Synchronizing SCSI cache
    Sep 5 17:53:22 SLES-64BIT-SLOT1 kernel: [22240.932448] ohci_hcd 0000:00:06.0: PCI INT A disabled

    Sep 5 18:26:07 SLES-64BIT-SLOT1 opensafd[2344]: ER Timed-out for response from AMFND
    Sep 5 18:26:07 SLES-64BIT-SLOT1 opensafd[2344]: ER
    Sep 5 18:26:07 SLES-64BIT-SLOT1 opensafd[2344]: ER Going for recovery
    Sep 5 18:26:07 SLES-64BIT-SLOT1 osafamfd[2461]: exiting for shutdown
    Sep 5 18:26:07 SLES-64BIT-SLOT1 osafamfnd[2471]: ER AMF director unexpectedly crashed
    Sep 5 18:26:07 SLES-64BIT-SLOT1 osafamfnd[2471]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60

     
  • Sirisha Alla - 2014-09-08
    • status: fixed --> assigned
     
  • Nagendra Kumar - 2014-09-08

    The following portion of the floated patch was missed in the push; I will push it now:

    diff --git a/osaf/services/saf/amf/amfd/sgproc.cc b/osaf/services/saf/amf/amfd/sgproc.cc
    --- a/osaf/services/saf/amf/amfd/sgproc.cc
    +++ b/osaf/services/saf/amf/amfd/sgproc.cc
    @@ -1565,6 +1565,11 @@ void avd_node_down_mw_susi_failover(AVD_
                     * accordingly.
                     */
                    i_su->sg_of_su->node_fail(cb, i_su);
    +               /* Mw 2N SU in stable state, that means clm track will not start
    +                  in avd_sg_2n_susi_sucss_sg_reln, so start here. */
    +               if ((i_su->sg_of_su->sg_redundancy_model == SA_AMF_2N_REDUNDANCY_MODEL) &&
    +                   (i_su->sg_of_su->sg_fsm_state == AVD_SG_FSM_STABLE))
    +                       (void) avd_clm_track_start();

                    /* Free all the SU SI assignments */
                    i_su->delete_all_susis();
    
     
  • Nagendra Kumar - 2014-09-08
    • status: assigned --> fixed
     
