
#708 Standby Controller failed to come up when failover happened in the middle of switchover

4.3.3
fixed
None
defect
amf
-
4.4.M0
major
2015-01-18
2014-01-03
No

The issue is observed on changeset 4733 plus the #220 patches corresponding to cs 4741 and cs 4742. The test setup is a 4-node SLES 64-bit VM cluster, single-PBE enabled and loaded with 25k objects.

The following steps were performed when the issue was observed:

1) Trigger a middleware switchover. Make sure the coordinator IMMND is on the Active controller.
2) Reboot the standby controller once the Active moves to quiesced.

Syslog on the controllers (SLOT2 (SC-2) was active at the beginning of the test):

Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Peer FM down on node_id: 131343
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: STANDBY, Node Down for node id: 2010f
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Failover occurred in the middle of switchover
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532113] TIPC: Resetting link <1.1.2:eth0-1.1.1:eth0>, peer not responding
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532122] TIPC: Lost link <1.1.2:eth0-1.1.1:eth0> on network plane A
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532130] TIPC: Lost contact with <1.1.1>
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Global discard node received for nodeId:2010f pid:2412
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 8 <0, 2010f(down)> (safAmfService)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 26 <0, 2010f(down)> (@safLogService)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 27 <0, 2010f(down)> (@OpenSafImmReplicatorB)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 9 <0, 2010f(down)> (MsgQueueService131343)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmd[2343]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jan 3 15:38:56 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafrded[2324]: NO rde_rde_set_role: role set to 1
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO FAILOVER StandBy --> Active
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: ACTIVE, Node Down for node id: 2010f
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 6 <18, 2020f> (@safAmfService2020f)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer connected: 28 (safAmfService) <18, 2020f>
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO Node 'SC-1' left the cluster
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO FAILOVER StandBy --> Active DONE!

SLOT1 (2010f) did not come up after the reboot. The node was rebooted again; the following logs are from that attempt.

Jan 3 16:11:44 SLES-64BIT-SLOT1 osafrded[2849]: NO rde@2020f has active state => Standby role
Jan 3 16:11:44 SLES-64BIT-SLOT1 osaffmd[2859]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE-> IMM_NODE_ISOLATED
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: Ruling epoch noted as:23
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO IMMND coord at 2020f
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: SaImmRepositoryInitModeT changed and noted as 'SA_IMM_KEEP_REPOSITORY'
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2020f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO IMMND coord at 2020f
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2030f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2040f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2261
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO RepositoryInitModeT is SA_IMM_KEEP_REPOSITORY
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Epoch set to 23 in ImmModel
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND process at node 2010f old epoch: 0 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM SERVER READY
Jan 3 16:12:00 SLES-64BIT-SLOT1 osaflogd[2892]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osaflogd[2892]: NO log root directory is: /var/log/opensaf/saflog
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier) connected: 34 (@safLogService) <3, 2010f>
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafntfd[2905]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmd[2919]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmna[2929]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmna[2929]: NO safNode=SC-1,safCluster=myClmCluster Joined cluster, nodeid=2010f
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafamfd[2938]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier) connected: 35 (@safAmfService2010f) <10, 2010f>
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier) connected: 36 (@OpenSafImmReplicatorB) <4, 2010f>
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafntfimcnd[2913]: NO Started
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Started
Jan 3 16:12:03 SLES-64BIT-SLOT1 osafamfd[2938]: NO Cold sync complete!
Jan 3 16:12:06 SLES-64BIT-SLOT1 osafimmnd[2879]: NO PBE-OI established on other SC. Dumping incrementally to file imm.db
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: renewing lease of 192.168.56.101
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: leased 192.168.56.101 for 3600 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: no renewal time supplied, assuming 1800 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: no rebind time supplied, assuming 3150 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: adding IP address 192.168.56.101/24
Jan 3 16:18:05 SLES-64BIT-SLOT1 ifup: eth0 device: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
Jan 3 16:18:05 SLES-64BIT-SLOT1 SuSEfirewall2: SuSEfirewall2 not active
Jan 3 16:38:05 SLES-64BIT-SLOT1 -- MARK --
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Timed-out for response from AMFND
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Going for recovery
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfd[2938]: exiting on signal 15
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: ER AMF director unexpectedly crashed
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60

The following messages are observed in the AMFD traces on SLOT2:

Jan 3 16:43:54.288337 osafamfd [2428:ndfsm.cc:0058] TR invalid node ID (2010f)
Jan 3 16:43:54.288354 osafamfd [2428:ndfsm.cc:0162] << avd_node_up_evh
Jan 3 16:43:54.288369 osafamfd [2428:mbcsv_api.c:0773] >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
Jan 3 16:43:54.288386 osafamfd [2428:mbcsv_api.c:0803] TR svc_id:10, pwe_hdl:65537
Jan 3 16:43:54.288401 osafamfd [2428:mbcsv_util.c:0343] >> mbcsv_send_ckpt_data_to_all_peers
Jan 3 16:43:54.288456 osafamfd [2428:mbcsv_util.c:0387] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Jan 3 16:43:54.288486 osafamfd [2428:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 10, pwe_hdl: 65537
Jan 3 16:43:54.288503 osafamfd [2428:mbcsv_util.c:0399] TR calling encode callback
Jan 3 16:43:54.288520 osafamfd [2428:mbcsv_util.c:0438] TR send the encoded message to any other peer with same s/w version
Jan 3 16:43:54.288766 osafamfd [2428:mbcsv_util.c:0441] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Jan 3 16:43:54.288783 osafamfd [2428:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 10, pwe_hdl: 65537
Jan 3 16:43:54.288798 osafamfd [2428:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:1
Jan 3 16:43:54.288813 osafamfd [2428:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
Jan 3 16:43:54.289876 osafamfd [2428:mbcsv_mds.c:0244] << mbcsv_mds_send_msg: success
Jan 3 16:43:54.289923 osafamfd [2428:mbcsv_util.c:0492] << mbcsv_send_ckpt_data_to_all_peers
Jan 3 16:43:54.289941 osafamfd [2428:mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
Jan 3 16:43:54.289959 osafamfd [2428:imm.cc:0235] >> peek
Jan 3 16:43:54.289973 osafamfd [2428:imm.cc:0244] << peek
Jan 3 16:43:55.383747 osafamfd [2428:ndmsg.cc:0345] >> avd_n2d_msg_rcv
Jan 3 16:43:55.383915 osafamfd [2428:ndfsm.cc:0054] >> avd_node_up_evh: from 2010f
Jan 3 16:43:55.383945 osafamfd [2428:ndfsm.cc:0058] TR invalid node ID (2010f)


Related

Tickets: #1312
Tickets: #708
Wiki: ChangeLog-4.3.3
Wiki: ChangeLog-4.4.1

Discussion

  • Nagendra Kumar - 2014-06-16
    • status: unassigned --> assigned
    • assigned_to: Nagendra Kumar
    • Milestone: future --> 4.3.3
     
  • Nagendra Kumar - 2014-08-05

    The logs attached are not from "Jan 3".
    Analyzed based on the logs attached in #707, as a similar problem has been observed there in some instances:
    Jan 3 16:12:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Started
    Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Timed-out for response from AMFND
    Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: ER AMF director unexpectedly crashed

    Analysis:
    It happens because SC-1 makes safSi=SC-2N,safApp=OpenSAF active during the switchover and then SC-1 reboots. SC-2 gets the active role in avd_role_failover and calls avd_node_failover. avd_node_failover sees that safSu=SC-2,safSg=2N,safApp=OpenSAF is already active and therefore performs no action in node_fail().
    Hence avd_clm_track_start is never called; it was supposed to be called from avd_sg_2n_susi_sucss_sg_reln when safSu=SC-2,safSg=2N,safApp=OpenSAF goes from Standby to Active.
    Since avd_clm_track_start is not called on the newly active controller, CLM does not deliver the track callback clm_track_cb, and hence the newly active AMFD reports that the node is not valid.

    Fix: During failover, AMFD needs to check whether the middleware 2N SU hosted on this node is already active and, if so, call avd_clm_track_start(); otherwise not.
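
    In code form, the described check amounts to the fragment below. It mirrors the patch posted later in this thread; i_su is the middleware SU being processed in AMFD's node-down/failover path:

        /* node_fail() is a no-op here because the 2N SU on the surviving
           controller is already active (the SG stays stable), so the CLM
           track subscription has to be started explicitly. */
        if ((i_su->sg_of_su->sg_redundancy_model == SA_AMF_2N_REDUNDANCY_MODEL) &&
            (i_su->sg_of_su->sg_fsm_state == AVD_SG_FSM_STABLE))
                (void) avd_clm_track_start();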

     
  • Nagendra Kumar - 2014-08-05
    • status: assigned --> accepted
     
  • Nagendra Kumar - 2014-08-05

    There is one more problem: safSg=2N,safApp=OpenSAF remains in fsm state 2 (i.e. not stable), and when SC-1 joins after the fix explained above, no SI is assigned to safSu=SC-1,safSg=2N,safApp=OpenSAF.
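
    The follow-up commit below ("... mark sg stable after node failover") indicates how this is handled: the middleware 2N SG is put back into the stable state once the failed controller's assignments are removed, so that a rejoining SC-1 gets its standby assignment. A hedged sketch of that intent, using only names that appear in the diff further down (the direct field assignment is an illustrative simplification, not the pushed code):

        /* Illustrative only: after the failed SC's SU-SI assignments are
           deleted, the 2N SG must end up stable again; otherwise the
           rejoining SC-1 is never given the standby assignment. */
        if (i_su->sg_of_su->sg_fsm_state != AVD_SG_FSM_STABLE)
                i_su->sg_of_su->sg_fsm_state = AVD_SG_FSM_STABLE;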

     

    Last edit: Nagendra Kumar 2014-08-05
  • Nagendra Kumar - 2014-08-06
    • status: accepted --> review
     
  • Nagendra Kumar - 2014-08-12

    changeset: 5560:0e6771c09786
    tag: tip
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Tue Aug 12 20:32:48 2014 +0530
    summary: amfd: start clm track and mark sg stable after node failover [#708]

    [staging:0e6771]

     

    Related

    Tickets: #708
    Commit: [0e6771]

  • Nagendra Kumar - 2014-08-20
    • status: review --> fixed
     
  • Nagendra Kumar - 2014-08-20

    changeset: 5622:91cc3199615f
    branch: opensaf-4.3.x
    parent: 5620:3069c1092cc8
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Wed Aug 20 16:06:03 2014 +0530
    summary: amfd: start clm track and mark sg stable after node failover [#708]

    changeset: 5623:b0fd1a2c4b8a
    branch: opensaf-4.4.x
    tag: tip
    parent: 5621:34a06cd5d60a
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Wed Aug 20 16:06:15 2014 +0530
    summary: amfd: start clm track and mark sg stable after node failover [#708]

    [staging:91cc31]
    [staging:b0fd1a]

     

    Related

    Tickets: #708
    Commit: [91cc31]
    Commit: [b0fd1a]

  • Sirisha Alla - 2014-09-08

    This issue is still seen with changeset 5697.

    Sep 5 17:52:34 SLES-64BIT-SLOT1 osafimmnd[16196]: NO Ccb 160 ABORTED (om_ccb_exec)
    Sep 5 17:53:18 SLES-64BIT-SLOT1 osafamfnd[16273]: ER AMF director heart beat timeout, generating core for amfd
    Sep 5 17:53:19 SLES-64BIT-SLOT1 osafamfnd[16273]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131343, SupervisionTime = 60
    Sep 5 17:53:19 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
    Sep 5 17:53:22 SLES-64BIT-SLOT1 kernel: [22239.902876] md: stopping all md devices.
    Sep 5 17:53:22 SLES-64BIT-SLOT1 kernel: [22240.902798] sd 0:0:0:0: [sda] Synchronizing SCSI cache
    Sep 5 17:53:22 SLES-64BIT-SLOT1 kernel: [22240.932448] ohci_hcd 0000:00:06.0: PCI INT A disabled

    Sep 5 18:26:07 SLES-64BIT-SLOT1 opensafd[2344]: ER Timed-out for response from AMFND
    Sep 5 18:26:07 SLES-64BIT-SLOT1 opensafd[2344]: ER
    Sep 5 18:26:07 SLES-64BIT-SLOT1 opensafd[2344]: ER Going for recovery
    Sep 5 18:26:07 SLES-64BIT-SLOT1 osafamfd[2461]: exiting for shutdown
    Sep 5 18:26:07 SLES-64BIT-SLOT1 osafamfnd[2471]: ER AMF director unexpectedly crashed
    Sep 5 18:26:07 SLES-64BIT-SLOT1 osafamfnd[2471]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60

     
  • Sirisha Alla - 2014-09-08
    • status: fixed --> assigned
     
  • Nagendra Kumar - 2014-09-08

    The following portion of the floated patch was missed in the push; I will push it now:

    diff --git a/osaf/services/saf/amf/amfd/sgproc.cc b/osaf/services/saf/amf/amfd/sgproc.cc
    --- a/osaf/services/saf/amf/amfd/sgproc.cc
    +++ b/osaf/services/saf/amf/amfd/sgproc.cc
    @@ -1565,6 +1565,11 @@ void avd_node_down_mw_susi_failover(AVD_
                     * accordingly.
                     */
                    i_su->sg_of_su->node_fail(cb, i_su);
    +               /* Mw 2N SU in stable state, that means clm track will not start
    +                  in avd_sg_2n_susi_sucss_sg_reln, so start here. */
    +               if ((i_su->sg_of_su->sg_redundancy_model == SA_AMF_2N_REDUNDANCY_MODEL) &&
    +                   (i_su->sg_of_su->sg_fsm_state == AVD_SG_FSM_STABLE))
    +                       (void) avd_clm_track_start();

                    /* Free all the SU SI assignments */
                    i_su->delete_all_susis();
    
     
  • Nagendra Kumar - 2014-09-08
    • status: assigned --> fixed
     
