#682 LOG: New Active reboots when coordinator IMMND is killed in the middle of switchover

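Milestone: future
Status: unassigned
Owner: nobody
Labels: None
Type: defect
Component: log
Part: d
Version: 4.4.M0
Severity: major
Updated: 2016-09-20
Created: 2013-12-20
Private: No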

The issue is observed on changeset 4733 plus the #220 patches corresponding to cs 4741 and cs 4742. The test setup is a 4-node cluster of 64-bit SLES VMs. A single PBE is enabled and the setup is loaded with 25k objects.

SC-2 (SLES-64BIT-SLOT2) is Active and the IMMND coordinator is hosted on SC-1 (SLES-64BIT-SLOT1). A controller switchover is initiated and IMMND is killed on SC-1 while the switchover is in progress. SC-1 then reboots because the CSI set callback of logd times out.

Excerpts from /var/log/messages on SC-2 and SC-1 corresponding to the steps above:

SC-2:

Dec 19 17:21:36 SLES-64BIT-SLOT2 osafamfd[3609]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafamfnd[3619]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO Implementer disconnected 18 <320, 2020f> (safMsgGrpService)
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO implementer for class 'SaSmfCampaign' is released => class extent is UNSAFE
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO Implementer disconnected 22 <319, 2020f> (safEvtService)
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO Implementer disconnected 23 <3, 2020f> (safLogService)
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO implementer for class 'OpenSafSmfConfig' is released => class extent is UNSAFE
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO implementer for class 'SaSmfSwBundle' is released => class extent is UNSAFE
Dec 19 17:21:36 SLES-64BIT-SLOT2 osafimmnd[3554]: NO Implementer disconnected 24 <298, 2020f> (safSmfService)
Dec 19 17:21:37 SLES-64BIT-SLOT2 osafimmnd[3554]: NO Implementer disconnected 20 <303, 2020f> (safLckService)

SC-1:

Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 18 <0, 2020f> (safMsgGrpService)
Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO implementer for class 'SaSmfCampaign' is released => class extent is UNSAFE
Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 22 <0, 2020f> (safEvtService)
Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 23 <0, 2020f> (safLogService)
Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO implementer for class 'OpenSafSmfConfig' is released => class extent is UNSAFE
Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO implementer for class 'SaSmfSwBundle' is released => class extent is UNSAFE
Dec 19 17:21:38 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 24 <0, 2020f> (safSmfService)
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 20 <0, 2020f> (safLckService)
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 19 <0, 2020f> (safCheckPointService)
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafimmnd[3498]: NO Implementer disconnected 21 <0, 2020f> (safClmService)
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafimmpbed: WA PBE lost contact with parent IMMND - Exiting
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafamfnd[3578]: NO 'safComp=IMMND,safSu=SC-1,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafntfimcnd[3829]: ER saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafamfd[3565]: NO Re-initializing with IMM
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafimmd[3488]: NO IMMND coord at 2020f
......

Dec 19 17:21:49 SLES-64BIT-SLOT1 osafimmnd[3953]: NO Implementer connected: 40 (OpenSafImmPBE) <0, 2020f>
Dec 19 17:21:49 SLES-64BIT-SLOT1 osafamfd[3565]: NO Finished re-initializing with IMM
Dec 19 17:21:50 SLES-64BIT-SLOT1 osafimmnd[3953]: NO PBE-OI established on other SC. Dumping incrementally to file imm.db
Dec 19 17:23:40 SLES-64BIT-SLOT1 osafamfnd[3578]: NO 'safComp=LOG,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Dec 19 17:23:40 SLES-64BIT-SLOT1 osafamfnd[3578]: ER safComp=LOG,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Dec 19 17:23:40 SLES-64BIT-SLOT1 osafamfnd[3578]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Dec 19 17:23:40 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
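
To make the timing in the SC-1 syslog easier to read, the AMF fault lines can be extracted and compared. The following is a minimal sketch, not part of OpenSAF: the regular expression, the function names `parse_faults` and `node_id_hex`, and the year 2013 (syslog timestamps carry no year; it is taken from the ticket creation date) are all assumptions made for illustration. The sample lines are copied from the syslog above.

```python
import re
from datetime import datetime

# Two osafamfnd fault lines copied from the SC-1 syslog in this ticket.
SYSLOG = """\
Dec 19 17:21:39 SLES-64BIT-SLOT1 osafamfnd[3578]: NO 'safComp=IMMND,safSu=SC-1,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Dec 19 17:23:40 SLES-64BIT-SLOT1 osafamfnd[3578]: NO 'safComp=LOG,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
"""

# Hypothetical pattern for the quoted "faulted due to" form shown above.
FAULT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ osafamfnd\[\d+\]: NO "
    r"'(?P<comp>[^']+)' faulted due to '(?P<cause>[^']+)' : "
    r"Recovery is '(?P<recovery>[^']+)'"
)


def parse_faults(text, year=2013):
    """Return a list of (timestamp, component DN, cause, recovery) tuples."""
    faults = []
    for line in text.splitlines():
        m = FAULT_RE.match(line)
        if m:
            # The year is an assumption; syslog lines do not include it.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            faults.append((ts, m["comp"], m["cause"], m["recovery"]))
    return faults


def node_id_hex(node_id):
    """Render a decimal NodeId in the hex form used by the IMMND messages."""
    return format(node_id, "x")


faults = parse_faults(SYSLOG)
elapsed = (faults[1][0] - faults[0][0]).total_seconds()
print(f"IMMND fault -> LOG fault: {elapsed:.0f} s")
print(f"NodeId 131343 in hex: {node_id_hex(131343)}")
```

On the sample lines this gives 121 seconds between the IMMND `avaDown` fault and the LOG `csiSetcallbackTimeout` fault. The rebooted node's NodeId 131343 is 2010f in hex, while the IMMND messages refer to 2020f; reading 2020f as the other controller is an inference from these logs, not something the ticket states.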

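The LOGD trace contains no entries around the time of the failure. The last entry before the reboot is the return from imm_impl_set at 17:21:48; the next entry is log_initialize at 17:24:31, written by a new logd process (PID 2427 instead of 3518), apparently after the node came back up.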

Dec 19 17:21:48.406386 osaflogd [3518:imma_oi_api.c:0445] >> saImmOiDispatch
Dec 19 17:21:48.406468 osaflogd [3518:imma_oi_api.c:0572] << saImmOiDispatch
Dec 19 17:21:48.406494 osaflogd [3518:lgs_main.c:0373] << imm_reinit_thread
Dec 19 17:21:48.406619 osaflogd [3518:lgs_imm.c:2134] >> imm_impl_set
Dec 19 17:21:48.417979 osaflogd [3518:lgs_imm.c:2182] << imm_impl_set
Dec 19 17:24:31.724994 osaflogd [2427:lgs_main.c:0213] >> log_initialize
Dec 19 17:24:32.311734 osaflogd [2427:lgs_file.c:0262] >> lgs_file_init
Dec 19 17:24:32.311823 osaflogd [2427:lgs_imm.c:1579] >> read_logsv_config_obj: (logConfig=1,safApp=safLogService)
Dec 19 17:24:32.311872 osaflogd [2427:imma_om_api.c:0140] >> saImmOmInitialize
Dec 19 17:24:32.311914 osaflogd [2427:imma_om_api.c:0167] TR OM client version A.2.11 or higher
Dec 19 17:24:32.311936 osaflogd [2427:imma_om_api.c:0192] >> initialize_common
Dec 19 17:24:32.311957 osaflogd [2427:imma_init.c:0261] >> imma_startup: use count 0
Dec 19 17:24:32.311985 osaflogd [2427:ncs_main_pub.c:0223] TR
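
The silent period in the logd trace can be located mechanically by looking for the widest gap between consecutive trace timestamps. The sketch below is illustrative only: the regular expression, the helper names `parse_trace` and `largest_gap`, and the year 2013 are assumptions, and the sample lines are copied from the trace above.

```python
import re
from datetime import datetime

# Four entries copied from the logd trace in this ticket.
TRACE = """\
Dec 19 17:21:48.406619 osaflogd [3518:lgs_imm.c:2134] >> imm_impl_set
Dec 19 17:21:48.417979 osaflogd [3518:lgs_imm.c:2182] << imm_impl_set
Dec 19 17:24:31.724994 osaflogd [2427:lgs_main.c:0213] >> log_initialize
Dec 19 17:24:32.311734 osaflogd [2427:lgs_file.c:0262] >> lgs_file_init
"""

# Hypothetical pattern: timestamp with microseconds, then "[pid:file:line]".
TRACE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d\.\d{6}) osaflogd \[(\d+):"
)


def parse_trace(text, year=2013):
    """Return a list of (timestamp, pid, raw line) tuples."""
    entries = []
    for line in text.splitlines():
        m = TRACE_RE.match(line)
        if m:
            # The year is an assumption; the trace lines do not include it.
            ts = datetime.strptime(
                f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S.%f"
            )
            entries.append((ts, int(m.group(2)), line))
    return entries


def largest_gap(entries):
    """Return (gap in seconds, entry before, entry after) for the widest gap."""
    pairs = zip(entries, entries[1:])
    before, after = max(pairs, key=lambda p: p[1][0] - p[0][0])
    return (after[0] - before[0]).total_seconds(), before, after


entries = parse_trace(TRACE)
gap, before, after = largest_gap(entries)
print(f"Largest gap: {gap:.1f} s, PID {before[1]} -> PID {after[1]}")
print("Last entry before gap:", before[2])
```

On the sample lines the widest gap is a little over 163 seconds and spans the change of PID from 3518 to 2427, which matches the observation that logd wrote nothing between the return from imm_impl_set and its restart.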

The switchover operation timed out. The issue is reproducible. Attached are the syslogs and logd trace from SC-1 and the IMMND traces from both controllers.

2 Attachments

Related

Tickets: #934

Discussion

  • Hrishikesh

    Hrishikesh - 2014-05-08

    The same issue is observed on changeset 5142 with the same scenario:

    1. SC-2 is Active; SC-1 is Standby and hosts the IMMND coordinator.
    2. Kill IMMND on SC-1 (the new Active) during the controller switchover.

    csiSetcallbackTimeout was observed for the LOG service and nodeFailfast was triggered. Attaching the logs.

    Syslog snippet:

    May 8 16:14:29 SLES1 osafamfnd[7166]: NO 'safComp=LOG,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
    May 8 16:14:29 SLES1 osafamfnd[7166]: ER safComp=LOG,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
    May 8 16:14:29 SLES1 osafamfnd[7166]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
    May 8 16:14:29 SLES1 opensaf_reboot: Rebooting local node; timeout=60
    May 8 16:14:32 SLES1 kernel: [11709.977659] md: stopping all md devices.
    May 8 16:14:32 SLES1 kernel: [11710.979718] sd 0:0:0:0: [sda] Synchronizing SCSI cache
    ========================

     
  • Hrishikesh

    Hrishikesh - 2014-05-08
     
  • Anders Bjornerstedt

    • Milestone: future --> 4.5.2
     
  • elunlen

    elunlen - 2015-08-03
    • summary: New Active reboots when coordinator IMMND is killed in the middle of switchover --> LOG: New Active reboots when coordinator IMMND is killed in the middle of switchover
     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> 4.6.2
     
  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 4.6.2 --> 4.7.2
     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
