
#1529 Node rebooted as saImmOiInitialize_2 failed during middleware active assignment

Milestone: future
Status: unassigned
Owner: nobody
Labels: None
Type: defect
Component: unknown
Version: -
Priority: major
Updated: 2016-09-20
Created: 2015-10-08
Private: No

Setup:
Changeset-6901
Invoked continuous failovers on a 4-node cluster with 2 controllers and 2 payloads. All nodes are 64-bit.
2PBE enabled with 25K objects

Issue Observed:
Cluster reset occurred on invoking continuous failovers

Attachments:
Attaching syslogs for SC-1 and SC-2.
Traces for immnd and immd can be shared separately if required.

Steps:
  • Initially SC-1 is active and SC-2 is standby
  • A test script invoked a failover by killing osafclmd on SC-1
  • SC-2 became active

Oct 7 18:23:32 OSAF-SC1 root: killing osafclmd from invoke_failover.sh
Oct 7 19:25:20 OSAF-SC2 osafamfd[2191]: NO FAILOVER StandBy --> Active

  • On the new active controller, saImmOiInitialize_2 failed

Oct 7 19:25:22 OSAF-SC2 osafntfimcnd[2735]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 7 19:25:22 OSAF-SC2 osafntfimcnd[2735]: ER ntfimcn_imm_init() Fail
Oct 7 19:25:22 OSAF-SC2 osafimmnd[2131]: NO Implementer connected: 333 (safLckService) <299, 2020f>
Oct 7 19:25:22 OSAF-SC2 osafimmnd[2131]: NO Implementer connected: 334 (safEvtService) <298, 2020f>
Oct 7 19:25:23 OSAF-SC2 osafntfimcnd[2738]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 7 19:25:23 OSAF-SC2 osafntfimcnd[2738]: ER ntfimcn_imm_init() Fail
Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA MDS Send Failed
Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA Error code 2 returned for message type 4 - ignoring

  • Other services also failed to initialize with IMM on the new active controller, i.e. SC-2

  • Finally, SMF hit a CSI set callback timeout

  • SC-2 went for a reboot and hence the entire cluster reset, as SC-2 was the only active controller at the time

Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: NO 'safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: ER safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Oct 7 19:25:51 OSAF-SC2 opensaf_reboot: Rebooting local node; timeout=60

3 Attachments

Discussion

  • Chani Srivastava

    • Milestone: 4.7.RC1 --> future
     
  • Chani Srivastava

    • Milestone: future --> 4.5.2
     
  • Neelakanta Reddy

    • Component: imm --> unknown
     
  • Neelakanta Reddy

    This looks like an MDS issue; please provide IMMND traces and mds.log.

    Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA MDS Send Failed

     
  • Ritu Raj

    Ritu Raj - 2015-10-09
    • summary: Opensaf cluster went for reset wihle invoking failover --> Node rebooted as saImmOiInitialize_2 failed during middleware active assignment
    • Attachments has changed:

    Diff:

    --- old
    +++ new
    @@ -1,2 +1,3 @@
     SC1_syslog.txt (436.4 kB; text/plain)
     SC2_syslog.txt (425.6 kB; text/plain)
    +1529.tgz (586.3 kB; application/x-compressed-tar)
    
     
  • Ritu Raj

    Ritu Raj - 2015-10-09

    A similar issue was observed while invoking switchover:

    On the newly promoted controller SC-1, after some switchovers, IMM initialization failed with ERR_TIMEOUT and CLMD faulted due to avaDown.

    Oct 9 14:22:16 SOFO-64BIT-S1 osafntfimcnd[30122]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
    Oct 9 14:22:16 SOFO-64BIT-S1 osafntfimcnd[30122]: ER ntfimcn_imm_init() Fail
    Oct 9 14:22:17 SOFO-64BIT-S1 osaflogd[5406]: NO conf_runtime_obj_create: Cannot create config runtime object SA_AIS_ERR_TIMEOUT (5)
    Oct 9 14:22:17 SOFO-64BIT-S1 osafclmd[5431]: ER saImmOiClassImplementerSet failed for class SaClmCluster, rc = 5,
    Oct 9 14:22:17 SOFO-64BIT-S1 osafamfnd[5460]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'

    The following is the mds.log snippet from SC-1 at that time.

    Oct 9 14:22:16.232286 osafimmnd[5396] ERR |MDS_SND_RCV: Timeout or Error occured
    Oct 9 14:22:16.822751 osafntfimcnd[30122] ERR |MDS_SND_RCV: Timeout or Error occured
    Oct 9 14:22:16.822899 osafntfimcnd[30122] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
    Oct 9 14:22:17.213508 osafclmd[5431] ERR |MDS_SND_RCV: Timeout or Error occured
    Oct 9 14:22:17.213625 osafclmd[5431] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
    Oct 9 14:22:17.213871 osaflogd[5406] ERR |MDS_SND_RCV: Timeout or Error occured
    Oct 9 14:22:17.213949 osaflogd[5406] ERR |MDS_SND_RCV: Timeout occured on sndrsp message

    The quiesced (old active) controller was promoted back to active, and the rest of the cluster remained healthy.

     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> 4.6.2
     
  • Mathi Naickan

    Mathi Naickan - 2016-05-04
    • Milestone: 4.6.2 --> 4.7.2
     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
