#1529 Node rebooted as saImmOiInitialize_2 failed during middleware active assignment

Milestone: future

Status: unassigned

Owner: nobody

Labels: None

Type: defect

Component: unknown

Part: -

Version:

Priority: major

Blocker:

Updated: 2016-09-20

Created: 2015-10-08

Creator: Chani Srivastava

Private: No

Setup:
Changeset-6901
Continuous failovers were invoked on a 4-node cluster with 2 controllers and 2 payloads. All nodes are 64-bit.
2PBE enabled with 25K objects

Issue Observed:
The cluster reset while continuous failovers were being invoked.

Attachments:
Attaching syslogs for SC-1 and SC-2.
Traces for immnd and immd can be shared separately if required.

Steps:
Initially SC-1 is active and SC-2 is standby.
A test script invoked a failover by killing osafclmd on SC-1.
* SC-2 became active

Oct 7 18:23:32 OSAF-SC1 root: killing osafclmd from invoke_failover.sh
Oct 7 19:25:20 OSAF-SC2 osafamfd[2191]: NO FAILOVER StandBy --> Active

On the new active controller, saImmOiInitialize_2 failed:

Oct 7 19:25:22 OSAF-SC2 osafntfimcnd[2735]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 7 19:25:22 OSAF-SC2 osafntfimcnd[2735]: ER ntfimcn_imm_init() Fail
Oct 7 19:25:22 OSAF-SC2 osafimmnd[2131]: NO Implementer connected: 333 (safLckService) <299, 2020f>
Oct 7 19:25:22 OSAF-SC2 osafimmnd[2131]: NO Implementer connected: 334 (safEvtService) <298, 2020f>
Oct 7 19:25:23 OSAF-SC2 osafntfimcnd[2738]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 7 19:25:23 OSAF-SC2 osafntfimcnd[2738]: ER ntfimcn_imm_init() Fail
Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA MDS Send Failed
Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA Error code 2 returned for message type 4 - ignoring

Other services also failed to initialize with IMM on the new active controller, i.e. SC-2.
Finally, SMF hit a CSI set callback timeout.
SC-2 rebooted, and because SC-2 was the only active controller at the time, the entire cluster reset.

Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: NO 'safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackTimeout' : Recovery is 'nodeFailfast'
Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: ER safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackTimeout Recovery is:nodeFailfast
Oct 7 19:25:51 OSAF-SC2 osafamfnd[2205]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Oct 7 19:25:51 OSAF-SC2 opensaf_reboot: Rebooting local node; timeout=60
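
For anyone working through the attached syslogs, the failure sequence described above can be pulled out with a small filter. The following is only a sketch: the line layout is assumed from the excerpts in this ticket, and the names `parse_line`, `failure_timeline` and the default marker strings are hypothetical helpers chosen for illustration, not part of OpenSAF.

```python
import re

# Assumed syslog layout, taken from the excerpts in this ticket:
#   "<Mon> <day> <HH:MM:SS> <host> <proc>[<pid>]: <message>"
# The "[<pid>]" part is optional (e.g. "opensaf_reboot:" has none).
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>[\w.-]+)(?:\[(?P<pid>\d+)\])?:\s+"
    r"(?P<msg>.*)$"
)

# Hypothetical default markers: strings that appear in the log lines quoted above.
DEFAULT_MARKERS = ("SA_AIS_ERR_TIMEOUT", "MDS Send Failed", "nodeFailfast")


def parse_line(line):
    """Return a dict with ts/host/proc/pid/msg, or None if the line does not match."""
    m = SYSLOG_RE.match(line.strip())
    return m.groupdict() if m else None


def failure_timeline(lines, markers=DEFAULT_MARKERS):
    """Collect (timestamp, process, message) for lines containing any marker."""
    events = []
    for line in lines:
        rec = parse_line(line)
        if rec and any(marker in rec["msg"] for marker in markers):
            events.append((rec["ts"], rec["proc"], rec["msg"]))
    return events
```

A possible use, assuming the attachment has been saved locally as SC2_syslog.txt: `failure_timeline(open("SC2_syslog.txt", errors="replace"))` returns the matching events in file order.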

3 Attachments

1529.tgz

SC1_syslog.txt

SC2_syslog.txt

Discussion

Chani Srivastava - 2015-10-08

Milestone: 4.7.RC1 --> future

Chani Srivastava - 2015-10-08

Milestone: future --> 4.5.2

Neelakanta Reddy - 2015-10-09

Component: imm --> unknown

Neelakanta Reddy - 2015-10-09

This looks like an MDS issue. Please provide the IMMND traces and mds.log.

Oct 7 19:25:23 OSAF-SC2 osafimmnd[2131]: WA MDS Send Failed


Ritu Raj - 2015-10-09

summary: Opensaf cluster went for reset while invoking failover --> Node rebooted as saImmOiInitialize_2 failed during middleware active assignment

Attachments has changed:

Diff:

--- old
+++ new
@@ -1,2 +1,3 @@
 SC1_syslog.txt (436.4 kB; text/plain)
 SC2_syslog.txt (425.6 kB; text/plain)
+1529.tgz (586.3 kB; application/x-compressed-tar)

Ritu Raj - 2015-10-09

A similar issue is observed while invoking switchovers:

After several switchovers, IMM initialization failed with ERR_TIMEOUT on the newly promoted controller SC-1, and CLMD faulted due to avaDown.

Oct 9 14:22:16 SOFO-64BIT-S1 osafntfimcnd[30122]: ER ntfimcn_imm_init saImmOiInitialize_2 failed SA_AIS_ERR_TIMEOUT (5)
Oct 9 14:22:16 SOFO-64BIT-S1 osafntfimcnd[30122]: ER ntfimcn_imm_init() Fail
Oct 9 14:22:17 SOFO-64BIT-S1 osaflogd[5406]: NO conf_runtime_obj_create: Cannot create config runtime object SA_AIS_ERR_TIMEOUT (5)
Oct 9 14:22:17 SOFO-64BIT-S1 osafclmd[5431]: ER saImmOiClassImplementerSet failed for class SaClmCluster, rc = 5,
Oct 9 14:22:17 SOFO-64BIT-S1 osafamfnd[5460]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'

The following mds.log snippet is from SC-1 at that time:

Oct 9 14:22:16.232286 osafimmnd[5396] ERR |MDS_SND_RCV: Timeout or Error occured
Oct 9 14:22:16.822751 osafntfimcnd[30122] ERR |MDS_SND_RCV: Timeout or Error occured
Oct 9 14:22:16.822899 osafntfimcnd[30122] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
Oct 9 14:22:17.213508 osafclmd[5431] ERR |MDS_SND_RCV: Timeout or Error occured
Oct 9 14:22:17.213625 osafclmd[5431] ERR |MDS_SND_RCV: Timeout occured on sndrsp message
Oct 9 14:22:17.213871 osaflogd[5406] ERR |MDS_SND_RCV: Timeout or Error occured
Oct 9 14:22:17.213949 osaflogd[5406] ERR |MDS_SND_RCV: Timeout occured on sndrsp message

The quiesced (old active) controller got promoted back to active, and the rest of the cluster is fine.
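
To see which processes were affected by the MDS timeouts in a full mds.log, the entries quoted in this comment can be counted per process. This is a rough sketch only: the line layout is assumed from the snippet in this comment, and `mds_timeouts_by_process` is a hypothetical helper name, not an OpenSAF tool.

```python
import re
from collections import Counter

# Assumed mds.log layout, taken from the snippet in this comment:
#   "<Mon> <day> <HH:MM:SS.usec> <proc>[<pid>] <LEVEL> |<message>"
MDS_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]\s+"
    r"(?P<level>\w+)\s+\|(?P<msg>.*)$"
)


def mds_timeouts_by_process(lines):
    """Count ERR-level MDS_SND_RCV timeout entries per process name."""
    counts = Counter()
    for line in lines:
        m = MDS_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        msg = m.group("msg")
        if m.group("level") == "ERR" and "MDS_SND_RCV" in msg and "Timeout" in msg:
            counts[m.group("proc")] += 1
    return counts
```

Applied to the seven lines quoted in this comment, it counts one entry for osafimmnd and two each for osafntfimcnd, osafclmd and osaflogd.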


Anders Widell - 2015-11-02

Milestone: 4.5.2 --> 4.6.2

Mathi Naickan - 2016-05-04

Milestone: 4.6.2 --> 4.7.2

Anders Widell - 2016-09-20

Milestone: 4.7.2 --> 5.0.2

Anders Widell - 2017-04-03

Milestone: 5.0.2 --> future
