The issue is seen on SLES X86. OpenSAF is running changeset 5697 with 2PBE and 50k application objects.
Switchovers with an IMM application running were in progress when the issue was observed.
Syslog on SC-1:
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafamfnd[7540]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 182 <0, 2020f> (@OpenSafImmReplicatorB)
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for Ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer (applier) connected: 193 (@OpenSafImmReplicatorB) <0, 2020f>
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000064/4294967396
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Ccb 6 COMMITTED (SetUp_Ccb)
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:50 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA Start prepare for ccb: 100000064/4294967396 towards slave PBE returned: '6' from sttandby PBE
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000064/4294967396 towards PBE-B
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000064)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafclmd[7511]: ER saImmOiImplementerSet failed rc:14, exiting
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 10 14:56:51 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 190 <17, 2010f> (safClmService)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: ER PBE PRTAttrs Update continuation missing! invoc:100
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 16 <0, 2020f> (@OpenSafImmPBE)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 17 <0, 2020f> (OsafImmPbeRt_B)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: WA Timeout on syncronous admin operation 108
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmpbed: WA Failed to delete class towards slave PBE. Library or immsv replied Rc:5 - ignoring
Syslog on SC-2:
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER Failed to stop cluster tracking 5
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ClmTrack stop failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Current role: ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: WA IMMND DOWN on active controller f1 detected at standby immd!! f2. Possible failover
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to QUIESCED
.....
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Controller Failover: Setting role to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaflogd[6850]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfd[6863]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafclmd[6877]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ellect_coord invoke from rda_callback ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO New coord elected, resides at 2020f
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO 2PBE configured, IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO This IMMND is now the NEW Coord
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Node 'SC-1' left the cluster
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7701]: NO exiting on signal 15
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA Global PURGE PERSISTENT RTO mutations received in epoch 18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA PBE failed in persistification of class delete fWiTRoVwpQDAfWNBVtqJ
......
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO FAILOVER Quiesced --> Active
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ncs_mbcsv_svc NCS_MBCSV_OP_CHG_ROLE 1 failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer connected: 200 (MsgQueueService131343) <410, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer locally disconnected. Marking it as doomed 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer (applier) connected: 201 (@OpenSafImmReplicatorA) <412, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7722]: NO Started
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 184 <10, 2020f> (safAmfService)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Re-initializing with IMM
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER saImmOiImplementerSet failed 2
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER exiting since avd_imm_impl_set failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: ER AMF director unexpectedly crashed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60
Syslog and traces are attached.
SC-2 logs
The direct cause of the cluster reset is that CLM exits on receiving ERR_EXIST on implementerSet. This could be a case of #946 (fixed in 5722:d353ca39b3d9).
If not, then someone needs to analyze what CLM is doing (it is detaching as the main OI and then fails to attach as a new OI/applier).
The CLM problem actually only explains why this SLOT2 went down.
This was a switchover, not a failover, so the other SC should have reverted the switchover.
And one more comment:
I don't think this incident is logically related to 2PBE.
A PRTO fails to get created. That is possibly more likely to happen with 2PBE than with 1PBE or 0PBE, but logically it can also happen with 1PBE.
A. SLOT1 node went down:
Sep 10 14:56:51.543332 osafclmd [7511:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 10 14:56:51.543370 osafclmd [7511:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 10 14:56:51.538179 osafimmnd [7448:immsv_evt.c:5363] T8 Sending: IMMD_EVT_ND2D_DISCARD_IMPL to 0
Sep 10 14:56:51.539878 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
Sep 10 14:56:51.539994 osafimmnd [7448:ImmModel.cc:11510] NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51.540181 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer
Sep 10 14:56:51.580449 osafimmnd [7448:immnd_evt.c:8588] T2 Global discard implementer for id:190
Sep 10 14:56:51.580462 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
Sep 10 14:56:51.580496 osafimmnd [7448:ImmModel.cc:11481] NO Implementer disconnected 190 <17, 2010f> (safClmService)
Sep 10 14:56:51.580518 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer
Sep 10 14:56:51.551900 osafclmd [7511:imma_oi_api.c:0440] << saImmOiSelectionObjectGet
Sep 10 14:56:51.551942 osafclmd [7511:clms_imm.c:2286] ER saImmOiImplementerSet failed rc:14, exiting
Sep 10 14:59:51.245982 osafclmd [2538:clms_main.c:0267] >> clms_init
Sep 10 14:56:51.548981 osafimmnd [7448:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2010f
Sep 10 14:56:51.549023 osafimmnd [7448:immnd_evt.c:2471] T2 SENDRSP FAIL 14
#946 fixes the above problem in CLM.
B. SLOT2 node went down (Quiesced --> Active):
Sep 10 14:56:57.681152 osafamfd [6896:role.cc:0375] NO FAILOVER Quiesced --> Active
saImmOiRtObjectUpdate_2 got BAD_HANDLE so AMFD tries to re-initialize with IMM and calls avd_imm_reinit_bg
Sep 10 14:56:57.701333 osafamfd [6896:imma_oi_api.c:2279] >> saImmOiRtObjectUpdate_2
Sep 10 14:56:57.701344 osafamfd [6896:imma_oi_api.c:2345] T2 ERR_BAD_HANDLE: The SaImmOiHandleT is not associated with any implementer name
Sep 10 14:56:57.701353 osafamfd [6896:imma_oi_api.c:2554] << saImmOiRtObjectUpdate_2
Sep 10 14:56:57.701362 osafamfd [6896:imm.cc:0164] TR BADHANDLE
Sep 10 14:56:57.701370 osafamfd [6896:imm.cc:1660] >> avd_imm_reinit_bg
Sep 10 14:56:57.701406 osafamfd [6896:imm.cc:1662] NO Re-initializing with IMM
Sep 10 14:56:57.701420 osafamfd [6896:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 10 14:56:57.701178 osafamfd [6896:role.cc:0498] << avd_role_failover_qsd_actv
Sep 10 14:56:57.702256 osafamfd [6896:imm.cc:1215] >> avd_imm_impl_set
Sep 10 14:56:57.702273 osafamfd [6896:imma_oi_api.c:1281] T4 ERR_LIBRARY: Overlapping use of IMM OI handle by multiple threads
Sep 10 14:56:57.703683 osafamfd [6896:imm.cc:1218] ER saImmOiImplementerSet failed 2
Sep 10 14:56:57.703788 osafamfd [6896:imm.cc:1288] ER exiting since avd_imm_impl_set failed
Because the OI handle is shared across multiple threads in the AMFD, saImmOiImplementerSet failed with ERR_LIBRARY.
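One way to avoid that overlap would be to serialize every use of the shared OI handle. The sketch below is illustrative only (the identifiers oi_handle_lock, shared_oi_handle and guarded_implementer_set are assumed, not AMFD code); it simply shows the shape of such serialization.

#include <pthread.h>
#include <saImmOi.h>

/* Illustrative only: one lock guarding the shared OI handle so that a
   background re-initialization thread and the main thread never issue
   calls on it concurrently. */
static pthread_mutex_t oi_handle_lock = PTHREAD_MUTEX_INITIALIZER;
static SaImmOiHandleT shared_oi_handle;

SaAisErrorT guarded_implementer_set(const SaImmOiImplementerNameT name)
{
    SaAisErrorT rc;

    pthread_mutex_lock(&oi_handle_lock);
    rc = saImmOiImplementerSet(shared_oi_handle, name);
    pthread_mutex_unlock(&oi_handle_lock);
    return rc;
}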
Good analysis.
We can add that the reason the AMFD got BAD_HANDLE when attempting an RtObjectUpdate is that, although the OI handle is valid, it was not associated with any implementer-name at that time.
So this must be a pure and plain bug in the AMFD. Most likely recently introduced, since otherwise we should have seen this before.
The AMFD interprets BAD_HANDLE as having the only "expected" cause: that the handle is invalid because the IMMND has restarted. By "expected" I don't mean common; I mean the only reason for which the AMFD programmer has to re-initialize the handle, namely a restart of the local IMMND. But that is not what is happening here.
The SAF spec actually explicitly says that BAD_HANDLE is to be used for this particular case (implementer-name not set when trying to perform an OI operation). While this is not wrong, it would be better in this case to use ERR_BAD_OPERATION, since that error is unambiguously a client error and the spec text for that error code on this downcall also fits the case:
ERR_BAD_OPERATION - The targeted object is not implemented by the invoking process.
So I think we should write an enhancement ticket on the immsv to change the error code for this case. It will also be a backwards-compatible change. We are talking about an interface violation here, and that should be handled by the process aborting.
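As a hedged illustration of that distinction (the implementer_is_set flag and schedule_imm_reinit() are assumptions, not OpenSAF identifiers), a director could separate the two meanings of BAD_HANDLE instead of always re-initializing:

#include <stdbool.h>
#include <stdlib.h>
#include <saImmOi.h>

/* Assumed helper: schedules handle re-initialization after a local IMMND restart. */
extern void schedule_imm_reinit(void);

/* Illustrative handling of the return code from an OI downcall such as
   saImmOiRtObjectUpdate_2(): BAD_HANDLE with no implementer set is a local
   programming error (premature use of the handle), not a stale handle. */
void handle_oi_rc(SaAisErrorT rc, bool implementer_is_set)
{
    if (rc == SA_AIS_ERR_BAD_HANDLE) {
        if (!implementer_is_set)
            abort();               /* interface violation: fail fast */
        else
            schedule_imm_reinit(); /* handle is stale: IMMND restarted */
    }
}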
We should also write a defect ticket on the AMFD, or use this ticket, to track a fix for the AMFD bug - premature use of an OI handle. This ticket also points to an independent CLM bug, so we should probably have two tickets.
/AndersBj
Neel correctly points out that #946 may fix the CLM problem.
This is true if the ERR_EXIST on implementer-set is due to the prior OI having detached locally at this node but not yet been confirmed globally over fevs. But since this is a switchover, it is more likely that the OI detached on the other SC and that the AMF is faster in processing the quiesced-ack from the old active CLMD and ordering the CLMD at this node to become the new active (and thus allocate the OI).
So I think getting ERR_EXIST on implementer-set at the new active may unfortunately be a fact of life for the switchover case.
The only fix I see here: the new active knows that this is (or could be) a switchover, since it has just been ordered to become active. It could in this context interpret ERR_EXIST from implementer-set as effectively TRY_AGAIN.
Perhaps even simpler: director services could always interpret ERR_EXIST on implementer-set as TRY_AGAIN.
As always, a TRY_AGAIN loop must be finite. And implementer-set is not blocked by imm-sync, so we are not talking 60 seconds here. At MOST we are talking sub-second, the fevs turnaround latency.
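A minimal sketch of such a bounded retry, assuming a wrapper of this shape (implementer_set_with_retry, the 100 ms back-off and the 50-attempt cap are illustrative, not OpenSAF code):

#include <unistd.h>
#include <saImmOi.h>

/* Treat ERR_EXIST from implementer-set like TRY_AGAIN for a bounded time,
   giving the old implementer release time to propagate over fevs. */
SaAisErrorT implementer_set_with_retry(SaImmOiHandleT oi_handle,
                                       const SaImmOiImplementerNameT name)
{
    SaAisErrorT rc;
    int attempts = 0;

    do {
        rc = saImmOiImplementerSet(oi_handle, name);
        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_EXIST)
            break;
        usleep(100 * 1000);        /* 100 ms between attempts */
    } while (++attempts < 50);     /* finite: ~5 s, far above fevs latency */

    return rc;
}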
/AndersBj
Neel,
I have seen another issue after a switchover. This time CLM failed with BAD_HANDLE on classImplSet(). Is this issue also resolved by #946, or should we open a new ticket for it?
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally disconnected. Marking it as doomed 26 <5, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmpbed: IN Delaying class delete at slave PBE due to ongoing commit of ccb:15/21
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer disconnected 26 <5, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer connected: 30 (safClmService) <324, 2020f>
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafclmd[9606]: ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally disconnected. Marking it as doomed 30 <324, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went down so no response
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went down so no response
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer disconnected 30 <324, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60
clmd traces on that node
Sep 11 14:58:24.715267 osafclmd [9606:clms_imm.c:0530] TR Implementer not yet set: Switching on the tryagain flag
Sep 11 14:58:24.715296 osafclmd [9606:clms_imm.c:0533] << clms_node_update_rattr
Sep 11 14:58:24.715319 osafclmd [9606:clms_imm.c:0449] >> clms_admin_state_update_rattr: Admin state 1 update for node safNode=SC-1,safCluster=myClmCluster
Sep 11 14:58:24.715343 osafclmd [9606:clms_util.c:0036] >> clms_node_get_by_name: name input safNode=SC-1,safCluster=myClmCluster length 36
Sep 11 14:58:24.715369 osafclmd [9606:clms_util.c:0044] TR nodename after patricia tree get safNode=SC-1,safCluster=myClmCluster
Sep 11 14:58:24.715393 osafclmd [9606:clms_util.c:0047] << clms_node_get_by_name
Sep 11 14:58:24.715421 osafclmd [9606:imma_oi_api.c:2275] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 11 14:58:24.715452 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=SC-1,safCluster=myClmCluster failed with rc = 9. Reinit with IMM
Sep 11 14:58:24.715480 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 11 14:58:24.715505 osafclmd [9606:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!
....
Sep 11 14:58:24.719788 osafclmd [9606:clms_imm.c:0721] TR Implementer is not set. Switching on flag in clms_cluster_update_rattr
Sep 11 14:58:24.719801 osafclmd [9606:clms_imm.c:0724] << clms_cluster_update_rattr
Sep 11 14:58:24.719815 osafclmd [9606:clms_imm.c:0633] << clms_retry_pending_rtupdates
Sep 11 14:58:24.719828 osafclmd [9606:clms_main.c:0435] TR There is an IMM task to be tried again. setting poll time out to 500
.....
Sep 11 14:58:24.740335 osafclmd [9606:clms_imm.c:0620] << clms_retry_pending_rtupdates: Implementerset yet to happen, try later
Sep 11 14:58:24.740345 osafclmd [9606:clms_main.c:0435] TR There is an IMM task to be tried again. setting poll time out to 500
Sep 11 14:58:24.744013 osafclmd [9606:imma_oi_api.c:1636] T2 ERR_BAD_HANDLE: No implementer is set for this handle
Sep 11 14:58:24.744055 osafclmd [9606:clms_imm.c:2291] ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting
Regards,
Sirisha
See mail discussion related to this ticket.
At SC switchover, it can happen that:
1) The old active CLMD quiesces at the old active SC and acks to the AMFD.
2) This reaches the AMFD on the new active SC (via AMFD<->AMFD direct communication).
3) The implementer-release done by the old active CLMD must pass over fevs.
4) The CLMD at the new active SC is ordered to become active.
5) The CLMD at the new active SC attempts to set the implementer before the implementer release has reached the new active SC over fevs.
6) The implementer-set request is rejected by the local IMMND with ERR_EXIST.
Point (6) above will not be reproduced after #946.
CLM handles ERR_EXIST and TRY_AGAIN during role change (switchover/failover).
You may be correct that #946 will, in this particular scenario, prevent CLMD from getting an ERR_EXIST.
It will get an ERR_TRY_AGAIN from the #946 fix, but ONLY IF the implementer was released locally, i.e. by the same CLMD that is here trying to set it.
Maybe that is the case here: this CLMD has done implementer-set successfully, then released it as part of the switchover, but then has to set it again (very soon) and is forced to wait on its own release (by the #946 fix, once it is in place).
What troubles me is that for the more normal switchover case and for the failover case, the CLMD (and I am sure most other directors) are still vulnerable. In those cases the implementer is released on one SC but allocated at another SC, and #946 does not help there.
The CLMD (and other directors) may then still get ERR_EXIST when trying to set the implementer. And since CLMD (and probably the other directors) don't handle ERR_EXIST as TRY_AGAIN, they will exit the same way that CLMD exits here.
Or am I missing something?
/AndersBj
F.Y.I.
The implementerSet() operation is done from within two functions, in two scenarios:
From within functionX, during initial role assignment or role changes (switchover/failover). FunctionX is coded to handle ERR_EXIST and TRY_AGAIN.
From within functionY, during handling of BAD_HANDLE, i.e. while re-initializing with IMM. FunctionY uses immutils and does not handle ERR_EXIST, because the IMM re-init is attempted only after saImmOiFinalize().
Now, w.r.t. the sequence/scenario in this ticket, I think an 'ideal' thing would be:
(a) The local IMM returns TRY_AGAIN to the OI instead of ERR_EXIST, because the OI just cleared itself by calling saImmOiFinalize() and the local IMM knows about it. The internal processing (FEVS, etc.) is immaterial to the OI.
OR
(b) Block in saImmOiFinalize() and return to the caller only when the FEVS protocol completes.
I have come to know that #946 is the same as (a). If that is not the case, and if either (a) or (b) involves complex fixes in IMM, we can change functionY to handle ERR_EXIST.
But there is still a problem to fix in the AMFD related to #1059.
This is the case where the AMFD tries to perform an RTO operation using a valid OI handle, but the handle is not associated with any implementer/implementer-name.
My guess is that this case sometimes happens (it happens in this incident) when there is a failed switchover.
Probably the AMFD releases the implementer (but keeps the handle open and valid) in order to switch from OI to applier. But then the switchover fails and is reverted (?), and the old active AMFD becomes active again. Somewhere in this region the AMFD tries to update RT data, finds that it is "active", and does so prematurely, while it still has not set the implementer/implementer-name.
That needs a ticket. Either a new ticket, or we can use this one since it is now "free" from the CLMD side of the incident.
/AndersBj
The CLMD terminated due to a problem fixed by ticket #946.
The AMFD terminated due to the same problem described in #707.
Neel,
Do we have an explanation for why we are returning BAD_OP (20) for rt object here?
Sep 11 14:58:24.709606 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=PL-4,safCluster=myClmCluster failed with rc = 20. Reinit with IMM
Sirisha,
Does this test (of switchovers) also involve CLM admin operations?
Mathi,
The test does not involve any CLM admin operations.
Mathi,
The problem is the same as #1057:
Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000030/4294967344 towards PBE-B
Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000030)
Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmnd[4276]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
CLM is creating a new thread for each RT object update failure, and all of the re-init threads share the same variable for storing the OI handle.
Re-initializing the IMM handle once is sufficient; the main thread could use a state variable to remember whether a re-init thread has already been created, as in the sketch below.
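A minimal sketch of that state-variable idea (request_imm_reinit, reinit_in_progress and imm_reinit_thread are assumed names, not CLMS code): the main thread starts at most one re-init worker, and the worker clears the flag when it finishes.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t reinit_lock = PTHREAD_MUTEX_INITIALIZER;
static bool reinit_in_progress;

/* Assumed worker: re-initializes the OI handle, then clears
   reinit_in_progress under reinit_lock before exiting. */
extern void *imm_reinit_thread(void *arg);

void request_imm_reinit(void)
{
    pthread_t thread;

    pthread_mutex_lock(&reinit_lock);
    if (!reinit_in_progress &&
        pthread_create(&thread, NULL, imm_reinit_thread, NULL) == 0) {
        reinit_in_progress = true;
        pthread_detach(thread);
    }
    pthread_mutex_unlock(&reinit_lock);
}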
/Neel