The issue is seen on SLES X86. OpenSAF is running changeset 5697 with 2PBE and 50k application objects.
Switchovers with an IMM application running were in progress when the issue was observed.
Syslog on SC-1:
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafamfnd[7540]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 182 <0, 2020f> (@OpenSafImmReplicatorB)
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for Ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer (applier) connected: 193 (@OpenSafImmReplicatorB) <0, 2020f>
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000064/4294967396
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Ccb 6 COMMITTED (SetUp_Ccb)
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:50 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA Start prepare for ccb: 100000064/4294967396 towards slave PBE returned: '6' from sttandby PBE
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000064/4294967396 towards PBE-B
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000064)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafclmd[7511]: ER saImmOiImplementerSet failed rc:14, exiting
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 10 14:56:51 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 190 <17, 2010f> (safClmService)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: ER PBE PRTAttrs Update continuation missing! invoc:100
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 16 <0, 2020f> (@OpenSafImmPBE)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 17 <0, 2020f> (OsafImmPbeRt_B)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: WA Timeout on syncronous admin operation 108
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmpbed: WA Failed to delete class towards slave PBE. Library or immsv replied Rc:5 - ignoring
Syslog on SC-2:
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER Failed to stop cluster tracking 5
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ClmTrack stop failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Current role: ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: WA IMMND DOWN on active controller f1 detected at standby immd!! f2. Possible failover
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to QUIESCED
.....
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Controller Failover: Setting role to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaflogd[6850]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfd[6863]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafclmd[6877]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ellect_coord invoke from rda_callback ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO New coord elected, resides at 2020f
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO 2PBE configured, IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO This IMMND is now the NEW Coord
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Node 'SC-1' left the cluster
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7701]: NO exiting on signal 15
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA Global PURGE PERSISTENT RTO mutations received in epoch 18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA PBE failed in persistification of class delete fWiTRoVwpQDAfWNBVtqJ
......
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO FAILOVER Quiesced --> Active
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ncs_mbcsv_svc NCS_MBCSV_OP_CHG_ROLE 1 failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer connected: 200 (MsgQueueService131343) <410, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer locally disconnected. Marking it as doomed 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer (applier) connected: 201 (@OpenSafImmReplicatorA) <412, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7722]: NO Started
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 184 <10, 2020f> (safAmfService)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Re-initializing with IMM
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER saImmOiImplementerSet failed 2
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER exiting since avd_imm_impl_set failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: ER AMF director unexpectedly crashed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60
Syslog and traces are attached.
SC-2 logs
The direct cause of the cluster reset is that CLM exits on receiving ERR_EXIST on implementerSet. This could be a case of #946 (fixed in 5722:d353ca39b3d9).
If not, then someone needs to analyze what CLM is doing (it is detaching as the main OI and then fails to attach as a new OI/applier).
The CLM problem actually only explains why this SLOT2 went down.
This was a switchover, not a failover, so the other SC should have reverted the switchover.
And one more comment:
I don't think this incident is logically related to 2PBE.
A PRTO fails to get created. That is possibly more likely to happen with 2PBE than with 1PBE or 0PBE, but logically it can also happen with 1PBE.
A. SLOT1 node went down:
Sep 10 14:56:51.543332 osafclmd [7511:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 10 14:56:51.543370 osafclmd [7511:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 10 14:56:51.538179 osafimmnd [7448:immsv_evt.c:5363] T8 Sending: IMMD_EVT_ND2D_DISCARD_IMPL to 0
Sep 10 14:56:51.539878 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
Sep 10 14:56:51.539994 osafimmnd [7448:ImmModel.cc:11510] NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51.540181 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer
Sep 10 14:56:51.580449 osafimmnd [7448:immnd_evt.c:8588] T2 Global discard implementer for id:190
Sep 10 14:56:51.580462 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
Sep 10 14:56:51.580496 osafimmnd [7448:ImmModel.cc:11481] NO Implementer disconnected 190 <17, 2010f> (safClmService)
Sep 10 14:56:51.580518 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer
Sep 10 14:56:51.551900 osafclmd [7511:imma_oi_api.c:0440] << saImmOiSelectionObjectGet
Sep 10 14:56:51.551942 osafclmd [7511:clms_imm.c:2286] ER saImmOiImplementerSet failed rc:14, exiting
Sep 10 14:59:51.245982 osafclmd [2538:clms_main.c:0267] >> clms_init
Sep 10 14:56:51.548981 osafimmnd [7448:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2010f
Sep 10 14:56:51.549023 osafimmnd [7448:immnd_evt.c:2471] T2 SENDRSP FAIL 14
#946 fixes the above problem in CLM.
B. SLOT2 node went down (Quiesced --> Active):
Sep 10 14:56:57.681152 osafamfd [6896:role.cc:0375] NO FAILOVER Quiesced --> Active
saImmOiRtObjectUpdate_2 got BAD_HANDLE so AMFD tries to re-initialize with IMM and calls avd_imm_reinit_bg
Sep 10 14:56:57.701333 osafamfd [6896:imma_oi_api.c:2279] >> saImmOiRtObjectUpdate_2
Sep 10 14:56:57.701344 osafamfd [6896:imma_oi_api.c:2345] T2 ERR_BAD_HANDLE: The SaImmOiHandleT is not associated with any implementer name
Sep 10 14:56:57.701353 osafamfd [6896:imma_oi_api.c:2554] << saImmOiRtObjectUpdate_2
Sep 10 14:56:57.701362 osafamfd [6896:imm.cc:0164] TR BADHANDLE
Sep 10 14:56:57.701370 osafamfd [6896:imm.cc:1660] >> avd_imm_reinit_bg
Sep 10 14:56:57.701406 osafamfd [6896:imm.cc:1662] NO Re-initializing with IMM
Sep 10 14:56:57.701420 osafamfd [6896:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 10 14:56:57.701178 osafamfd [6896:role.cc:0498] << avd_role_failover_qsd_actv
Sep 10 14:56:57.702256 osafamfd [6896:imm.cc:1215] >> avd_imm_impl_set
Sep 10 14:56:57.702273 osafamfd [6896:imma_oi_api.c:1281] T4 ERR_LIBRARY: Overlapping use of IMM OI handle by multiple threads
Sep 10 14:56:57.703683 osafamfd [6896:imm.cc:1218] ER saImmOiImplementerSet failed 2
Sep 10 14:56:57.703788 osafamfd [6896:imm.cc:1288] ER exiting since avd_imm_impl_set failed
Because the OI handle is shared across multiple threads in the AMFD, saImmOiImplementerSet failed with ERR_LIBRARY.
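One way to avoid that overlap would be to serialize every use of the shared OI handle. The sketch below is illustrative only (the identifiers oi_handle_lock, shared_oi_handle and guarded_implementer_set are assumed, not AMFD code); it simply shows the shape of such serialization.

#include <pthread.h>
#include <saImmOi.h>

/* Illustrative only: one lock guarding the shared OI handle so that a
   background re-initialization thread and the main thread never issue
   calls on it concurrently. */
static pthread_mutex_t oi_handle_lock = PTHREAD_MUTEX_INITIALIZER;
static SaImmOiHandleT shared_oi_handle;

SaAisErrorT guarded_implementer_set(const SaImmOiImplementerNameT name)
{
    SaAisErrorT rc;

    pthread_mutex_lock(&oi_handle_lock);
    rc = saImmOiImplementerSet(shared_oi_handle, name);
    pthread_mutex_unlock(&oi_handle_lock);
    return rc;
}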
Good analysis.
We can add that the reason the AMFD got BAD_HANDLE when attempting an RtObjectUpdate is that, although the OI handle is valid, it was not associated with any implementer-name at that time.
So this must be a pure and plain bug in the AMFD. Most likely recently introduced, since otherwise we should have seen this before.
The AMFD interprets BAD_HANDLE as having the only "expected" cause: that the handle is invalid because the IMMND has restarted. By "expected" I don't mean common; I mean the only reason for which the AMFD programmer has to re-initialize the handle, namely a restart of the local IMMND. But that is not what is happening here.
The SAF spec actually explicitly says that BAD_HANDLE is to be used for this particular case (implementer-name not set when trying to perform an OI operation). While this is not wrong, it would be better in this case to use ERR_BAD_OPERATION, since that error is unambiguously a client error and the spec text for that error code on this downcall also fits the case:
ERR_BAD_OPERATION - The targeted object is not implemented by the invoking process.
So I think we should write an enhancement ticket on the immsv to change the error code for this case. It will also be a backwards-compatible change. We are talking about an interface violation here, and that should be handled by the process aborting.
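As a hedged illustration of that distinction (the implementer_is_set flag and schedule_imm_reinit() are assumptions, not OpenSAF identifiers), a director could separate the two meanings of BAD_HANDLE instead of always re-initializing:

#include <stdbool.h>
#include <stdlib.h>
#include <saImmOi.h>

/* Assumed helper: schedules handle re-initialization after a local IMMND restart. */
extern void schedule_imm_reinit(void);

/* Illustrative handling of the return code from an OI downcall such as
   saImmOiRtObjectUpdate_2(): BAD_HANDLE with no implementer set is a local
   programming error (premature use of the handle), not a stale handle. */
void handle_oi_rc(SaAisErrorT rc, bool implementer_is_set)
{
    if (rc == SA_AIS_ERR_BAD_HANDLE) {
        if (!implementer_is_set)
            abort();               /* interface violation: fail fast */
        else
            schedule_imm_reinit(); /* handle is stale: IMMND restarted */
    }
}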
We should also write a defect ticket on the AMFD, or use this ticket, to track a fix for the AMFD bug - premature use of an OI handle. This ticket also points to an independent CLM bug, so we should probably have two tickets.
/AndersBj
Neel correctly points out that #946 may fix the CLM problem.
This is true if the ERR_EXIST on implementer-set is due to the prior OI having detached locally at this node but not yet been confirmed globally over fevs. But since this is a switchover, it is more likely that the OI detached on the other SC and that the AMF is faster in processing the quiesced-ack from the old active CLMD and ordering the CLMD at this node to become the new active (and thus allocate the OI).
So I think getting ERR_EXIST on implementer-set at the new active may unfortunately be a fact of life for the switchover case.
The only fix I see here: the new active knows that this is (or could be) a switchover, since it has just been ordered to become active. It could in this context interpret ERR_EXIST from implementer-set as effectively TRY_AGAIN.
Perhaps even simpler: director services could always interpret ERR_EXIST on implementer-set as TRY_AGAIN.
As always, a TRY_AGAIN loop must be finite. And implementer-set is not blocked by imm-sync, so we are not talking 60 seconds here. At MOST we are talking sub-second, the fevs turnaround latency.
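A minimal sketch of such a bounded retry, assuming a wrapper of this shape (implementer_set_with_retry, the 100 ms back-off and the 50-attempt cap are illustrative, not OpenSAF code):

#include <unistd.h>
#include <saImmOi.h>

/* Treat ERR_EXIST from implementer-set like TRY_AGAIN for a bounded time,
   giving the old implementer release time to propagate over fevs. */
SaAisErrorT implementer_set_with_retry(SaImmOiHandleT oi_handle,
                                       const SaImmOiImplementerNameT name)
{
    SaAisErrorT rc;
    int attempts = 0;

    do {
        rc = saImmOiImplementerSet(oi_handle, name);
        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_EXIST)
            break;
        usleep(100 * 1000);        /* 100 ms between attempts */
    } while (++attempts < 50);     /* finite: ~5 s, far above fevs latency */

    return rc;
}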
/AndersBj
Neel,
I have seen another issue after a switchover. This time CLM failed with BAD_HANDLE on classImplSet(). Is this issue also resolved by #946, or should we open a new ticket for it?
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally disconnected. Marking it as doomed 26 <5, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmpbed: IN Delaying class delete at slave PBE due to ongoing commit of ccb:15/21
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer disconnected 26 <5, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer connected: 30 (safClmService) <324, 2020f>
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafclmd[9606]: ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally disconnected. Marking it as doomed 30 <324, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went down so no response
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went down so no response
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer disconnected 30 <324, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60
clmd traces on that node
Sep 11 14:58:24.715267 osafclmd [9606:clms_imm.c:0530] TR Implementer not yet set: Switching on the tryagain flag
Sep 11 14:58:24.715296 osafclmd [9606:clms_imm.c:0533] << clms_node_update_rattr
Sep 11 14:58:24.715319 osafclmd [9606:clms_imm.c:0449] >> clms_admin_state_update_rattr: Admin state 1 update for node safNode=SC-1,safCluster=myClmCluster
Sep 11 14:58:24.715343 osafclmd [9606:clms_util.c:0036] >> clms_node_get_by_name: name input safNode=SC-1,safCluster=myClmCluster length 36
Sep 11 14:58:24.715369 osafclmd [9606:clms_util.c:0044] TR nodename after patricia tree get safNode=SC-1,safCluster=myClmCluster
Sep 11 14:58:24.715393 osafclmd [9606:clms_util.c:0047] << clms_node_get_by_name
Sep 11 14:58:24.715421 osafclmd [9606:imma_oi_api.c:2275] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 11 14:58:24.715452 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=SC-1,safCluster=myClmCluster failed with rc = 9. Reinit with IMM
Sep 11 14:58:24.715480 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 11 14:58:24.715505 osafclmd [9606:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!
....
Sep 11 14:58:24.719788 osafclmd [9606:clms_imm.c:0721] TR Implementer is not set. Switching on flag in clms_cluster_update_rattr
Sep 11 14:58:24.719801 osafclmd [9606:clms_imm.c:0724] << clms_cluster_update_rattr
Sep 11 14:58:24.719815 osafclmd [9606:clms_imm.c:0633] << clms_retry_pending_rtupdates
Sep 11 14:58:24.719828 osafclmd [9606:clms_main.c:0435] TR There is an IMM task to be tried again. setting poll time out to 500
.....
Sep 11 14:58:24.740335 osafclmd [9606:clms_imm.c:0620] << clms_retry_pending_rtupdates: Implementerset yet to happen, try later
Sep 11 14:58:24.740345 osafclmd [9606:clms_main.c:0435] TR There is an IMM task to be tried again. setting poll time out to 500
Sep 11 14:58:24.744013 osafclmd [9606:imma_oi_api.c:1636] T2 ERR_BAD_HANDLE: No implementer is set for this handle
Sep 11 14:58:24.744055 osafclmd [9606:clms_imm.c:2291] ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting
Regards,
Sirisha
See mail discussion related to this ticket.
At SC switchover, it can happen that:
1) The old active CLMD quiesces at the old active SC and acks to the AMFD.
2) This reaches the AMFD on the new active SC (via AMFD<->AMFD direct communication).
3) The implementer-release done by the old active CLMD must pass over fevs.
4) The CLMD at the new active SC is ordered to become active.
5) The CLMD at the new active SC attempts to set the implementer before the implementer release has reached the new active SC over fevs.
6) The implementer-set request is rejected by the local IMMND with ERR_EXIST.
Point (6) above will not be reproduced after #946.
CLM handles ERR_EXIST and TRY_AGAIN during role change (switchover/failover).
You may be correct that #946 will, in this particular scenario, prevent CLMD from getting an ERR_EXIST.
It will get an ERR_TRY_AGAIN from the #946 fix, but ONLY IF the implementer was released locally, i.e. by the same CLMD that is here trying to set it.
Maybe that is the case here: this CLMD has done implementer-set successfully, then released it as part of the switchover, but then has to set it again (very soon) and is forced to wait on its own release (by the #946 fix, once it is in place).
What troubles me is that for the more normal switchover case and for the failover case, the CLMD (and I am sure most other directors) are still vulnerable. In those cases the implementer is released on one SC but allocated at another SC, and #946 does not help there.
The CLMD (and other directors) may then still get ERR_EXIST when trying to set the implementer. And since CLMD (and probably the other directors) don't handle ERR_EXIST as TRY_AGAIN, they will exit the same way that CLMD exits here.
Or am I missing something?
/AndersBj
F.Y.I.
The implementerSet() operation is done from within two functions, in two scenarios:
From within functionX, during initial role assignment or role changes (switchover/failover). FunctionX is coded to handle ERR_EXIST and TRY_AGAIN.
From within functionY, during handling of BAD_HANDLE, i.e. while re-initializing with IMM. FunctionY uses immutils and does not handle ERR_EXIST, because the IMM re-init is attempted only after saImmOiFinalize().
Now, w.r.t. the sequence/scenario in this ticket, I think an 'ideal' thing would be:
(a) The local IMM returns TRY_AGAIN to the OI instead of ERR_EXIST, because the OI just cleared itself by calling saImmOiFinalize() and the local IMM knows about it. The internal processing (FEVS, etc.) is immaterial to the OI.
OR
(b) Block in saImmOiFinalize() and return to the caller only when the FEVS protocol completes.
I have come to know that #946 is the same as (a). If that is not the case, and if either (a) or (b) involves complex fixes in IMM, we can change functionY to handle ERR_EXIST.
But there is still a problem to fix in the AMFD related to #1059.
This is the case where the AMFD tries to perform an RTO operation using a valid OI handle, but the handle is not associated with any implementer/implementer-name.
My guess is that this case sometimes happens (it happens in this incident) when there is a failed switchover.
Probably the AMFD releases the implementer (but keeps the handle open and valid) in order to switch from OI to applier. But then the switchover fails and is reverted (?), and the old active AMFD becomes active again. Somewhere in this region the AMFD tries to update RT data, finds that it is "active", and does so prematurely, while it still has not set the implementer/implementer-name.
That needs a ticket. Either a new ticket, or we can use this one since it is now "free" from the CLMD side of the incident.
/AndersBj
The CLMD terminated due to a problem fixed by ticket #946.
The AMFD terminated due to the same problem described in #707.
Neel,
Do we have an explanation for why we are returning BAD_OP (20) for rt object here?
Sep 11 14:58:24.709606 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=PL-4,safCluster=myClmCluster failed with rc = 20. Reinit with IMM
Sirisha,
Does this test (of switchovers) also involve CLM admin operations?
Mathi,
The test does not involve any CLM admin operations.
Mathi,
The problem is the same as #1057:
Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000030/4294967344 towards PBE-B
Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000030)
Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmnd[4276]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
CLM is creating a new thread for each RT object update failure, and all of the re-init threads share the same variable for storing the OI handle.
Re-initializing the IMM handle once is sufficient; the main thread could use a state variable to remember whether a re-init thread has already been created, as in the sketch below.
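A minimal sketch of that state-variable idea (request_imm_reinit, reinit_in_progress and imm_reinit_thread are assumed names, not CLMS code): the main thread starts at most one re-init worker, and the worker clears the flag when it finishes.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t reinit_lock = PTHREAD_MUTEX_INITIALIZER;
static bool reinit_in_progress;

/* Assumed worker: re-initializes the OI handle, then clears
   reinit_in_progress under reinit_lock before exiting. */
extern void *imm_reinit_thread(void *arg);

void request_imm_reinit(void)
{
    pthread_t thread;

    pthread_mutex_lock(&reinit_lock);
    if (!reinit_in_progress &&
        pthread_create(&thread, NULL, imm_reinit_thread, NULL) == 0) {
        reinit_in_progress = true;
        pthread_detach(thread);
    }
    pthread_mutex_unlock(&reinit_lock);
}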
/Neel