
#1059 CLM: ERR_EXIST from implementer-set needs to be handled as limited TRY_AGAIN.

Status: duplicate
Owner: nobody
Type: defect
Component: clm
Milestone: 4.5 FC
Severity: major
Created: 2014-09-10
Updated: 2015-11-02

The issue is seen on SLES X86. OpenSAF is running with changeset 5697, with 2PBE and 50k application objects.

Switchovers are in progress, with an IMM application running, when the issue is observed.

Syslog on SC-1:

Sep 10 14:56:47 SLES-64BIT-SLOT1 osafamfnd[7540]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 182 <0, 2020f> (@OpenSafImmReplicatorB)
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for Ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer (applier) connected: 193 (@OpenSafImmReplicatorB) <0, 2020f>
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000064/4294967396
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Ccb 6 COMMITTED (SetUp_Ccb)
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:50 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA Start prepare for ccb: 100000064/4294967396 towards slave PBE returned: '6' from sttandby PBE
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000064/4294967396 towards PBE-B
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000064)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafclmd[7511]: ER saImmOiImplementerSet failed rc:14, exiting
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 10 14:56:51 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 190 <17, 2010f> (safClmService)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: ER PBE PRTAttrs Update continuation missing! invoc:100
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 16 <0, 2020f> (@OpenSafImmPBE)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 17 <0, 2020f> (OsafImmPbeRt_B)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: WA Timeout on syncronous admin operation 108
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmpbed: WA Failed to delete class towards slave PBE. Library or immsv replied Rc:5 - ignoring

Syslog on SC-2:

Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER Failed to stop cluster tracking 5
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ClmTrack stop failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Current role: ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: WA IMMND DOWN on active controller f1 detected at standby immd!! f2. Possible failover
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to QUIESCED
.....
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Controller Failover: Setting role to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaflogd[6850]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfd[6863]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafclmd[6877]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ellect_coord invoke from rda_callback ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO New coord elected, resides at 2020f
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO 2PBE configured, IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO This IMMND is now the NEW Coord
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Node 'SC-1' left the cluster
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7701]: NO exiting on signal 15
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA Global PURGE PERSISTENT RTO mutations received in epoch 18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA PBE failed in persistification of class delete fWiTRoVwpQDAfWNBVtqJ
......
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO FAILOVER Quiesced --> Active
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ncs_mbcsv_svc NCS_MBCSV_OP_CHG_ROLE 1 failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer connected: 200 (MsgQueueService131343) <410, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer locally disconnected. Marking it as doomed 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer (applier) connected: 201 (@OpenSafImmReplicatorA) <412, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7722]: NO Started
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 184 <10, 2020f> (safAmfService)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Re-initializing with IMM
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER saImmOiImplementerSet failed 2
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER exiting since avd_imm_impl_set failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: ER AMF director unexpectedly crashed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60

Syslog and traces are attached.

Attachments: 1


Discussion

  • Sirisha Alla - 2014-09-10

    SC-2 logs

     
  • Anders Bjornerstedt

    • Component: unknown --> clm
     
  • Anders Bjornerstedt

    The direct cause of the cluster reset is that CLM exits on receiving
    ERR_EXIST on implementerSet. This could be a case of #946
    (fixed in 5722:d353ca39b3d9).

    If not, then someone needs to analyze what CLM is doing (it is detaching as the main OI and failing to attach as a new OI/Applier).

     
  • Anders Bjornerstedt

    The CLM problem actually only explains why this SLOT2 went down.
    This was a switchover, not a failover, so the other SC should have
    reverted the switchover.

     
  • Anders Bjornerstedt

    And one more comment:
    I don't think that this incident is logically related to 2PBE.
    A PRTO fails to get created.
    That is possibly more likely to happen with 2PBE than with 1PBE or 0PBE,
    but logically it can also happen with 1PBE.

     
  • Neelakanta Reddy

    A. SLOT1 node went down:

    1. CLM got BAD_HANDLE and finalized the handle

    Sep 10 14:56:51.543332 osafclmd [7511:imma_oi_api.c:0622] >> saImmOiFinalize
    Sep 10 14:56:51.543370 osafclmd [7511:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!

    2. Discard implementer is called

    Sep 10 14:56:51.538179 osafimmnd [7448:immsv_evt.c:5363] T8 Sending: IMMD_EVT_ND2D_DISCARD_IMPL to 0
    Sep 10 14:56:51.539878 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
    Sep 10 14:56:51.539994 osafimmnd [7448:ImmModel.cc:11510] NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
    Sep 10 14:56:51.540181 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer

    3. But the implementer actually got disconnected at

    Sep 10 14:56:51.580449 osafimmnd [7448:immnd_evt.c:8588] T2 Global discard implementer for id:190
    Sep 10 14:56:51.580462 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
    Sep 10 14:56:51.580496 osafimmnd [7448:ImmModel.cc:11481] NO Implementer disconnected 190 <17, 2010f> (safClmService)
    Sep 10 14:56:51.580518 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer

    4. CLM tries to re-initialize and receives ERR_EXIST

    Sep 10 14:56:51.551900 osafclmd [7511:imma_oi_api.c:0440] << saImmOiSelectionObjectGet
    Sep 10 14:56:51.551942 osafclmd [7511:clms_imm.c:2286] ER saImmOiImplementerSet failed rc:14, exiting
    Sep 10 14:59:51.245982 osafclmd [2538:clms_main.c:0267] >> clms_init

    Sep 10 14:56:51.548981 osafimmnd [7448:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2010f
    Sep 10 14:56:51.549023 osafimmnd [7448:immnd_evt.c:2471] T2 SENDRSP FAIL 14

    #946 fixes the above problem in CLM

    B. SLOT2 node went down (Quiesced --> Active)

    1. Sep 10 14:56:57.681152 osafamfd [6896:role.cc:0375] NO FAILOVER Quiesced --> Active

    2. saImmOiRtObjectUpdate_2 got BAD_HANDLE so AMFD tries to re-initialize with IMM and calls avd_imm_reinit_bg

    Sep 10 14:56:57.701333 osafamfd [6896:imma_oi_api.c:2279] >> saImmOiRtObjectUpdate_2
    Sep 10 14:56:57.701344 osafamfd [6896:imma_oi_api.c:2345] T2 ERR_BAD_HANDLE: The SaImmOiHandleT is not associated with any implementer name
    Sep 10 14:56:57.701353 osafamfd [6896:imma_oi_api.c:2554] << saImmOiRtObjectUpdate_2
    Sep 10 14:56:57.701362 osafamfd [6896:imm.cc:0164] TR BADHANDLE
    Sep 10 14:56:57.701370 osafamfd [6896:imm.cc:1660] >> avd_imm_reinit_bg
    Sep 10 14:56:57.701406 osafamfd [6896:imm.cc:1662] NO Re-initializing with IMM
    Sep 10 14:56:57.701420 osafamfd [6896:imma_oi_api.c:0622] >> saImmOiFinalize

    3. Before the finalize has completed clearing the OI handle, impl_set is called by AMFD in the function avd_role_failover_qsd_actv (via avd_imm_impl_set_task_create). Because of this, AMFD exited.

    Sep 10 14:56:57.701178 osafamfd [6896:role.cc:0498] << avd_role_failover_qsd_actv

    Sep 10 14:56:57.702256 osafamfd [6896:imm.cc:1215] >> avd_imm_impl_set
    Sep 10 14:56:57.702273 osafamfd [6896:imma_oi_api.c:1281] T4 ERR_LIBRARY: Overlapping use of IMM OI handle by multiple threads
    Sep 10 14:56:57.703683 osafamfd [6896:imm.cc:1218] ER saImmOiImplementerSet failed 2
    Sep 10 14:56:57.703788 osafamfd [6896:imm.cc:1288] ER exiting since avd_imm_impl_set failed

    Because the OI handle is shared across multiple threads in AMFD, saImmOiImplementerSet failed with ERR_LIBRARY.
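
    One way to avoid that overlapping-use error is to ensure that at most one thread at a time touches the shared OI handle. A minimal sketch of such serialization (not the actual AMFD code; the wrapper names and locking scheme are assumptions):

    /* Serialize all calls on a shared SaImmOiHandleT so that a background
     * re-initialization (saImmOiFinalize + saImmOiInitialize_2) can never
     * overlap with saImmOiImplementerSet from another thread. */
    #include <pthread.h>
    #include <saImmOi.h>

    static pthread_mutex_t oi_lock = PTHREAD_MUTEX_INITIALIZER;
    static SaImmOiHandleT oi_handle;

    SaAisErrorT guarded_implementer_set(const SaImmOiImplementerNameT name)
    {
            SaAisErrorT rc;

            pthread_mutex_lock(&oi_lock);   /* no other thread may touch the handle now */
            rc = saImmOiImplementerSet(oi_handle, name);
            pthread_mutex_unlock(&oi_lock);
            return rc;
    }

    SaAisErrorT guarded_reinit(const SaImmOiCallbacksT_2 *callbacks, SaVersionT *version)
    {
            SaAisErrorT rc;

            pthread_mutex_lock(&oi_lock);   /* finalize + initialize appear atomic to other threads */
            (void)saImmOiFinalize(oi_handle);
            rc = saImmOiInitialize_2(&oi_handle, callbacks, version);
            pthread_mutex_unlock(&oi_lock);
            return rc;
    }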

     
    • Anders Bjornerstedt

      Good analysis.

      We can add that the reason the AMFD got BAD_HANDLE when attempting to do an RtObjectUpdate is that,
      although the OI handle is valid, it was not associated with any implementer-name at that time.
      So this must be a pure and plain bug in the AMFD. Most likely recently introduced, since otherwise we should
      have seen this before.

      The AMFD interprets the BAD_HANDLE as the only "expected" reason for BAD_HANDLE: that the handle is invalid
      due to the IMMND having restarted. By "expected" I don't mean common, I mean the only reason the AMFD programmer
      has to re-initialize the handle: a restart of the local IMMND. But that is not what is happening here.
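
      A minimal sketch of making that distinction explicit (hypothetical names, not the AMFD code): track whether implementer-set has succeeded on the handle, and only treat BAD_HANDLE as grounds for re-initialization in that case.

      /* Tell the two BAD_HANDLE causes apart:
       *   implementer was set       -> the handle really died (IMMND restart), re-initialize
       *   implementer was never set -> premature use of the handle, i.e. a client bug */
      #include <assert.h>
      #include <stdbool.h>
      #include <saImmOi.h>

      static bool implementer_is_set = false; /* set to true where saImmOiImplementerSet succeeds */
      static bool reinit_requested = false;   /* picked up by a (hypothetical) background re-init thread */

      static void handle_oi_bad_handle(SaAisErrorT rc)
      {
              if (rc != SA_AIS_ERR_BAD_HANDLE)
                      return;

              if (implementer_is_set)
                      reinit_requested = true; /* the only "expected" BAD_HANDLE: local IMMND restart */
              else
                      assert(!"OI call made before implementer-set: interface violation");
      }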

      The SAF spec actually explicitly says that BAD_HANDLE is to be used for this particular case (implementer-name not set
      when trying to perform an OI operation). While this is not wrong, it would be better in this case to use ERR_BAD_OPERATION,
      since that is an error that is unambiguously a client error, and the spec on that error code for this downcall also fits the case:

      ERR_BAD_OPERATION - The targeted object is not implemented by the invoking process.

      So I think we should write an enhancement on the immsv to change the error code for this case.
      It will also be a backwards compatible change. We are talking an interface violation here and that
      should be handled by the process aborting.

      We should also write a defect ticket on AMFD, or use this ticket, to track a fix for the AMFD bug - premature
      use of an OI handle. This ticket also points to an independent CLM bug. So we should probably have two
      tickets.

      /AndersBj



      • Anders Bjornerstedt

        Neel correctly points out that #946 may fix the CLM problem.
        This is true if the ERR_EXIST on implementer-set is due to the prior OI having detached locally at this node
        but not yet been confirmed globally over fevs. But since this is a switchover, it is more likely that the OI detached
        on the other SC and that the AMF is faster in processing the quiesced-ack from the old active CLMD and ordering the CLMD
        at this node to become the new active (and thus allocate the OI).

        So I think getting ERR_EXIST at the new active's OI implementer-set may unfortunately be a fact of life for the switch-over case.
        The only fix I see here is that the new active knows that this is (or could be) a switchover, since it has just been
        ordered to become active. It could in this context interpret ERR_EXIST from implementer-set as effectively TRY_AGAIN.

        Perhaps even simpler: Director services could always interpret ERR_EXIST on implementer-set as TRY_AGAIN.
        As always, a TRY_AGAIN loop must be finite. And implementer-set is not blocked by imm-sync, so we are not
        talking 60 seconds here. At most we are talking sub-second, the fevs turnaround latency.
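
        A minimal sketch of such a bounded retry (the helper name and retry constants are assumptions, not existing CLM/AMFD code):

        /* Treat SA_AIS_ERR_EXIST from saImmOiImplementerSet like TRY_AGAIN for a
         * bounded number of attempts, since during a switchover the old OI may
         * not yet have been globally discarded over fevs. */
        #include <unistd.h>
        #include <saImmOi.h>

        #define IMPL_SET_MAX_RETRIES 100    /* bounded: fevs turnaround is sub-second */
        #define IMPL_SET_RETRY_USLEEP 50000 /* 50 ms between attempts */

        SaAisErrorT implementer_set_with_retry(SaImmOiHandleT oi_handle,
                                               const SaImmOiImplementerNameT name)
        {
                SaAisErrorT rc = SA_AIS_ERR_EXIST;

                for (int i = 0; i < IMPL_SET_MAX_RETRIES; i++) {
                        rc = saImmOiImplementerSet(oi_handle, name);
                        if (rc != SA_AIS_ERR_EXIST && rc != SA_AIS_ERR_TRY_AGAIN)
                                break;  /* success or a real error */
                        usleep(IMPL_SET_RETRY_USLEEP);
                }
                return rc;              /* still ERR_EXIST after the loop => escalate as before */
        }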

        /AndersBj



    • Sirisha Alla - 2014-09-11

      Neel,

      I have seen another issue after a switchover. This time CLM failed with
      BAD_HANDLE on classImplSet(). Is this issue also resolved by #946 or
      should we open a new ticket for it?

      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally disconnected. Marking it as doomed 26 <5, 2020f> (safClmService)
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmpbed: IN Delaying class delete at slave PBE due to ongoing commit of ccb:15/21
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer disconnected 26 <5, 2020f> (safClmService)
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer connected: 30 (safClmService) <324, 2020f>
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafclmd[9606]: ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally disconnected. Marking it as doomed 30 <324, 2020f> (safClmService)
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went down so no response
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went down so no response
      Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer disconnected 30 <324, 2020f> (safClmService)
      Sep 11 14:58:24 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60

      clmd traces on that node

      Sep 11 14:58:24.715267 osafclmd [9606:clms_imm.c:0530] TR Implementer not yet set: Switching on the tryagain flag
      Sep 11 14:58:24.715296 osafclmd [9606:clms_imm.c:0533] << clms_node_update_rattr
      Sep 11 14:58:24.715319 osafclmd [9606:clms_imm.c:0449] >> clms_admin_state_update_rattr: Admin state 1 update for node safNode=SC-1,safCluster=myClmCluster
      Sep 11 14:58:24.715343 osafclmd [9606:clms_util.c:0036] >> clms_node_get_by_name: name input safNode=SC-1,safCluster=myClmCluster length 36
      Sep 11 14:58:24.715369 osafclmd [9606:clms_util.c:0044] TR nodename after patricia tree get safNode=SC-1,safCluster=myClmCluster
      Sep 11 14:58:24.715393 osafclmd [9606:clms_util.c:0047] << clms_node_get_by_name
      Sep 11 14:58:24.715421 osafclmd [9606:imma_oi_api.c:2275] T2 ERR_BAD_HANDLE: No initialized handle exists!
      Sep 11 14:58:24.715452 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=SC-1,safCluster=myClmCluster failed with rc = 9. Reinit with IMM
      Sep 11 14:58:24.715480 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
      Sep 11 14:58:24.715505 osafclmd [9606:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!

      ....

      Sep 11 14:58:24.719788 osafclmd [9606:clms_imm.c:0721] TR Implementer is not set. Switching on flag in clms_cluster_update_rattr
      Sep 11 14:58:24.719801 osafclmd [9606:clms_imm.c:0724] << clms_cluster_update_rattr
      Sep 11 14:58:24.719815 osafclmd [9606:clms_imm.c:0633] << clms_retry_pending_rtupdates
      Sep 11 14:58:24.719828 osafclmd [9606:clms_main.c:0435] TR There is an IMM task to be tried again. setting poll time out to 500

      .....
      Sep 11 14:58:24.740335 osafclmd [9606:clms_imm.c:0620] << clms_retry_pending_rtupdates: Implementerset yet to happen, try later
      Sep 11 14:58:24.740345 osafclmd [9606:clms_main.c:0435] TR There is an IMM task to be tried again. setting poll time out to 500
      Sep 11 14:58:24.744013 osafclmd [9606:imma_oi_api.c:1636] T2 ERR_BAD_HANDLE: No implementer is set for this handle
      Sep 11 14:58:24.744055 osafclmd [9606:clms_imm.c:2291] ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting

      Regards,
      Sirisha

      On Wednesday 10 September 2014 09:28 PM, Neelakanta Reddy wrote:

      A. SLOT1 node went down:

      1. CLM got BAD_HANDLE and finalizes the handle

      Sep 10 14:56:51.543332 osafclmd [7511:imma_oi_api.c:0622] >> saImmOiFinalize
      Sep 10 14:56:51.543370 osafclmd [7511:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!

      1. Discard implementer is called

      Sep 10 14:56:51.538179 osafimmnd [7448:immsv_evt.c:5363] T8 Sending: IMMD_EVT_ND2D_DISCARD_IMPL to 0
      Sep 10 14:56:51.539878 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
      Sep 10 14:56:51.539994 osafimmnd [7448:ImmModel.cc:11510] NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
      Sep 10 14:56:51.540181 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer

      1. But the implemnter actually got disconnected at

      Sep 10 14:56:51.580449 osafimmnd [7448:immnd_evt.c:8588] T2 Global discard implementer for id:190
      Sep 10 14:56:51.580462 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
      Sep 10 14:56:51.580496 osafimmnd [7448:ImmModel.cc:11481] NO Implementer disconnected 190 <17, 2010f> (safClmService)
      Sep 10 14:56:51.580518 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer

      1. CLM tries to re-initializes and receives ERR_EXISTS

      Sep 10 14:56:51.551900 osafclmd [7511:imma_oi_api.c:0440] << saImmOiSelectionObjectGet
      Sep 10 14:56:51.551942 osafclmd [7511:clms_imm.c:2286] ER saImmOiImplementerSet failed rc:14, exiting
      Sep 10 14:59:51.245982 osafclmd [2538:clms_main.c:0267] >> clms_init

      Sep 10 14:56:51.548981 osafimmnd [7448:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2010f
      Sep 10 14:56:51.549023 osafimmnd [7448:immnd_evt.c:2471] T2 SENDRSP FAIL 14

      946 fixes the above problem in CLM

      B. Slot2 node went down(Quiesced --> Active)

      1. Sep 10 14:56:57.681152 osafamfd [6896:role.cc:0375] NO FAILOVER Quiesced --> Active

      2. saImmOiRtObjectUpdate_2 got BAD_HANDLE so AMFD tries to re-initialize with IMM and calls avd_imm_reinit_bg

      Sep 10 14:56:57.701333 osafamfd [6896:imma_oi_api.c:2279] >> saImmOiRtObjectUpdate_2
      Sep 10 14:56:57.701344 osafamfd [6896:imma_oi_api.c:2345] T2 ERR_BAD_HANDLE: The SaImmOiHandleT is not associated with any implementer name
      Sep 10 14:56:57.701353 osafamfd [6896:imma_oi_api.c:2554] << saImmOiRtObjectUpdate_2
      Sep 10 14:56:57.701362 osafamfd [6896:imm.cc:0164] TR BADHANDLE
      Sep 10 14:56:57.701370 osafamfd [6896:imm.cc:1660] >> avd_imm_reinit_bg
      Sep 10 14:56:57.701406 osafamfd [6896:imm.cc:1662] NO Re-initializing with IMM
      Sep 10 14:56:57.701420 osafamfd [6896:imma_oi_api.c:0622] >> saImmOiFinalize

      1. Before the finalize is not completed in clearing the OI handle, impl_set is called by AMFD in the function avd_role_failover_qsd_actv(calling avd_imm_impl_set_task_create). Because of this amfd exited.

      Sep 10 14:56:57.701178 osafamfd [6896:role.cc:0498] << avd_role_failover_qsd_actv

      Sep 10 14:56:57.702256 osafamfd [6896:imm.cc:1215] >> avd_imm_impl_set
      Sep 10 14:56:57.702273 osafamfd [6896:imma_oi_api.c:1281] T4 ERR_LIBRARY: Overlapping use of IMM OI handle by multiple threads
      Sep 10 14:56:57.703683 osafamfd [6896:imm.cc:1218] ER saImmOiImplementerSet failed 2
      Sep 10 14:56:57.703788 osafamfd [6896:imm.cc:1288] ER exiting since avd_imm_impl_set failed

      Because of using shared Oihandle, across multiple threads in AMFD the saImmOiImplementerSet failed with ERR_LIBRARY.


      [tickets:#1059] 2PBE: cluster reset observed during switchovers

      Status: unassigned
      Milestone: 4.3.3
      Created: Wed Sep 10, 2014 09:57 AM UTC by Sirisha Alla
      Last Updated: Wed Sep 10, 2014 10:29 AM UTC
      Owner: nobody

      The issue is seen on SLES X86. OpenSAF is running with changeset 5697 with 2PBE with 50k application objects.

      Switchovers with IMM application running is in progress when the issue is observed.

      Syslog on SC-1:

      Sep 10 14:56:47 SLES-64BIT-SLOT1 osafamfnd[7540]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
      Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 182 <0, 2020f> (@OpenSafImmReplicatorB)
      Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000063/4294967395
      Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:100000063/4294967395
      Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for Ccb:6/6
      Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer (applier) connected: 193 (@OpenSafImmReplicatorB) <0, 2020f>
      Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:6/6
      Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on attempt to start prepare of ccb:6/6
      Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE commit for PRTA update Ccb:100000064/4294967396
      Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Ccb 6 COMMITTED (SetUp_Ccb)
      Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
      Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
      Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
      Sep 10 14:56:50 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) replied with transient error on prepare for ccb:100000064/4294967396
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA Start prepare for ccb: 100000064/4294967396 towards slave PBE returned: '6' from sttandby PBE
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000064/4294967396 towards PBE-B
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000064)
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafclmd[7511]: ER saImmOiImplementerSet failed rc:14, exiting
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
      Sep 10 14:56:51 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
      Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 190 <17, 2010f> (safClmService)
      Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: ER PBE PRTAttrs Update continuation missing! invoc:100
      Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 16 <0, 2020f> (@OpenSafImmPBE)
      Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 17 <0, 2020f> (OsafImmPbeRt_B)
      Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: WA Timeout on syncronous admin operation 108
      Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmpbed: WA Failed to delete class towards slave PBE. Library or immsv replied Rc:5 - ignoring

      Syslog on SC-2

      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER Failed to stop cluster tracking 5
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ClmTrack stop failed
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Current role: ACTIVE
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: WA IMMND DOWN on active controller f1 detected at standby immd!! f2. Possible failover
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to QUIESCED
      .....
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Controller Failover: Setting role to ACTIVE
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to ACTIVE
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ACTIVE request
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osaflogd[6850]: NO ACTIVE request
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfd[6863]: NO ACTIVE request
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafclmd[6877]: NO ACTIVE request
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ellect_coord invoke from rda_callback ACTIVE
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO New coord elected, resides at 2020f
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO 2PBE configured, IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO This IMMND is now the NEW Coord
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:18
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Node 'SC-1' left the cluster
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7701]: NO exiting on signal 15
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA Global PURGE PERSISTENT RTO mutations received in epoch 18
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA PBE failed in persistification of class delete fWiTRoVwpQDAfWNBVtqJ
      ......
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO FAILOVER Quiesced --> Active
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ncs_mbcsv_svc NCS_MBCSV_OP_CHG_ROLE 1 failed
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer connected: 200 (MsgQueueService131343) <410, 2020f>
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer locally disconnected. Marking it as doomed 200 <410, 2020f> (MsgQueueService131343)
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer (applier) connected: 201 (@OpenSafImmReplicatorA) <412, 2020f>
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7722]: NO Started
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 184 <10, 2020f> (safAmfService)
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Re-initializing with IMM
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 200 <410, 2020f> (MsgQueueService131343)
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER saImmOiImplementerSet failed 2
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER exiting since avd_imm_impl_set failed
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: ER AMF director unexpectedly crashed
      Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
      Sep 10 14:56:57 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; timeout=60

      Syslog and traces are attached.



  • Anders Bjornerstedt

    • summary: 2PBE: cluster reset observed during switchovers --> CLM: ERR_EXIST from implementer-set needs to be handled as limited TRY_AGAIN.
     
  • Anders Bjornerstedt

    See mail discussion related to this ticket.
    At SC switchover, it can happen that:
    1) Old active CLMD quiesces at old active SC and acks to AMFD.
    2) This reaches the AMFD on the new active SC (via AMFD<->AMFD direct communication).
    3) The Implementer-release done by old active CLMD must pass over fevs.
    4) CLMD at new active SC is ordered to become active.
    5) CLMD at new active attempts to set implementer before the implementer
    release has reached new active SC over fevs.
    6) The implementer-set request is rejected by the local IMMND with
    ERR_EXIST (see the retry sketch below).
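
    A minimal sketch of the "limited TRY_AGAIN" handling suggested by the ticket
    title, using the standard SA Forum IMM OI API. The retry bound and back-off
    are illustrative assumptions, not actual CLMD settings:

        /* Treat ERR_EXIST from saImmOiImplementerSet as a limited TRY_AGAIN.
         * Retry count and back-off are assumptions for illustration only. */
        #include <unistd.h>
        #include <saAis.h>
        #include <saImmOi.h>

        static SaAisErrorT implementer_set_with_retry(SaImmOiHandleT oi_handle,
                                                      const SaImmOiImplementerNameT name)
        {
                SaAisErrorT rc;
                int retries = 0;
                const int max_retries = 30;          /* assumed bound */

                do {
                        rc = saImmOiImplementerSet(oi_handle, name);
                        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_EXIST)
                                break;               /* success or a real error */
                        usleep(500 * 1000);          /* assumed 500 ms back-off */
                } while (++retries < max_retries);

                return rc;
        }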

     
    • Neelakanta Reddy

      Point (6) above will not be reproduced after #946.

      CLM handles ERR_EXIST and TRY_AGAIN during role change (switchover/failover).

       
      • Anders Bjornerstedt

        Neelakanta Reddy wrote:

        Point (6) above will not be reproduced after #946.

        CLM handles ERR_EXIST and TRY_AGAIN during role change (switchover/failover).

        I don't quite follow you here.
        CLMD is exiting on ERR_EXIST (error 14) in this incident, so it is not
        handling ERR_EXIST.

        You may be correct that #946 will prevent CLMD from getting an ERR_EXIST
        in this particular scenario.
        It will get an ERR_TRY_AGAIN from the #946 fix, but ONLY IF the
        implementer was released locally,
        i.e. by the same CLMD that is now trying to set it again.

        Maybe that is the case here: this CLMD has done implementer-set
        successfully, then released it as part of the switchover, but then has
        to set it again very soon and is forced to wait on its own release
        (once the #946 fix is in place).

        What troubles me is that for the more normal switchover case and for
        the failover case, the CLMD (and I am sure most other directors) are
        still vulnerable. In these cases the implementer is released on one SC
        but allocated at another SC, and #946 does not help there.

        The CLMD (and other directors) may then still get ERR_EXIST when trying
        to set the implementer. And since CLMD (and probably the other
        directors) don't handle ERR_EXIST as TRY_AGAIN, they will exit the same
        way that CLMD exits here.

        Or am I missing something?

        /AndersBj



  • Mathi Naickan

    Mathi Naickan - 2014-09-11

    F.Y.I
    The implementer-set operation (saImmOiImplementerSet) is done from within two functions, in two scenarios:

    • From within functionX, during initial role assignment or role changes (switchover/failover). FunctionX is coded to handle ERR_EXIST and TRY_AGAIN.

    • From within functionY, during handling of BAD_HANDLE, i.e. while re-initializing with IMM. FunctionY uses immutils and does not handle ERR_EXIST, because the IMM re-init is attempted only after saImmOiFinalize().

    Now, w.r.t. the sequence/scenario in this ticket,

    I think an 'ideal' thing would be:
    (a) The local IMM returns TRY_AGAIN to the OI instead of ERR_EXIST, because the OI has just cleared itself by calling saImmOiFinalize() and the local IMM knows about it. The internal processing (FEVS, etc.) is immaterial to the OI.

    OR

    (b) Block on saImmOiFinalize() and return to the caller only when the FEVS protocol completes.

    I have come to know that #946 is the same as (a). If that is not the case, and if either (a) or (b) involves complex fixes in IMM, we can change functionY to handle ERR_EXIST.
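
    A minimal sketch of the functionY-style re-init path described above
    (functionY is just the placeholder name used in this discussion; the retry
    bound and back-off are assumptions, not OpenSAF values):

        #include <unistd.h>
        #include <saAis.h>
        #include <saImmOi.h>

        /* On BAD_HANDLE: finalize, re-initialize and set the implementer again.
         * The release done by saImmOiFinalize may not yet have completed over
         * FEVS, so implementer-set can briefly return ERR_EXIST; treat it as a
         * limited TRY_AGAIN instead of exiting. */
        static SaAisErrorT reinit_with_imm(SaImmOiHandleT *oi_handle,
                                           const SaImmOiCallbacksT_2 *callbacks,
                                           SaVersionT *version,
                                           const SaImmOiImplementerNameT impl_name)
        {
                SaAisErrorT rc;
                int retries = 0;

                (void)saImmOiFinalize(*oi_handle);   /* releases handle and implementer */

                rc = saImmOiInitialize_2(oi_handle, callbacks, version);
                if (rc != SA_AIS_OK)
                        return rc;

                do {
                        rc = saImmOiImplementerSet(*oi_handle, impl_name);
                        if (rc != SA_AIS_ERR_EXIST && rc != SA_AIS_ERR_TRY_AGAIN)
                                break;
                        usleep(500 * 1000);          /* assumed 500 ms back-off */
                } while (++retries < 20);            /* assumed retry bound */

                return rc;
        }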

     
    • Anders Bjornerstedt

      Mathi Naickan wrote:

      F.Y.I
      The implementer-set operation (saImmOiImplementerSet) is done from within two functions, in two scenarios:

      • From within functionX, during initial role assignment or role changes (switchover/failover). FunctionX is coded to handle ERR_EXIST and TRY_AGAIN.

      • From within functionY, during handling of BAD_HANDLE, i.e. while re-initializing with IMM. FunctionY uses immutils and does not handle ERR_EXIST, because the IMM re-init is attempted only after saImmOiFinalize().

      Now, w.r.t the sequence/scenario in this ticket,

      I think an 'ideal' thing would be:
      (a) The local IMM returns TRY_AGAIN to the OI instead of ERR_EXIST, because the OI has just cleared itself by calling saImmOiFinalize() and the local IMM knows about it. The internal processing (FEVS, etc.) is immaterial to the OI.

      That case is fixed by the fix for #946.
      OR

      (b) Block on saImmOiFinalize() and return to the caller only when the FEVS protocol completes.

      I have come to know that #946 is the same as (a). If that is not the case, and if either (a) or (b) involves complex fixes in IMM, we can change functionY to handle ERR_EXIST.

      No, this case is fixed. And if you are saying (as I think you are) that
      ERR_EXIST is already handled by the CLMD, then there is no problem left
      to fix in CLMD related to the #1059 incident.

      But there is still a problem to fix in the AMFD related to #1059.
      This is the case where the AMFD tries to perform an RTO operation using
      a valid OI handle, but the handle is not associated with any
      implementer/implementer-name.

      My guess is that this sometimes happens (it happens in this incident)
      when there is a failed switchover. Probably the AMFD releases the
      implementer (but keeps the handle open and valid) in order to switch
      from OI to applier. But then the switchover fails and is reverted (?),
      and the old active AMFD becomes active again. Somewhere in this region
      the AMFD tries to update RT data, finds that it is "active" and does so
      prematurely, while it still has not set the implementer/implementer-name.

      That needs a ticket.
      Either a new ticket, or we can use this one since it is now "free" from
      the CLMD side of the incident.

      /AndersBj
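
      A minimal sketch of the AMFD-side guard suggested above: skip (or defer)
      the runtime update while the OI handle has no implementer set, instead of
      letting it fail. The avd_* name and the boolean flag are illustrative,
      not actual AMFD symbols:

          #include <stdbool.h>
          #include <saAis.h>
          #include <saImmOi.h>

          /* Only attempt the RTO update when the implementer is currently set;
           * otherwise report TRY_AGAIN so the caller can defer the update. */
          static SaAisErrorT avd_rt_update_guarded(SaImmOiHandleT oi_handle,
                                                   bool implementer_is_set,
                                                   const SaNameT *object_name,
                                                   const SaImmAttrModificationT_2 **mods)
          {
                  if (!implementer_is_set)
                          return SA_AIS_ERR_TRY_AGAIN;

                  return saImmOiRtObjectUpdate_2(oi_handle, object_name, mods);
          }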



  • Anders Bjornerstedt

    • status: unassigned --> duplicate
     
  • Anders Bjornerstedt

    The CLMD terminated due to a problem fixed by ticket #946.
    The AMFD terminated due to the same problem as described in #707.

     
  • Mathi Naickan

    Mathi Naickan - 2014-09-11

    Neel,
    Do we have an explanation for why we are returning BAD_OP (20) for the RT object update here?

    Sep 11 14:58:24.709606 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=PL-4,safCluster=myClmCluster failed with rc = 20. Reinit with IMM

    Sirisha,
    Does this test (of switchovers) also involve CLM admin operations?

     
    • Sirisha Alla

      Sirisha Alla - 2014-09-12

      Mathi,
      The test does not involve any CLM admin operations.




    • Neelakanta Reddy

      Mathi,

      1. The BAD_OPERATION is returned because of a problem when 2PBE is configured.
        The problem is the same as #1057.

      Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA update Ccb:100000030/4294967344 towards PBE-B
      Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update (ccbId:100000030)
      Sep 11 14:58:24 SLES-64BIT-SLOT1 osafimmnd[4276]: WA update of PERSISTENT runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20

      Additionally, on the CLM side:

      1. CLM tries to create multiple threads, one for each RT object update failure.

      2. All of the re-init threads share the same OI-handle variable for storing the OI handle.

      3. Re-init of one IMM handle is sufficient. Perhaps a state variable could be used so the main thread remembers whether a re-init thread has already been created (a sketch of such a guard follows below).

      /Neel
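
      A minimal sketch of the guard suggested in point 3 above, so that the
      main thread only spawns one re-initialization thread at a time. The
      clm_* names and the flag are illustrative, not actual CLMD symbols:

          #include <pthread.h>
          #include <stdbool.h>

          static bool reinit_in_progress = false;     /* protected by reinit_mutex */
          static pthread_mutex_t reinit_mutex = PTHREAD_MUTEX_INITIALIZER;

          static void *clm_reinit_thread(void *arg)
          {
                  (void)arg;
                  /* ... finalize, re-initialize and implementer-set would go here ... */

                  pthread_mutex_lock(&reinit_mutex);
                  reinit_in_progress = false;         /* allow a later re-init */
                  pthread_mutex_unlock(&reinit_mutex);
                  return NULL;
          }

          static void clm_start_reinit_if_needed(void)
          {
                  pthread_t tid;

                  pthread_mutex_lock(&reinit_mutex);
                  if (!reinit_in_progress &&
                      pthread_create(&tid, NULL, clm_reinit_thread, NULL) == 0) {
                          reinit_in_progress = true;
                          (void)pthread_detach(tid);
                  }
                  pthread_mutex_unlock(&reinit_mutex);
          }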

       
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.3.3 --> never
     
