OpenSAF / Tickets / #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

Sirisha Alla - 2015-03-30

This issue is reproducible with the following steps:

1) Reboot standby controller
2) When IMMND coordinator on Active Starts synching with the standby controller, kill clmna/amfnd on one of the payloads

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neelakanta Reddy - 2015-03-30

summary: IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync --> MDS: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neelakanta Reddy - 2015-03-30

IMMD is struck in MDS:

Mar 26 14:58:28.189765 osafimmd [9539:immd_mds.c:0750] >> immd_mds_bcast_send
Mar 26 14:58:28.189777 osafimmd [9539:immsv_evt.c:5486] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neelakanta Reddy - 2015-03-30

Component: imm --> mds
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

A V Mahesh (AVM) - 2015-03-31

Component: mds --> imm
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

A V Mahesh (AVM) - 2015-03-31

Hi,

When A node sync is in-progress , the sync message are being send in very high number/frequency , at that movement , any other node/immnd in the cluster restarts ( TIPC Link down) , the send buffer are getting filled up to Max
with-in TIPC link tolerance of 1.5 sec time , so sednto() at IMMD is getting blocked and we are hitting 16 FEVS Replies pending.

So Just increasing increasing SA_AIS_ERR_TRY_AGAIN sleep time of saImmOmSearchNext_2() in imm_loader above TIPC link tolerance of 1.5 sec resolved the issue.

I did this changes in changeset: 5578, but I am not able to see these chase now in the latest code

changeset: 5578:80f9493def7e
user: A V Mahesh mahesh.valla@oracle.com
date: Thu Aug 14 14:45:05 2014 +0530
summary: imm: tune imm macros according to mds mcast size [#851]

diff --git a/osaf/services/saf/immsv/immloadd/imm_loader.cc b/osaf/services/saf/immsv/immloadd/imm_loader.cc
--- a/osaf/services/saf/immsv/immloadd/imm_loader.cc
+++ b/osaf/services/saf/immsv/immloadd/imm_loader.cc
@@ -2452,7 +2452,13 @@ int syncObjectsOfClass(std::string class
do
{
if(retries) {
- usleep(150000);
+ / If we receive TRY_AGAIN while sync in progress means
+ IMMD might have been reached IMMSV_DEFAULT_FEVS_MAX_PENDING fevs_replies_pending.
+ In general fevs_replies_pending will be hit in the case of the messages have accumulated in the sender queue
+ (The most possible reason will be receiver disconnected but the sender link is in TIPC link tolerance of 1.5 sec)
+ So give enough time to recover as if sync is not a priority messages and possibility of hitting this case because of multicast messaging.
+ /
+ sleep(2);
}
/ Synchronous for throttling sync /
err = saImmOmSearchNext_2(searchHandle, &objectName, &attributes);

Please check these tow tickets for more details.

https://sourceforge.net/p/opensaf/tickets/1172/
https://sourceforge.net/p/opensaf/tickets/952/

-AVM

Related

Tickets: ~~#851~~

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neelakanta Reddy - 2015-03-31

Hi,

Having sync always wait for two seconds causes a major performance reduction
for sync on some configurations. Because of this the patch is reverted #1188.
In #1188, retry for each TRY_AGIN is increased exponentially from 10 msec to
0.5 sec and the number of TRY_AGAIN are 320.

In the present problem scenario, there is no link-loss and TIPC tolerance
time may not be a case but the MDS send/recive buffer got filled because of
multicast.

The following adjustment are made as part of the changeset 5578.

+/ Adjust to 90% of MDS_DIRECT_BUF_MAXSIZE /
+#define IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE ((MDS_DIRECT_BUF_MAXSIZE / 100) * 90)
+#define IMMSV_MAX_OBJS_IN_SYNCBATCH (IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE / 10)

since the MAX_SYNC_BATCH_SIZE is fixed, and some machines may or may not have larger
send and recevier buffer sizes. MAX_SYNC_BATCH_SIZE needs to be more fine tuned when
multicast is used.

/Neel.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

A V Mahesh (AVM) - 2015-03-31

Neel,

In the present problem scenario, there is no link-loss and TIPC tolerance
time may not be a case but the MDS send/recive buffer got filled because of
multicast.

According to tester`s provided reproducible steps,
the link-loss and TIPC tolerance are in case.

1) Reboot standby controller
2) When IMMND coordinator on Active Starts synching with the standby controller, kill clmna/amfnd on one of the payloads

The kill of amfnd on one of the payloads will raise link-loss and TIPC tolerance

If sender & receiver are in good condition or the receiver dest down is detected by sender buffer will never be filled to maximum because of multicast,
we will run out of buffers If the reviver is went down and the sender doesn't acknowledged the down.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

A V Mahesh (AVM) - 2015-03-31

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathi Naickan - 2015-03-31

Milestone: 4.6.RC1 --> 4.7-Tentative
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Srikanth R - 2015-04-02

IMMD health check timeout issue is also observed in #1301

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-04-09

Defects must have milestone set to the oldest supported branch where
the problem is expected to be possible to occur.

I belive the latest sync batch optimizations where done in 4.5.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-04-09

Milestone: 4.7-Tentative --> 4.5.1
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-04-09

A simple solution for this ticket could perhaps be to just increase the
healthcheck callback timeout for IMMD.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-04-09

I think the problem is triggered by the IMM sync optimizations done ind
4.5 (?) by Oracle. The fix may reside in tweaking the sync optimixzation,
or by some adjustment in MDS, or by increasing the AMF healthcheck
callback timeout for IMMD.

Because AMF is involved, MDS is involved, IMM sync optimizations done by Oracle is involved, the problem has only been reproduced at Oracle, I
think it is best that this ticket is handled at Oracle.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-04-09

summary: MDS: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync --> IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

status: unassigned --> assigned

assigned_to: Neelakanta Reddy
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathi Naickan - 2015-04-10

Milestone: 4.5.1 --> 4.5.2
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mathi Naickan - 2015-04-10

Release management- 4.5.1 tag is applied. Moving to 4.5.2

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-08-14

I suggest we close this defect ticket as "not reproducible".
That is unless someone can reproduce it.
My best guess is that this is yet another side effect of testing an overloaded system.

Since we have no load regulation mechanism and no overload protection mechanism,
it is relatively easy to push the system until it starts to break down. This is what is hapening here.
The IMMND misses a timeloy response on a helathchek heartbeat.
Such a heartbeat timeout I expect (hope) is set to 2 or 3 minutes.

The only reason for the healthcheck existence is to detect and restart a hung/looping process.
If a process is hunbg or looping it will be so indefinitely. So we dont want false positives shooting
down the service just because the system is temporarily overloaded. This just adds more load.

If a process has not repsonded in a few minutes then we really assume it is hung.
The assumption here is also that a really hung process is a very rare kind of problem.
This assumption is correct relative tho the IMMND because the IMMND only blocks
on calls to MDS and (ironically) on some syncronous AMF calls.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-08-18

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-08-18

I propose that we increase the "saAmfHctDefMaxDuration" value from 3 minutes to 5 minutes in:

safHealthcheckKey=Default,safVersion=4.0.0,safCompType=OpenSafCompTypeIMMND

This is the only thing that can be done in the IMMSv.

The other alternatives are:
(1)
Place the ticket on MDS (since IMMND is could only be blocked on MDS asynchronous send).
Yes asynchronous send. It myst be bocked in the MDS library processing of packing a huge
message /(sync buffer ?) for asynchronous send. Stuck on MDS for 3 minutes.

(2) Clöose the ticket.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-08-19

status: assigned --> not-reproducible

Component: imm --> mds

Milestone: 4.5.2 --> never
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anders Bjornerstedt - 2015-08-19

Closing this ticket as not reproducible.
The original analysis for this ticket focused on imm-sync. But the problem is not with sync
or with fevs flow control, that part is working and actually slowing down the sync process in its
push.

The question is rather why the local IMMND hangs for more than 3 minutes?
The only posibility that makes sense is that IMMND is blocked on one MDS request which
is "asyncronous". This sounds proposterous, but it is the only possibility that I can see.
This is also why I am changing the component to MDS.

I am closing this major ticket as not reproducible since it has not been reproduced in the 5 months that the ticket has existed.

If someone reproduces it then try to get a core-dump of the IMMND process. The backtrace
should show where the process is blocked.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sirisha Alla - 2015-08-19

This issue is reproduced on changeset 6744. Syslog as follows:

Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
......
Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
.....
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

In the above logs, is this the reason for IMMND hanging for 3 minutes?

Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anders Bjornerstedt - 2015-08-19
  
  Ok but then the question simply becomes why does the healthcheck callback not reach the IMMND or why does the IMMND reply
  not reach the AMFND ?
  
  /AndersBj
  
  From: Sirisha Alla [mailto:allas@users.sf.net]
  Sent: den 19 augusti 2015 10:50
  To: [opensaf:tickets]
  Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
  
  This issue is reproduced on changeset 6744. Syslog as follows:
  
  Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
  Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
  Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
  Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
  ......
  Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
  Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
  Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
  Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
  Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
  Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
  Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
  Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
  Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
  Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
  Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
  Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
  Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
  Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
  Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
  Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
  Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
  Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
  Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
  Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
  Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
  Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
  Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
  Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
  Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
  Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
  Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
  Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
  Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
  .....
  Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
  Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
  Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
  Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
  Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
  Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
  Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
  Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
  
  In the above logs, is this the reason for IMMND hanging for 3 minutes?
  
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
  Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
  
  [tickets:#1291]http://sourceforge.net/p/opensaf/tickets/1291/ IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
  
  Status: not-reproducible
  Milestone: never
  Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
  Last Updated: Wed Aug 19, 2015 08:40 AM UTC
  Owner: Neelakanta Reddy
  Attachments:
  
  immlogs.tar.bz2https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2 (6.8 MB; application/x-bzip)
  
  The issue is observed with 4.6 FC changeset 6377. The system is up and running with single pbe and 50k objects. This issue is seen after http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM application is running on standby controller and immcfg command is run from payload to set CompRestartMax value to 1000. IMMND is killed twice on standby controller leading to #1290.
  
  As a result, standby controller left the cluster in middle of sync, IMMD reported healthcheck callback timeout and the active controller too went for reboot. Following is the syslog of SC-1:
  
  Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 2020f:
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
  Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
  Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
  Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with <1.1.2>
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
  Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
  Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
  Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
  Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
  Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=2197', processed='center(received)=1172', processed='destination(messages)=1172', processed='destination(mailinfo)=0', processed='destination(mailwarn)=0', processed='destination(localmessages)=955', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=44', processed='destination(console)=13', processed='destination(null)=0', processed='destination(mail)=0', processed='destination(xconsole)=13', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=1172'
  Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
  Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED status:1
  Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
  Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
  Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
  Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting ABORT_SYNC, epoch:12
  Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing with ccbId:100000054/4294967380
  Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation timer started (timeout: 1200000000000 ns)
  Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
  Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
  Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
  Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
  Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
  Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
  
  syslog, immnd and immd traces of SC-1 attached.
  
  Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/opensaf/tickets/1291/
  
  To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/
  
  Related
  
  Tickets: ~~#1291~~
  
  alternate
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

IMM: IMMD healthcheck callback timeout when standby controller rebooted in...

Milestone

Searches

Help

#1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

Related

Discussion

Related

Related