
#1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

never
invalid
nobody
None
defect
mds
d
4.6 FC
major
2015-11-02
2015-03-30
No

The issue is observed with 4.6 FC changeset 6377. The system is up and running with a single PBE and 50k objects. This issue is seen after http://sourceforge.net/p/opensaf/tickets/1290 is observed. An IMM application is running on the standby controller, and the immcfg command is run from a payload to set the CompRestartMax value to 1000. IMMND is killed twice on the standby controller, leading to #1290.
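
(For reference, a modification like the one described above would be done with an immcfg attribute-modify command along the lines of:

immcfg -a saAmfSGCompRestartMax=1000 <object-DN>

The exact attribute name and object DN used in the test are not recorded in this ticket; the line above only illustrates the immcfg modify syntax, under the assumption that the attribute in question is the SG-level saAmfSGCompRestartMax.)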

As a result, the standby controller left the cluster in the middle of the sync, IMMD reported a healthcheck callback timeout, and the active controller also went for a reboot. Following is the syslog of SC-1:

Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with <1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=2197', processed='center(received)=1172', processed='destination(messages)=1172', processed='destination(mailinfo)=0', processed='destination(mailwarn)=0', processed='destination(localmessages)=955', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=44', processed='destination(console)=13', processed='destination(null)=0', processed='destination(mail)=0', processed='destination(xconsole)=13', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60

The syslog, IMMND and IMMD traces of SC-1 are attached.

1 Attachment

• immlogs.tar.bz2 (6.8 MB; application/x-bzip): https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2

Related

Tickets: #1291

Discussion

  • Anders Bjornerstedt

    Changeset "6744" was generated today,
    so I assume this means you reproduced this today.

    The IMMND main poll handling processes each descriptor in sequence, so it should not be possible for traffic on one descriptor to "starve out" a job on another.

    /AndersBj
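
    (Illustrative sketch, not the actual IMMND code: the sequential handling described above can be pictured as a plain poll() loop in which every descriptor that reports activity is serviced before poll() is called again, so a busy descriptor cannot keep another ready descriptor from being handled. The handler below is made up for the illustration.)

    #include <poll.h>
    #include <unistd.h>

    /* Hypothetical per-descriptor handler: drain whatever is pending on fd. */
    static void handle_fd(int idx, int fd)
    {
        char buf[256];
        (void)idx;
        (void)read(fd, buf, sizeof(buf));   /* placeholder for real dispatch */
    }

    static void main_loop(struct pollfd *fds, int nfds)
    {
        for (;;) {
            if (poll(fds, nfds, -1) <= 0)
                continue;
            /* Every descriptor that is ready in this round is serviced,
             * one after another, before poll() is called again, so traffic
             * on one descriptor cannot starve out a job on another. */
            for (int i = 0; i < nfds; i++)
                if (fds[i].revents & POLLIN)
                    handle_fd(i, fds[i].fd);
        }
    }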

    From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
    Sent: 19 August 2015 10:54
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

    OK, but then the question simply becomes: why does the healthcheck callback not reach the IMMND, or why does the IMMND reply not reach the AMFND?

    /AndersBj

    From: Sirisha Alla [mailto:allas@users.sf.net]
    Sent: 19 August 2015 10:50
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

    This issue is reproduced on changeset 6744. Syslog as follows:

    Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
    Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
    Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
    ......
    Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
    Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    .....
    Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

    In the above logs, is this the reason for IMMND hanging for 3 minutes?

    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
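
    (Editor's sketch: the "sending SIGTERM" line above indicates that the IMMND concluded the PBE child process was stuck on runtime data handling and terminated it. A minimal illustration of that kind of watchdog check, assuming a simple timeout rule and not taken from the OpenSAF sources, would be:)

    #include <signal.h>
    #include <sys/types.h>
    #include <time.h>

    /* Hypothetical watchdog: if the PBE child has not made progress on its
     * outstanding runtime-data work within 'limit' seconds, assume it is
     * stuck and ask it to terminate (compare the syslog lines above). */
    static void check_pbe(pid_t pbe_pid, time_t last_progress, time_t limit)
    {
        if (time(NULL) - last_progress > limit)
            kill(pbe_pid, SIGTERM);
    }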



    • Sirisha Alla

      Sirisha Alla - 2015-08-19

      Yes, I tried this today. The healthcheck timeout happened on IMMD, not on IMMND.

      /Sirisha


      • Anders Bjornerstedt

        Please reproduce with the IMMD trace on.

        /AndersBj

        From: Sirisha Alla [mailto:sirisha.alla@oracle.com]
        Sent: den 19 augusti 2015 11:07
        To: [opensaf:tickets]; opensaf-tickets@lists.sourceforge.net
        Subject: Re: [tickets] [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

        Yes, I tried this today. The healthcheck timeout happened on IMMD not on IMMND.

        /Sirisha

        On Wednesday 19 August 2015 02:28 PM, Anders Bjornerstedt wrote:

        Changeset "6744" is generated today.
        So I assume this means you reproduced this today.

        The IMMND main poll handling processes in sequence on each descriptor, so it should not be possible
        For traffic on one descriptor to "starve out" a job on another.

        /AndersBj

        From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
        Sent: den 19 augusti 2015 10:54
        To: [opensaf:tickets]
        Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

        Ok but then the question simply becomes why does the healthcheck callback not reach the IMMND or why does the IMMND reply
        not reach the AMFND ?

        /AndersBj

        From: Sirisha Alla [mailto:allas@users.sf.net]
        Sent: den 19 augusti 2015 10:50
        To: [opensaf:tickets]
        Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

        This issue is reproduced on changeset 6744. Syslog as follows:

        Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
        Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
        Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
        Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
        ......
        Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
        Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
        Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
        Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
        Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
        Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
        Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
        Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
        Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
        .....
        Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
        Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

        In the above logs, is this the reason for IMMND hanging for 3 minutes?

        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
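
        For reference, a minimal illustration of the supervision pattern behind the
        "appears stuck ... sending SIGTERM" message quoted above: a parent process
        notices that its helper has not acknowledged work within a deadline and
        terminates it. The names and the timeout below are hypothetical; this is
        not the actual IMMND/PBE code.

        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>

        /* Assumed deadline, for illustration only. */
        #define STUCK_LIMIT_SEC 30

        struct helper {
            pid_t pid;       /* e.g. the persistent back end (PBE) child */
            time_t last_ack; /* last time it acknowledged runtime data handling */
        };

        /* Called periodically from the parent's main loop. */
        void check_helper(struct helper *h)
        {
            if (h->pid > 0 && time(NULL) - h->last_ack > STUCK_LIMIT_SEC) {
                /* Helper appears stuck: ask it to exit, as in the
                 * "sending SIGTERM" log line quoted above. The parent later
                 * detects the death and purges or restarts the helper. */
                kill(h->pid, SIGTERM);
                h->pid = -1;
            }
        }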



        • Sirisha Alla

          Sirisha Alla - 2015-08-19

          A snippet from the IMMD trace at the time of the test:

          Aug 19 12:00:18.707828 osafimmd [5958:immd_mbcsv.c:0463] T5 **ENC SYNC COUNT 1142
          Aug 19 12:00:18.707832 osafimmd [5958:immd_mbcsv.c:0482] T5 ENCODE IMMD_A2S_MSG_FEVS: send count: 52246 handle: 2005749858575
          Aug 19 12:00:18.707836 osafimmd [5958:immd_mbcsv.c:0605] << mbcsv_enc_async_update
          Aug 19 12:00:18.707840 osafimmd [5958:immd_mbcsv.c:0843] << immd_mbcsv_encode_proc
          Aug 19 12:00:18.707844 osafimmd [5958:immd_mbcsv.c:0428] T5 IMMD - MBCSv Callback Success
          Aug 19 12:00:18.707848 osafimmd [5958:immd_mbcsv.c:0429] << immd_mbcsv_callback
          Aug 19 12:00:18.707851 osafimmd [5958:mbcsv_util.c:0438] TR send the encoded message to any other peer with same s/w version
          Aug 19 12:00:18.707855 osafimmd [5958:mbcsv_util.c:0441] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
          Aug 19 12:00:18.707859 osafimmd [5958:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 42, pwe_hdl: 65549
          Aug 19 12:00:18.707863 osafimmd [5958:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:d
          Aug 19 12:00:18.707868 osafimmd [5958:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
          Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_mds.c:0244] << mbcsv_mds_send_msg: success
          Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_util.c:0492] << mbcsv_send_ckpt_data_to_all_peers
          Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
          Aug 19 12:00:18.708472 osafimmd [5958:immd_mbcsv.c:0063] << immd_mbcsv_sync_update
          Aug 19 12:00:18.708472 osafimmd [5958:immd_mds.c:0750] >> immd_mds_bcast_send
          Aug 19 12:00:18.708472 osafimmd [5958:immsv_evt.c:5400] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0
          Aug 19 12:05:59.014948 osafimmd [2456:immd_main.c:0111] >> immd_initialize
          Aug 19 12:05:59.036253 osafimmd [2456:ncs_main_pub.c:0223] TR NCS:PROCESS_ID=2456
          Aug 19 12:05:59.036310 osafimmd [2456:sysf_def.c:0090] TR INITIALIZING LEAP ENVIRONMENT

          I have shared the IMMD and IMMND traces with Neel. I will try to
          upload the IMMD traces to the ticket.

          Regards,
          Sirisha

          On Wednesday 19 August 2015 02:40 PM, Anders Bjornerstedt wrote:

          Please reproduce with the IMMD trace on.

          /AndersBj

          From: Sirisha Alla [mailto:sirisha.alla@oracle.com]
          Sent: 19 August 2015 11:07
          To: [opensaf:tickets]; opensaf-tickets@lists.sourceforge.net
          Subject: Re: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback
          timeout when standby controller rebooted in middle of IMMND sync

          Yes, I tried this today. The healthcheck timeout happened on IMMD,
          not on IMMND.

          /Sirisha

          On Wednesday 19 August 2015 02:28 PM, Anders Bjornerstedt wrote:

          Changeset "6744" is generated today.
          So I assume this means you reproduced this today.

          The IMMND main poll handling processes in sequence on each descriptor,
          so it should not be possible
          For traffic on one descriptor to "starve out" a job on another.

          /AndersBj
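
          A minimal sketch of the sequential dispatch described above, assuming
          a simplified main loop with hypothetical handler names (this is not
          the actual IMMND code):

          #include <poll.h>

          #define NUM_FDS 3

          /* Hypothetical per-descriptor handlers, for illustration only. */
          extern void handle_mds_msg(int fd);
          extern void handle_amf_event(int fd);
          extern void handle_timer(int fd);

          void main_loop(struct pollfd fds[NUM_FDS])
          {
              void (*handlers[NUM_FDS])(int) = {
                  handle_mds_msg, handle_amf_event, handle_timer
              };

              for (;;) {
                  if (poll(fds, NUM_FDS, -1) <= 0)
                      continue;

                  /* Every ready descriptor is serviced once per pass, in order;
                   * heavy traffic on fds[0] cannot postpone the handler for
                   * fds[1] beyond the current iteration. */
                  for (int i = 0; i < NUM_FDS; i++) {
                      if (fds[i].revents & POLLIN)
                          handlers[i](fds[i].fd);
                  }
              }
          }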

          From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
          Sent: 19 August 2015 10:54
          To: [opensaf:tickets]
          Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback
          timeout when standby controller rebooted in middle of IMMND sync

          OK, but then the question simply becomes: why does the healthcheck
          callback not reach the IMMND, or why does the IMMND reply not reach
          the AMFND? (A sketch of the expected round trip follows this message.)

          /AndersBj
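
          A minimal sketch of the healthcheck round trip in question, using the
          standard SA Forum AMF C API: AMF invokes the component's registered
          healthcheck callback, and the component must answer with
          saAmfResponse() before the callback timeout expires, otherwise AMF
          faults the component (here escalating to suFailover and node
          failfast). Illustrative only, not the IMMD implementation.

          #include <poll.h>
          #include <saAmf.h>

          static SaAmfHandleT amf_handle;

          /* Invoked by AMF; the reply must reach the AMF node director (AMFND)
           * before the healthcheck callback timeout, or the component is
           * faulted with 'healthCheckcallbackTimeout'. */
          static void healthcheck_cb(SaInvocationT invocation,
                                     const SaNameT *comp_name,
                                     SaAmfHealthcheckKeyT *key)
          {
              (void)comp_name;
              (void)key;
              saAmfResponse(amf_handle, invocation, SA_AIS_OK);
          }

          int component_main(void)
          {
              SaAmfCallbacksT callbacks = {0};
              SaVersionT version = {'B', 1, 1};
              SaSelectionObjectT sel_obj;
              struct pollfd pfd;

              callbacks.saAmfHealthcheckCallback = healthcheck_cb;

              if (saAmfInitialize(&amf_handle, &callbacks, &version) != SA_AIS_OK)
                  return 1;
              if (saAmfSelectionObjectGet(amf_handle, &sel_obj) != SA_AIS_OK)
                  return 1;

              pfd.fd = (int)sel_obj;
              pfd.events = POLLIN;

              /* The component polls the AMF selection object together with its
               * other descriptors; when it becomes readable, saAmfDispatch()
               * drives the pending callback (the healthcheck) to completion. */
              for (;;) {
                  if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                      saAmfDispatch(amf_handle, SA_DISPATCH_ONE);
              }
          }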



          [tickets:#1291]
          http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/
          IMM: IMMD healthcheck callback timeout when standby controller
          rebooted in middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:40 AM UTC
          Owner: Neelakanta Reddy
          Attachments:

          • immlogs.tar.bz2https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2
            (6.8 MB; application/x-bzip)

          The issue is observed with 4.6 FC changeset 6377. The system is up and
          running with single pbe and 50k objects. This issue is seen after
          http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM
          application is running on standby controller and immcfg command is run
          from payload to set CompRestartMax value to 1000. IMMND is killed
          twice on standby controller leading to #1290.

          As a result, standby controller left the cluster in middle of sync,
          IMMD reported healthcheck callback timeout and the active controller
          too went for reboot. Following is the syslog of SC-1:

          Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
          node id 2020f:
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
          NodeId = 131599 EE Name = , Reason: Received Node Down for peer
          controller, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
          Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
          link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
          contact with <1.1.2>
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
          the cluster
          Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
          in the absence of PLM is outside the scope of OpenSAF
          Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
          Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
          dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
          processed='center(queued)=2197', processed='center(received)=1172',
          processed='destination(messages)=1172',
          processed='destination(mailinfo)=0',
          processed='destination(mailwarn)=0',
          processed='destination(localmessages)=955',
          processed='destination(newserr)=0',
          processed='destination(mailerr)=0', processed='destination(netmgm)=0',
          processed='destination(warn)=44', processed='destination(console)=13',
          processed='destination(null)=0', processed='destination(mail)=0',
          processed='destination(xconsole)=13',
          processed='destination(firewall)=0', processed='destination(acpid)=0',
          processed='destination(newscrit)=0',
          processed='destination(newsnotice)=0', processed='source(src)=1172'
          Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
          on saImmOmSearchNext - aborting
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
          FAILED status:1
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
          IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
          IMM_NODE_FULLY_AVAILABLE (2484)
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
          in ImmModel
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
          broadcasting ABORT_SYNC, epoch:12
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
          committing with ccbId:100000054/4294967380
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
          probation timer started (timeout: 1200000000000 ns)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
          failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
          escalated from 'componentFailover' to 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
          'healthCheckcallbackTimeout' : Recovery is 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
          safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
          to:healthCheckcallbackTimeout Recovery is:suFailover
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
          NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
          node failfast, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
          timeout=60

          syslog, immnd and immd traces of SC-1 attached.


          Sent from sourceforge.net because you indicated interest in
          https://sourceforge.net/p/opensaf/tickets/1291/

          To unsubscribe from further messages, please visit
          https://sourceforge.net/auth/subscriptions/


          [tickets:#1291]
          http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/
          IMM: IMMD healthcheck callback timeout when standby controller
          rebooted in middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:49 AM UTC
          Owner: Neelakanta Reddy
          Attachments:

          • immlogs.tar.bz2http://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2
            (6.8 MB; application/x-bzip)

          The issue is observed with 4.6 FC changeset 6377. The system is up and
          running with single pbe and 50k objects. This issue is seen after
          http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM
          application is running on standby controller and immcfg command is run
          from payload to set CompRestartMax value to 1000. IMMND is killed
          twice on standby controller leading to #1290.

          As a result, standby controller left the cluster in middle of sync,
          IMMD reported healthcheck callback timeout and the active controller
          too went for reboot. Following is the syslog of SC-1:

          Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
          node id 2020f:
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
          NodeId = 131599 EE Name = , Reason: Received Node Down for peer
          controller, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
          Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
          link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
          contact with <1.1.2>
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
          the cluster
          Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
          in the absence of PLM is outside the scope of OpenSAF
          Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
          Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
          dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
          processed='center(queued)=2197', processed='center(received)=1172',
          processed='destination(messages)=1172',
          processed='destination(mailinfo)=0',
          processed='destination(mailwarn)=0',
          processed='destination(localmessages)=955',
          processed='destination(newserr)=0',
          processed='destination(mailerr)=0', processed='destination(netmgm)=0',
          processed='destination(warn)=44', processed='destination(console)=13',
          processed='destination(null)=0', processed='destination(mail)=0',
          processed='destination(xconsole)=13',
          processed='destination(firewall)=0', processed='destination(acpid)=0',
          processed='destination(newscrit)=0',
          processed='destination(newsnotice)=0', processed='source(src)=1172'
          Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
          on saImmOmSearchNext - aborting
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
          FAILED status:1
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
          IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
          IMM_NODE_FULLY_AVAILABLE (2484)
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
          in ImmModel
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
          broadcasting ABORT_SYNC, epoch:12
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
          committing with ccbId:100000054/4294967380
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
          probation timer started (timeout: 1200000000000 ns)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
          failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
          escalated from 'componentFailover' to 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
          'healthCheckcallbackTimeout' : Recovery is 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
          safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
          to:healthCheckcallbackTimeout Recovery is:suFailover
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
          NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
          node failfast, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
          timeout=60

          syslog, immnd and immd traces of SC-1 attached.


          Sent from sourceforge.net because you indicated interest in
          https://sourceforge.net/p/opensaf/tickets/1291/

          To unsubscribe from further messages, please visit
          https://sourceforge.net/auth/subscriptions/


          [tickets:#1291]
          http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/
          IMM: IMMD healthcheck callback timeout when standby controller
          rebooted in middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:49 AM UTC
          Owner: Neelakanta Reddy
          Attachments:

          • immlogs.tar.bz2http://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2
            (6.8 MB; application/x-bzip)

          The issue is observed with 4.6 FC changeset 6377. The system is up and
          running with single pbe and 50k objects. This issue is seen after
          http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM
          application is running on standby controller and immcfg command is run
          from payload to set CompRestartMax value to 1000. IMMND is killed
          twice on standby controller leading to #1290.

          As a result, standby controller left the cluster in middle of sync,
          IMMD reported healthcheck callback timeout and the active controller
          too went for reboot. Following is the syslog of SC-1:

          Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
          node id 2020f:
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
          NodeId = 131599 EE Name = , Reason: Received Node Down for peer
          controller, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
          Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
          link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
          contact with <1.1.2>
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
          the cluster
          Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
          in the absence of PLM is outside the scope of OpenSAF
          Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
          Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
          dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
          processed='center(queued)=2197', processed='center(received)=1172',
          processed='destination(messages)=1172',
          processed='destination(mailinfo)=0',
          processed='destination(mailwarn)=0',
          processed='destination(localmessages)=955',
          processed='destination(newserr)=0',
          processed='destination(mailerr)=0', processed='destination(netmgm)=0',
          processed='destination(warn)=44', processed='destination(console)=13',
          processed='destination(null)=0', processed='destination(mail)=0',
          processed='destination(xconsole)=13',
          processed='destination(firewall)=0', processed='destination(acpid)=0',
          processed='destination(newscrit)=0',
          processed='destination(newsnotice)=0', processed='source(src)=1172'
          Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
          on saImmOmSearchNext - aborting
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
          FAILED status:1
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
          IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
          IMM_NODE_FULLY_AVAILABLE (2484)
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
          in ImmModel
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
          broadcasting ABORT_SYNC, epoch:12
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
          committing with ccbId:100000054/4294967380
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
          probation timer started (timeout: 1200000000000 ns)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
          failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
          escalated from 'componentFailover' to 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
          'healthCheckcallbackTimeout' : Recovery is 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
          safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
          to:healthCheckcallbackTimeout Recovery is:suFailover
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
          NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
          node failfast, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
          timeout=60

          syslog, immnd and immd traces of SC-1 attached.


          Sent from sourceforge.net because
          opensaf-tickets@lists.sourceforge.netopensaf-tickets@lists.sourceforge.net
          opensaf-tickets@lists.sourceforge.net is subscribed to
          http://sourceforge.net/p/opensaf/tickets/

          To unsubscribe from further messages, a project admin can change
          settings at http://sourceforge.net/p/opensaf/admin/tickets/options.
          Or, if this is a mailing list, you can unsubscribe from the mailing list.

          ------------------------------------------------------------------------

          Opensaf-tickets mailing list

          Opensaf-tickets@lists.sourceforge.netOpensaf-tickets@lists.sourceforge.net
          Opensaf-tickets@lists.sourceforge.net

          https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


          [tickets:#1291] http://sourceforge.net/p/opensaf/tickets/1291/ IMM:
          IMMD healthcheck callback timeout when standby controller rebooted in
          middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:49 AM UTC
          Owner: Neelakanta Reddy
          Attachments:


  • Anders Bjornerstedt

    • status: not-reproducible --> unassigned
    • Component: mds --> imm
    • Part: - --> d
    • Milestone: never --> 4.5.2
     
  • Sirisha Alla

    Sirisha Alla - 2015-08-19

    IMMD trace on changeset 6477

     
  • Anders Bjornerstedt

    The IMMD sends one fevs message at a time for each poll cycle.
    Also in each poll cycle it checks the AMF descriptor for healthcheck callbacks.

    This means that the IMMD is blocked for more than 3 minutes on broadcasting one fevs message.

    IMMSV_DEFAULT_FEVS_MAX_PENDING affects the IMMND process, not the IMMD.

    FEVS_MAX_PENDING is there precisely so that the IMMD is not overloaded.
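
    To make the above concrete, here is a minimal sketch -- not the actual
    IMMD source; the descriptors and helpers (mbx_fd, amf_fd, broadcast_fevs,
    answer_healthcheck) are hypothetical -- of how a single-threaded poll
    loop that both forwards fevs messages and answers AMF healthchecks can
    miss the healthcheck deadline when the broadcast call blocks:

    #include <poll.h>
    #include <stdbool.h>

    extern int mbx_fd;                /* mailbox holding queued fevs requests  */
    extern int amf_fd;                /* AMF selection object                  */
    extern bool broadcast_fevs(void); /* may block while MDS delivers to nodes */
    extern void answer_healthcheck(void);

    void immd_main_loop(void)
    {
        struct pollfd fds[2] = {
            { .fd = mbx_fd, .events = POLLIN },
            { .fd = amf_fd, .events = POLLIN },
        };

        for (;;) {
            if (poll(fds, 2, -1) <= 0)
                continue;

            if (fds[0].revents & POLLIN) {
                /* One fevs message per poll cycle; if this call blocks for
                 * minutes, the AMF descriptor below is not serviced. */
                broadcast_fevs();
            }

            if (fds[1].revents & POLLIN) {
                /* Must run before the AMF healthcheck callback timeout. */
                answer_healthcheck();
            }
        }
    }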

     
  • Anders Bjornerstedt

    The IMMD is blocked on the (asynchronous) broadcast of one fevs message for more than 3 minutes. Changing component to MDS.

    A relevant question is what is otherwise special about this test.
    Is MDS over TCP used rather than TIPC? (MDS broadcast uses TIPC multicast, which is faster.)

    Clearly something is over- or under-dimensioned in this system.
    This test configuration probably needs special configuration for MDS and/or IMM (max sync batch size).

    Again, I don't see that this is a defect in IMM.
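
    As an aside on the TCP-versus-TIPC question above, the sketch below is
    purely illustrative (send_to_node and multicast_send are hypothetical
    helpers, not MDS APIs). It shows why a broadcast without true multicast
    support degenerates into one unicast send per destination, so the
    sender's latency grows with the number of nodes and with any
    per-destination backpressure:

    #include <stddef.h>

    extern int send_to_node(int node_id, const void *buf, size_t len);
    extern int multicast_send(const void *buf, size_t len);

    int broadcast_msg(const int *nodes, size_t n_nodes,
                      const void *buf, size_t len, int have_multicast)
    {
        if (have_multicast)                  /* e.g. TIPC multicast: one send */
            return multicast_send(buf, len);

        for (size_t i = 0; i < n_nodes; i++) {
            /* e.g. point-to-point transport: one send per peer; each call
             * can block on a slow or congested destination, serializing
             * the whole broadcast. */
            if (send_to_node(nodes[i], buf, len) != 0)
                return -1;
        }
        return 0;
    }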

     
  • Anders Bjornerstedt

    • Component: imm --> mds
     
  • Neelakanta Reddy

    • assigned_to: Neelakanta Reddy --> nobody
     
  • Neelakanta Reddy

    The present MDS_DIRECT_BUF_MAXSIZE = 65479

    A. IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = 90% of MDS_DIRECT_BUF_MAXSIZE = 58931:

    1. immnd traces (SC-1):

    Sep 15 14:24:12.205653 osafimmnd [3625:immnd_evt.c:3553] << immnd_fevs_local_checks
    Sep 15 14:24:12.205688 osafimmnd [3625:immnd_evt.c:3002] TR Replies pending:16
    Sep 15 14:24:12.205742 osafimmnd [3625:immnd_evt.c:3021] T2 SENDING FEVS TO IMMD
    Sep 15 14:24:12.205761 osafimmnd [3625:immsv_evt.c:5395] T8 Sending: IMMD_EVT_ND2D_FEVS_REQ_2 to 0
    Sep 15 14:24:12.206122 immsync [3993:imma_om_api.c:6395] << immsv_sync

    2. immd traces (SC-1):

    Sep 15 14:24:10.792610 osafimmd [3614:immd_mds.c:0750] >> immd_mds_bcast_send
    Sep 15 14:24:10.792624 osafimmd [3614:immsv_evt.c:5400] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0

    3. syslog (SC-1):

    Sep 15 14:24:06 SLES-SLOT-1 osafamfd[3748]: NO Node 'PL-3' left the cluster
    Sep 15 14:24:06 SLES-SLOT-1 osafclmd[3729]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 14:24:06 SLES-SLOT-1 osafclmd[3729]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 14:24:06 SLES-SLOT-1 osafimmnd[3625]: NO Global discard node received for nodeId:2030f pid:3192
    Sep 15 14:24:06 SLES-SLOT-1 osafimmnd[3625]: NO Implementer disconnected 10 <0, 2030f(down)> (MsgQueueService131855)
    Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629130] TIPC: Resetting link <1.1.1:eth1-1.1.3:eth2>, peer not responding
    Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629142] TIPC: Lost link <1.1.1:eth1-1.1.3:eth2> on network plane A
    Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629773] TIPC: Lost contact with <1.1.3>

    4. number of messages from 14:24:06 to 14:24:12:

    28 messages of approximately 90% of the maximum size are sent before the IMMD is blocked (131 messages sent in total).

    Sep 15 14:24:06.999676 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61636 size:56464
    Sep 15 14:24:07.231197 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61637 size:56414
    Sep 15 14:24:07.390262 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61638 size:56464
    Sep 15 14:24:07.541229 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61639 size:56444
    Sep 15 14:24:07.692762 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61640 size:56468
    Sep 15 14:24:07.845123 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61641 size:56430
    Sep 15 14:24:08.008833 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61642 size:56448
    Sep 15 14:24:08.149347 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61643 size:56464
    Sep 15 14:24:08.308973 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61644 size:56464
    Sep 15 14:24:08.450764 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61645 size:56414
    Sep 15 14:24:08.576393 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61646 size:56464
    Sep 15 14:24:08.715694 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61647 size:56468
    Sep 15 14:24:08.845637 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61648 size:56444
    Sep 15 14:24:08.987037 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61649 size:56468
    Sep 15 14:24:09.115771 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61650 size:56430
    Sep 15 14:24:09.262554 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61651 size:56444
    Sep 15 14:24:09.391070 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61652 size:56468
    Sep 15 14:24:09.528649 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61653 size:56444
    Sep 15 14:24:09.674827 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61654 size:56434
    Sep 15 14:24:09.798674 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61655 size:56464
    Sep 15 14:24:09.920046 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61656 size:56448
    Sep 15 14:24:10.062081 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61657 size:56464
    Sep 15 14:24:10.190174 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61658 size:56430
    Sep 15 14:24:10.306722 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61659 size:56448
    Sep 15 14:24:10.438324 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61660 size:56464
    Sep 15 14:24:10.554421 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61661 size:56468
    Sep 15 14:24:10.674673 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61662 size:56410
    Sep 15 14:24:10.791234 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61663 size:56464

    5. immnd traces on PL-4 (129 messages received in total):

    Sep 15 14:24:06.083682 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.217729 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.364716 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.487655 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.610578 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.754247 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.878103 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.994528 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:07.127492 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:07.242638 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:30:37.768066 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:30:37.768181 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:30:37.768418 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0

    B. IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = 50% of MDS_DIRECT_BUF_MAXSIZE = 32739:

    1. Loading without node restart -- 3 minutes and 36 seconds:
      Sep 15 18:25:28 SLES-SLOT-1 osafimmloadd: NO ** Loading from PBE file imm.db at /opt/imm_share **
      Sep 15 18:29:02 SLES-SLOT-1 osafimmnd[3760]: NO Ccb 1 COMMITTED (IMMLOADER)
      --

    Sep 15 18:29:04 SLES-SLOT-1 osafimmd[3749]: NO Ruling epoch changed to:12
    Sep 15 18:29:04 SLES-SLOT-1 osafimmd[3749]: NO ACT: SaImmRepositoryInitModeT changed and noted as being: SA_IMM_KEEP_REPOSITORY
    Sep 15 18:29:04 SLES-SLOT-1 osafimmloadd: NO Load ending normally

    2. Syncing without node restart -- 46 seconds:

    Sep 15 18:29:57 SLES-SLOT-1 osafimmloadd: NO Sync starting
    Sep 15 18:30:43 SLES-SLOT-1 osafimmloadd: IN Synced 60367 objects in total
    Sep 15 18:30:43 SLES-SLOT-1 osafimmnd[3760]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16818
    Sep 15 18:30:43 SLES-SLOT-1 osafimmloadd: NO Sync ending normally
    Sep 15 18:30:43 SLES-SLOT-1 osafimmnd[3760]: NO Epoch set to 14 in ImmModel

    3. For the #1291 test case, the sync takes 3 minutes and 42 seconds:

    Sep 15 16:58:59 SLES-SLOT-1 osafimmloadd: NO Sync starting
    Sep 15 16:59:01 SLES-SLOT-1 osafamfd[9581]: NO Node 'PL-3' left the cluster
    Sep 15 16:59:01 SLES-SLOT-1 osafclmd[9562]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 16:59:01 SLES-SLOT-1 osafclmd[9562]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 16:59:01 SLES-SLOT-1 osafimmnd[9482]: NO Global discard node received for nodeId:2030f pid:8315
    Sep 15 16:59:01 SLES-SLOT-1 osafimmnd[9482]: NO Implementer disconnected 11 <0, 2030f(down)> (MsgQueueService131855)
    Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056187] TIPC: Resetting link <1.1.1:eth1-1.1.3:eth2>, peer not responding
    Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056198] TIPC: Lost link <1.1.1:eth1-1.1.3:eth2> on network plane A
    Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056983] TIPC: Lost contact with <1.1.3>
    Sep 15 17:00:42 SLES-SLOT-1 kernel: [ 8987.444292] TIPC: Established link <1.1.1:eth1-1.1.3:eth2> on network plane A
    Sep 15 17:00:54 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:55 SLES-SLOT-1 sshd[9887]: Accepted publickey for root from 192.168.56.105 port 41897 ssh2
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:05 SLES-SLOT-1 osafimmd[9471]: NO Node 2030f request sync sync-pid:3228 epoch:0
    Sep 15 17:02:40 SLES-SLOT-1 osafimmloadd: IN Synced 60383 objects in total
    Sep 15 17:02:40 SLES-SLOT-1 osafimmnd[9482]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16818
    Sep 15 17:02:40 SLES-SLOT-1 osafimmloadd: NO Sync ending normally

    Observation:

    Even with the sync batch size set to 50% of MDS_DIRECT_BUF_MAXSIZE, the sync takes about 3 minutes and 40 seconds.
    Because of this, CLM and AMFD get TRY_AGAIN for more than 3 minutes and the cluster goes for reboot.
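
    For reference, the two batch sizes quoted in this comment follow from
    MDS_DIRECT_BUF_MAXSIZE by simple integer arithmetic; the helper below is
    only illustrative, not OpenSAF code:

    #include <stdio.h>

    #define MDS_DIRECT_BUF_MAXSIZE 65479 /* value quoted in this ticket */

    /* Batch size as a percentage of the MDS direct buffer size. */
    static unsigned sync_batch_size(unsigned percent)
    {
        return (MDS_DIRECT_BUF_MAXSIZE * percent) / 100;
    }

    int main(void)
    {
        printf("90%% of buffer: %u\n", sync_batch_size(90)); /* 58931 */
        printf("50%% of buffer: %u\n", sync_batch_size(50)); /* 32739 */
        return 0;
    }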

    System buffers:

    SLES-SLOT-1:~ # cat /proc/sys/net/ipv4/tcp_mem
    18165 24223 36330

    The default and maximum amount for the receive socket memory:
    SLES-SLOT-1:~ # cat /proc/sys/net/core/rmem_default
    126976
    SLES-SLOT-1:~ # cat /proc/sys/net/core/rmem_max
    131071
    The default and maximum amount for the send socket memory:

    SLES-SLOT-1:~ # cat /proc/sys/net/core/wmem_default
    126976
    SLES-SLOT-1:~ # cat /proc/sys/net/core/wmem_max
    131071

    The maximum amount of option memory buffers:

    SLES-SLOT-1:~ # cat /proc/sys/net/core/optmem_max
    20480
    SLES-SLOT-1:~ #

     
  • Anders Bjornerstedt

    • status: unassigned --> invalid
     
  • Anders Bjornerstedt

    If the problem is to be declared as a configuration error, i.e. solvable by adjusting one or more
    configuration values that are documented by OpenSAF, then the ticket should be closed as
    invalid.

     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> never
     