The issue is observed with 4.6 FC changeset 6377. The system is up and running with a single PBE and 50k objects. This issue is seen after http://sourceforge.net/p/opensaf/tickets/1290 is observed. An IMM application is running on the standby controller, and an immcfg command is run from a payload to set the CompRestartMax value to 1000. IMMND is killed twice on the standby controller, leading to #1290.
As a result, the standby controller left the cluster in the middle of sync, IMMD reported a healthcheck callback timeout, and the active controller too went for reboot. Following is the syslog of SC-1:
Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with <1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=2197', processed='center(received)=1172', processed='destination(messages)=1172', processed='destination(mailinfo)=0', processed='destination(mailwarn)=0', processed='destination(localmessages)=955', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=44', processed='destination(console)=13', processed='destination(null)=0', processed='destination(mail)=0', processed='destination(xconsole)=13', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
syslog, immnd and immd traces of SC-1 attached.
Changeset "6744" is generated today.
So I assume this means you reproduced this today.
The IMMND main poll handling processes in sequence on each descriptor, so it should not be possible
For traffic on one descriptor to "starve out" a job on another.
/AndersBj
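Below is a minimal sketch of the kind of sequential descriptor handling described above; the names and structure are assumptions for illustration, not the actual OpenSAF code:

/* Sketch only: every ready descriptor is handled once per poll cycle,
 * so traffic on one descriptor cannot starve the others unless a single
 * handler itself blocks. */
#include <poll.h>
#include <stdio.h>

enum { FD_AMF, FD_MBX, FD_FEVS, NUM_FD };

static void handle_amf(void)  { printf("dispatch AMF callbacks (healthcheck)\n"); }
static void handle_mbx(void)  { printf("process internal mailbox event\n"); }
static void handle_fevs(void) { printf("process one incoming fevs message\n"); }

static void main_poll_loop(struct pollfd fds[NUM_FD])
{
    for (;;) {
        if (poll(fds, NUM_FD, -1) <= 0)
            continue;
        if (fds[FD_AMF].revents & POLLIN)  handle_amf();
        if (fds[FD_MBX].revents & POLLIN)  handle_mbx();
        if (fds[FD_FEVS].revents & POLLIN) handle_fevs();
        /* back to poll(): each descriptor gets a turn on every cycle */
    }
}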
From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
Sent: den 19 augusti 2015 10:54
To: [opensaf:tickets]
Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
OK, but then the question simply becomes: why does the healthcheck callback not reach the IMMND, or why does the IMMND reply not reach the AMFND?
/AndersBj
From: Sirisha Alla [mailto:allas@users.sf.net]
Sent: den 19 augusti 2015 10:50
To: [opensaf:tickets]
Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
This issue is reproduced on changeset 6744. Syslog as follows:
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
......
Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
.....
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
In the above logs, is this the reason for IMMND hanging for 3 minutes?
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
Yes, I tried this today. The healthcheck timeout happened on IMMD, not on IMMND.
/Sirisha
Please reproduce with IMMD trace on.
/AndersBj
snippet from IMMD trace at the time of test:
Aug 19 12:00:18.707828 osafimmd [5958:immd_mbcsv.c:0463] T5 **ENC SYNC COUNT 1142
Aug 19 12:00:18.707832 osafimmd [5958:immd_mbcsv.c:0482] T5 ENCODE IMMD_A2S_MSG_FEVS: send count: 52246 handle: 2005749858575
Aug 19 12:00:18.707836 osafimmd [5958:immd_mbcsv.c:0605] << mbcsv_enc_async_update
Aug 19 12:00:18.707840 osafimmd [5958:immd_mbcsv.c:0843] << immd_mbcsv_encode_proc
Aug 19 12:00:18.707844 osafimmd [5958:immd_mbcsv.c:0428] T5 IMMD - MBCSv Callback Success
Aug 19 12:00:18.707848 osafimmd [5958:immd_mbcsv.c:0429] << immd_mbcsv_callback
Aug 19 12:00:18.707851 osafimmd [5958:mbcsv_util.c:0438] TR send the encoded message to any other peer with same s/w version
Aug 19 12:00:18.707855 osafimmd [5958:mbcsv_util.c:0441] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Aug 19 12:00:18.707859 osafimmd [5958:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 42, pwe_hdl: 65549
Aug 19 12:00:18.707863 osafimmd [5958:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:d
Aug 19 12:00:18.707868 osafimmd [5958:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_mds.c:0244] << mbcsv_mds_send_msg: success
Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_util.c:0492] << mbcsv_send_ckpt_data_to_all_peers
Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
Aug 19 12:00:18.708472 osafimmd [5958:immd_mbcsv.c:0063] << immd_mbcsv_sync_update
Aug 19 12:00:18.708472 osafimmd [5958:immd_mds.c:0750] >> immd_mds_bcast_send
Aug 19 12:00:18.708472 osafimmd [5958:immsv_evt.c:5400] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0
Aug 19 12:05:59.014948 osafimmd [2456:immd_main.c:0111] >> immd_initialize
Aug 19 12:05:59.036253 osafimmd [2456:ncs_main_pub.c:0223] TR NCS:PROCESS_ID=2456
Aug 19 12:05:59.036310 osafimmd [2456:sysf_def.c:0090] TR INITIALIZING LEAP ENVIRONMENT
I have shared the IMMD and IMMND traces with Neel. I will try uploading
IMMD traces to the ticket.
Regards,
Sirisha
IMMD trace on changeset 6477
The IMMD sends one fevs message at a time in each poll cycle.
Also, in each poll cycle it checks the AMF descriptor for healthcheck callbacks.
This means that the IMMD is blocked for more than 3 minutes on broadcasting one fevs message.
The IMMSV_DEFAULT_FEVS_MAX_PENDING limit affects the IMMND process, not the IMMD.
FEVS_MAX_PENDING is there precisely so as not to overload the IMMD.
The IMMD is blocked on the (asynchronous) broadcast of one fevs message for more than 3 minutes. Changing component to MDS.
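To make the suspected failure mode concrete, here is a minimal sketch (assumed names, not the actual IMMD code) of a poll cycle in which the fevs broadcast is issued from the handler; if that send blocks, the AMF descriptor is not serviced again and the healthcheck callback times out:

/* Sketch only: one fevs broadcast per poll cycle plus AMF dispatch.
 * A broadcast that stalls (full buffers, slow peer) stalls the whole cycle,
 * so the healthcheck callback on the AMF descriptor is never answered. */
#include <poll.h>

static void amf_dispatch(void) { /* stand-in: dispatch healthcheck callbacks to AMF */ }
static int  mds_broadcast_fevs(const void *msg, unsigned len)
{ (void)msg; (void)len; return 0; /* stand-in for the real broadcast, which may block */ }

void immd_poll_cycle(struct pollfd fds[2], const void *msg, unsigned len)
{
    /* fds[0] = AMF selection object, fds[1] = fevs/mailbox descriptor */
    if (poll(fds, 2, -1) <= 0)
        return;
    if (fds[1].revents & POLLIN)
        mds_broadcast_fevs(msg, len);  /* blocked here for >3 minutes in the trace above */
    if (fds[0].revents & POLLIN)
        amf_dispatch();                /* never reached while the broadcast is stuck */
}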
A relevant question is what is otherwise special about this test.
Is MDS over TCP used rather than TIPC? (MDS broadcast uses TIPC multicast, which is faster.)
Clearly something is over/under dimensioned in this system.
This test configuration probably needs special configuration for MDS and/or IMM (max sync batch size).
Again, I don't see that this is a defect in IMM.
The present MDS_DIRECT_BUF_MAXSIZE = 65479
A. IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = 90% of MDS_DIRECT_BUF_MAXSIZE = 58931:
Sep 15 14:24:12.205653 osafimmnd [3625:immnd_evt.c:3553] << immnd_fevs_local_checks
Sep 15 14:24:12.205688 osafimmnd [3625:immnd_evt.c:3002] TR Replies pending:16
Sep 15 14:24:12.205742 osafimmnd [3625:immnd_evt.c:3021] T2 SENDING FEVS TO IMMD
Sep 15 14:24:12.205761 osafimmnd [3625:immsv_evt.c:5395] T8 Sending: IMMD_EVT_ND2D_FEVS_REQ_2 to 0
Sep 15 14:24:12.206122 immsync [3993:imma_om_api.c:6395] << immsv_sync
Sep 15 14:24:10.792610 osafimmd [3614:immd_mds.c:0750] >> immd_mds_bcast_send
Sep 15 14:24:10.792624 osafimmd [3614:immsv_evt.c:5400] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0
Sep 15 14:24:06 SLES-SLOT-1 osafamfd[3748]: NO Node 'PL-3' left the cluster
Sep 15 14:24:06 SLES-SLOT-1 osafclmd[3729]: NO Node 131855 went down. Not sending track callback for agents on that node
Sep 15 14:24:06 SLES-SLOT-1 osafclmd[3729]: NO Node 131855 went down. Not sending track callback for agents on that node
Sep 15 14:24:06 SLES-SLOT-1 osafimmnd[3625]: NO Global discard node received for nodeId:2030f pid:3192
Sep 15 14:24:06 SLES-SLOT-1 osafimmnd[3625]: NO Implementer disconnected 10 <0, 2030f(down)> (MsgQueueService131855)
Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629130] TIPC: Resetting link <1.1.1:eth1-1.1.3:eth2>, peer not responding
Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629142] TIPC: Lost link <1.1.1:eth1-1.1.3:eth2> on network plane A
Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629773] TIPC: Lost contact with <1.1.3>
28 messages of approximately 90% of the maximum size are sent before the IMMD is blocked (131 messages sent in total)
Sep 15 14:24:06.999676 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61636 size:56464
Sep 15 14:24:07.231197 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61637 size:56414
Sep 15 14:24:07.390262 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61638 size:56464
Sep 15 14:24:07.541229 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61639 size:56444
Sep 15 14:24:07.692762 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61640 size:56468
Sep 15 14:24:07.845123 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61641 size:56430
Sep 15 14:24:08.008833 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61642 size:56448
Sep 15 14:24:08.149347 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61643 size:56464
Sep 15 14:24:08.308973 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61644 size:56464
Sep 15 14:24:08.450764 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61645 size:56414
Sep 15 14:24:08.576393 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61646 size:56464
Sep 15 14:24:08.715694 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61647 size:56468
Sep 15 14:24:08.845637 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61648 size:56444
Sep 15 14:24:08.987037 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61649 size:56468
Sep 15 14:24:09.115771 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61650 size:56430
Sep 15 14:24:09.262554 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61651 size:56444
Sep 15 14:24:09.391070 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61652 size:56468
Sep 15 14:24:09.528649 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61653 size:56444
Sep 15 14:24:09.674827 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61654 size:56434
Sep 15 14:24:09.798674 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61655 size:56464
Sep 15 14:24:09.920046 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61656 size:56448
Sep 15 14:24:10.062081 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61657 size:56464
Sep 15 14:24:10.190174 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61658 size:56430
Sep 15 14:24:10.306722 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61659 size:56448
Sep 15 14:24:10.438324 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61660 size:56464
Sep 15 14:24:10.554421 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61661 size:56468
Sep 15 14:24:10.674673 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61662 size:56410
Sep 15 14:24:10.791234 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61663 size:56464
Sep 15 14:24:06.083682 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.217729 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.364716 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.487655 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.610578 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.754247 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.878103 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:06.994528 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:07.127492 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:24:07.242638 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:30:37.768066 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:30:37.768181 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
Sep 15 14:30:37.768418 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
B. IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = 50% of MDS_DIRECT_BUF_MAXSIZE = 32739:
Sep 15 18:25:28 SLES-SLOT-1 osafimmloadd: NO ** Loading from PBE file imm.db at /opt/imm_share **
Sep 15 18:29:02 SLES-SLOT-1 osafimmnd[3760]: NO Ccb 1 COMMITTED (IMMLOADER)
--
Sep 15 18:29:04 SLES-SLOT-1 osafimmd[3749]: NO Ruling epoch changed to:12
Sep 15 18:29:04 SLES-SLOT-1 osafimmd[3749]: NO ACT: SaImmRepositoryInitModeT changed and noted as being: SA_IMM_KEEP_REPOSITORY
Sep 15 18:29:04 SLES-SLOT-1 osafimmloadd: NO Load ending normally
Sep 15 18:29:57 SLES-SLOT-1 osafimmloadd: NO Sync starting
Sep 15 18:30:43 SLES-SLOT-1 osafimmloadd: IN Synced 60367 objects in total
Sep 15 18:30:43 SLES-SLOT-1 osafimmnd[3760]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16818
Sep 15 18:30:43 SLES-SLOT-1 osafimmloadd: NO Sync ending normally
Sep 15 18:30:43 SLES-SLOT-1 osafimmnd[3760]: NO Epoch set to 14 in ImmModel
Sep 15 16:58:59 SLES-SLOT-1 osafimmloadd: NO Sync starting
Sep 15 16:59:01 SLES-SLOT-1 osafamfd[9581]: NO Node 'PL-3' left the cluster
Sep 15 16:59:01 SLES-SLOT-1 osafclmd[9562]: NO Node 131855 went down. Not sending track callback for agents on that node
Sep 15 16:59:01 SLES-SLOT-1 osafclmd[9562]: NO Node 131855 went down. Not sending track callback for agents on that node
Sep 15 16:59:01 SLES-SLOT-1 osafimmnd[9482]: NO Global discard node received for nodeId:2030f pid:8315
Sep 15 16:59:01 SLES-SLOT-1 osafimmnd[9482]: NO Implementer disconnected 11 <0, 2030f(down)> (MsgQueueService131855)
Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056187] TIPC: Resetting link <1.1.1:eth1-1.1.3:eth2>, peer not responding
Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056198] TIPC: Lost link <1.1.1:eth1-1.1.3:eth2> on network plane A
Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056983] TIPC: Lost contact with <1.1.3>
Sep 15 17:00:42 SLES-SLOT-1 kernel: [ 8987.444292] TIPC: Established link <1.1.1:eth1-1.1.3:eth2> on network plane A
Sep 15 17:00:54 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:55 SLES-SLOT-1 sshd[9887]: Accepted publickey for root from 192.168.56.105 port 41897 ssh2
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
Sep 15 17:01:05 SLES-SLOT-1 osafimmd[9471]: NO Node 2030f request sync sync-pid:3228 epoch:0
Sep 15 17:02:40 SLES-SLOT-1 osafimmloadd: IN Synced 60383 objects in total
Sep 15 17:02:40 SLES-SLOT-1 osafimmnd[9482]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16818
Sep 15 17:02:40 SLES-SLOT-1 osafimmloadd: NO Sync ending normally
Observation:
While analyzing with 50% of MDS_DIRECT_BUF_MAXSIZE, the sync takes 3 minutes and 40 seconds.
Because of this, CLM and AMFD get TRY_AGAIN for more than 3 minutes and the cluster went for reboot.
System buffers:
SLES-SLOT-1:~ # cat /proc/sys/net/ipv4/tcp_mem
18165 24223 36330
The default and maximum amount for the receive socket memory:
SLES-SLOT-1:~ # cat /proc/sys/net/core/rmem_default
126976
SLES-SLOT-1:~ # cat /proc/sys/net/core/rmem_max
131071
The default and maximum amount for the send socket memory:
SLES-SLOT-1:~ # cat /proc/sys/net/core/wmem_default
126976
SLES-SLOT-1:~ # cat /proc/sys/net/core/wmem_max
131071
The maximum amount of option memory buffers:
SLES-SLOT-1:~ # cat /proc/sys/net/core/optmem_max
20480
SLES-SLOT-1:~ #
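For reference, a rough calculation relating the sync batch sizes above to these socket buffer limits; the idea that the send/receive buffers bound how many sync messages can be in flight before the sender blocks is an assumption here, not something established in the ticket:

/* Back-of-the-envelope arithmetic for the values quoted above.
 * Assumption: the socket buffers limit how many sync messages can be
 * outstanding before the sending side blocks. */
#include <stdio.h>

int main(void)
{
    const unsigned mds_max  = 65479;   /* MDS_DIRECT_BUF_MAXSIZE        */
    const unsigned wmem_max = 131071;  /* /proc/sys/net/core/wmem_max   */
    const unsigned batch90  = mds_max * 90 / 100;   /* 58931 bytes      */
    const unsigned batch50  = mds_max * 50 / 100;   /* 32739 bytes      */

    printf("90%% batch = %u bytes -> %u messages fit in wmem_max\n",
           batch90, wmem_max / batch90);            /* about 2          */
    printf("50%% batch = %u bytes -> %u messages fit in wmem_max\n",
           batch50, wmem_max / batch50);            /* about 4          */
    return 0;
}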
If the problem is to be declared as a configuration error, i.e. solvable by adjusting one or more
configuration values that are documented by OpenSAF, then the ticket should be closed as
invalid.