
#1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

never
invalid
nobody
None
defect
mds
d
4.6 FC
major
2015-11-02
2015-03-30
No

The issue is observed with 4.6 FC changeset 6377. The system is up and running with a single PBE and 50k objects. This issue is seen after http://sourceforge.net/p/opensaf/tickets/1290 is observed. An IMM application is running on the standby controller, and the immcfg command is run from a payload to set the CompRestartMax value to 1000. IMMND is killed twice on the standby controller, leading to #1290.
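
(For reference, a modification like the one described above would be done with an immcfg attribute-modify command along the lines of:

immcfg -a saAmfSGCompRestartMax=1000 <object-DN>

The exact attribute name and object DN used in the test are not recorded in this ticket; the line above only illustrates the immcfg modify syntax, under the assumption that the attribute in question is the SG-level saAmfSGCompRestartMax.)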

As a result, the standby controller left the cluster in the middle of the sync, IMMD reported a healthcheck callback timeout, and the active controller also went for a reboot. Following is the syslog of SC-1:

Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with <1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=2197', processed='center(received)=1172', processed='destination(messages)=1172', processed='destination(mailinfo)=0', processed='destination(mailwarn)=0', processed='destination(localmessages)=955', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=44', processed='destination(console)=13', processed='destination(null)=0', processed='destination(mail)=0', processed='destination(xconsole)=13', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60

The syslog, IMMND and IMMD traces of SC-1 are attached.

1 Attachment

• immlogs.tar.bz2 (6.8 MB; application/x-bzip): https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2

Related

Tickets: #1291

Discussion

  • Anders Bjornerstedt

    Changeset "6744" was generated today,
    so I assume this means you reproduced this today.

    The IMMND main poll handling processes each descriptor in sequence, so it should not be possible for traffic on one descriptor to "starve out" a job on another.

    /AndersBj
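
    (Illustrative sketch, not the actual IMMND code: the sequential handling described above can be pictured as a plain poll() loop in which every descriptor that reports activity is serviced before poll() is called again, so a busy descriptor cannot keep another ready descriptor from being handled. The handler below is made up for the illustration.)

    #include <poll.h>
    #include <unistd.h>

    /* Hypothetical per-descriptor handler: drain whatever is pending on fd. */
    static void handle_fd(int idx, int fd)
    {
        char buf[256];
        (void)idx;
        (void)read(fd, buf, sizeof(buf));   /* placeholder for real dispatch */
    }

    static void main_loop(struct pollfd *fds, int nfds)
    {
        for (;;) {
            if (poll(fds, nfds, -1) <= 0)
                continue;
            /* Every descriptor that is ready in this round is serviced,
             * one after another, before poll() is called again, so traffic
             * on one descriptor cannot starve out a job on another. */
            for (int i = 0; i < nfds; i++)
                if (fds[i].revents & POLLIN)
                    handle_fd(i, fds[i].fd);
        }
    }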

    From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
    Sent: 19 August 2015 10:54
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

    OK, but then the question simply becomes: why does the healthcheck callback not reach the IMMND, or why does the IMMND reply not reach the AMFND?

    /AndersBj

    From: Sirisha Alla [mailto:allas@users.sf.net]
    Sent: 19 August 2015 10:50
    To: [opensaf:tickets]
    Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

    This issue is reproduced on changeset 6744. Syslog as follows:

    Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
    Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
    Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
    ......
    Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
    Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    .....
    Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

    In the above logs, is this the reason for IMMND hanging for 3 minutes?

    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
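
    (Editor's sketch: the "sending SIGTERM" line above indicates that the IMMND concluded the PBE child process was stuck on runtime data handling and terminated it. A minimal illustration of that kind of watchdog check, assuming a simple timeout rule and not taken from the OpenSAF sources, would be:)

    #include <signal.h>
    #include <sys/types.h>
    #include <time.h>

    /* Hypothetical watchdog: if the PBE child has not made progress on its
     * outstanding runtime-data work within 'limit' seconds, assume it is
     * stuck and ask it to terminate (compare the syslog lines above). */
    static void check_pbe(pid_t pbe_pid, time_t last_progress, time_t limit)
    {
        if (time(NULL) - last_progress > limit)
            kill(pbe_pid, SIGTERM);
    }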



    • Sirisha Alla

      Sirisha Alla - 2015-08-19

      Yes, I tried this today. The healthcheck timeout happened on IMMD, not on IMMND.

      /Sirisha


      • Anders Bjornerstedt

        Please reproduce with the IMMD trace on.

        /AndersBj

        From: Sirisha Alla [mailto:sirisha.alla@oracle.com]
        Sent: den 19 augusti 2015 11:07
        To: [opensaf:tickets]; opensaf-tickets@lists.sourceforge.net
        Subject: Re: [tickets] [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

        Yes, I tried this today. The healthcheck timeout happened on IMMD not on IMMND.

        /Sirisha

        On Wednesday 19 August 2015 02:28 PM, Anders Bjornerstedt wrote:

        Changeset "6744" is generated today.
        So I assume this means you reproduced this today.

        The IMMND main poll handling processes in sequence on each descriptor, so it should not be possible
        For traffic on one descriptor to "starve out" a job on another.

        /AndersBj

        From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
        Sent: den 19 augusti 2015 10:54
        To: [opensaf:tickets]
        Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

        Ok but then the question simply becomes why does the healthcheck callback not reach the IMMND or why does the IMMND reply
        not reach the AMFND ?

        /AndersBj

        From: Sirisha Alla [mailto:allas@users.sf.net]
        Sent: den 19 augusti 2015 10:50
        To: [opensaf:tickets]
        Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

        This issue is reproduced on changeset 6744. Syslog as follows:

        Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
        Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
        Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
        Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
        ......
        Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
        Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
        Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
        Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
        Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
        Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
        Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
        Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
        Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
        Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
        Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
        Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
        Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
        Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
        .....
        Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
        Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
        Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

        In the above logs, is this the reason for IMMND hanging for 3 minutes?

        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
        Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
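
        For reference, a minimal illustration of the supervision pattern behind the
        "appears stuck ... sending SIGTERM" message quoted above: a parent process
        notices that its helper has not acknowledged work within a deadline and
        terminates it. The names and the timeout below are hypothetical; this is
        not the actual IMMND/PBE code.

        #include <signal.h>
        #include <time.h>
        #include <sys/types.h>

        /* Assumed deadline, for illustration only. */
        #define STUCK_LIMIT_SEC 30

        struct helper {
            pid_t pid;       /* e.g. the persistent back end (PBE) child */
            time_t last_ack; /* last time it acknowledged runtime data handling */
        };

        /* Called periodically from the parent's main loop. */
        void check_helper(struct helper *h)
        {
            if (h->pid > 0 && time(NULL) - h->last_ack > STUCK_LIMIT_SEC) {
                /* Helper appears stuck: ask it to exit, as in the
                 * "sending SIGTERM" log line quoted above. The parent later
                 * detects the death and purges or restarts the helper. */
                kill(h->pid, SIGTERM);
                h->pid = -1;
            }
        }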



        • Sirisha Alla

          Sirisha Alla - 2015-08-19

          A snippet from the IMMD trace at the time of the test:

          Aug 19 12:00:18.707828 osafimmd [5958:immd_mbcsv.c:0463] T5 **ENC SYNC COUNT 1142
          Aug 19 12:00:18.707832 osafimmd [5958:immd_mbcsv.c:0482] T5 ENCODE IMMD_A2S_MSG_FEVS: send count: 52246 handle: 2005749858575
          Aug 19 12:00:18.707836 osafimmd [5958:immd_mbcsv.c:0605] << mbcsv_enc_async_update
          Aug 19 12:00:18.707840 osafimmd [5958:immd_mbcsv.c:0843] << immd_mbcsv_encode_proc
          Aug 19 12:00:18.707844 osafimmd [5958:immd_mbcsv.c:0428] T5 IMMD - MBCSv Callback Success
          Aug 19 12:00:18.707848 osafimmd [5958:immd_mbcsv.c:0429] << immd_mbcsv_callback
          Aug 19 12:00:18.707851 osafimmd [5958:mbcsv_util.c:0438] TR send the encoded message to any other peer with same s/w version
          Aug 19 12:00:18.707855 osafimmd [5958:mbcsv_util.c:0441] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
          Aug 19 12:00:18.707859 osafimmd [5958:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 42, pwe_hdl: 65549
          Aug 19 12:00:18.707863 osafimmd [5958:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:d
          Aug 19 12:00:18.707868 osafimmd [5958:mbcsv_mds.c:0209] TR send type MDS_SENDTYPE_REDRSP:
          Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_mds.c:0244] << mbcsv_mds_send_msg: success
          Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_util.c:0492] << mbcsv_send_ckpt_data_to_all_peers
          Aug 19 12:00:18.708472 osafimmd [5958:mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
          Aug 19 12:00:18.708472 osafimmd [5958:immd_mbcsv.c:0063] << immd_mbcsv_sync_update
          Aug 19 12:00:18.708472 osafimmd [5958:immd_mds.c:0750] >> immd_mds_bcast_send
          Aug 19 12:00:18.708472 osafimmd [5958:immsv_evt.c:5400] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0
          Aug 19 12:05:59.014948 osafimmd [2456:immd_main.c:0111] >> immd_initialize
          Aug 19 12:05:59.036253 osafimmd [2456:ncs_main_pub.c:0223] TR NCS:PROCESS_ID=2456
          Aug 19 12:05:59.036310 osafimmd [2456:sysf_def.c:0090] TR INITIALIZING LEAP ENVIRONMENT

          I have shared the IMMD and IMMND traces with Neel. I will try to
          upload the IMMD traces to the ticket.

          Regards,
          Sirisha

          On Wednesday 19 August 2015 02:40 PM, Anders Bjornerstedt wrote:

          Please reproduce with the IMMD trace on.

          /AndersBj

          From: Sirisha Alla [mailto:sirisha.alla@oracle.com]
          Sent: 19 August 2015 11:07
          To: [opensaf:tickets]; opensaf-tickets@lists.sourceforge.net
          Subject: Re: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback
          timeout when standby controller rebooted in middle of IMMND sync

          Yes, I tried this today. The healthcheck timeout happened on IMMD,
          not on IMMND.

          /Sirisha

          On Wednesday 19 August 2015 02:28 PM, Anders Bjornerstedt wrote:

          Changeset "6744" is generated today.
          So I assume this means you reproduced this today.

          The IMMND main poll handling processes in sequence on each descriptor,
          so it should not be possible
          For traffic on one descriptor to "starve out" a job on another.

          /AndersBj
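
          A minimal sketch of the sequential dispatch described above, assuming
          a simplified main loop with hypothetical handler names (this is not
          the actual IMMND code):

          #include <poll.h>

          #define NUM_FDS 3

          /* Hypothetical per-descriptor handlers, for illustration only. */
          extern void handle_mds_msg(int fd);
          extern void handle_amf_event(int fd);
          extern void handle_timer(int fd);

          void main_loop(struct pollfd fds[NUM_FDS])
          {
              void (*handlers[NUM_FDS])(int) = {
                  handle_mds_msg, handle_amf_event, handle_timer
              };

              for (;;) {
                  if (poll(fds, NUM_FDS, -1) <= 0)
                      continue;

                  /* Every ready descriptor is serviced once per pass, in order;
                   * heavy traffic on fds[0] cannot postpone the handler for
                   * fds[1] beyond the current iteration. */
                  for (int i = 0; i < NUM_FDS; i++) {
                      if (fds[i].revents & POLLIN)
                          handlers[i](fds[i].fd);
                  }
              }
          }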

          From: Anders Bjornerstedt [mailto:andersbj@users.sf.net]
          Sent: 19 August 2015 10:54
          To: [opensaf:tickets]
          Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback
          timeout when standby controller rebooted in middle of IMMND sync

          OK, but then the question simply becomes: why does the healthcheck
          callback not reach the IMMND, or why does the IMMND reply not reach
          the AMFND? (A sketch of the expected round trip follows this message.)

          /AndersBj
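
          A minimal sketch of the healthcheck round trip in question, using the
          standard SA Forum AMF C API: AMF invokes the component's registered
          healthcheck callback, and the component must answer with
          saAmfResponse() before the callback timeout expires, otherwise AMF
          faults the component (here escalating to suFailover and node
          failfast). Illustrative only, not the IMMD implementation.

          #include <poll.h>
          #include <saAmf.h>

          static SaAmfHandleT amf_handle;

          /* Invoked by AMF; the reply must reach the AMF node director (AMFND)
           * before the healthcheck callback timeout, or the component is
           * faulted with 'healthCheckcallbackTimeout'. */
          static void healthcheck_cb(SaInvocationT invocation,
                                     const SaNameT *comp_name,
                                     SaAmfHealthcheckKeyT *key)
          {
              (void)comp_name;
              (void)key;
              saAmfResponse(amf_handle, invocation, SA_AIS_OK);
          }

          int component_main(void)
          {
              SaAmfCallbacksT callbacks = {0};
              SaVersionT version = {'B', 1, 1};
              SaSelectionObjectT sel_obj;
              struct pollfd pfd;

              callbacks.saAmfHealthcheckCallback = healthcheck_cb;

              if (saAmfInitialize(&amf_handle, &callbacks, &version) != SA_AIS_OK)
                  return 1;
              if (saAmfSelectionObjectGet(amf_handle, &sel_obj) != SA_AIS_OK)
                  return 1;

              pfd.fd = (int)sel_obj;
              pfd.events = POLLIN;

              /* The component polls the AMF selection object together with its
               * other descriptors; when it becomes readable, saAmfDispatch()
               * drives the pending callback (the healthcheck) to completion. */
              for (;;) {
                  if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                      saAmfDispatch(amf_handle, SA_DISPATCH_ONE);
              }
          }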



          [tickets:#1291]
          http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/
          IMM: IMMD healthcheck callback timeout when standby controller
          rebooted in middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:40 AM UTC
          Owner: Neelakanta Reddy
          Attachments:

          • immlogs.tar.bz2https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2
            (6.8 MB; application/x-bzip)

          The issue is observed with 4.6 FC changeset 6377. The system is up and
          running with single pbe and 50k objects. This issue is seen after
          http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM
          application is running on standby controller and immcfg command is run
          from payload to set CompRestartMax value to 1000. IMMND is killed
          twice on standby controller leading to #1290.

          As a result, standby controller left the cluster in middle of sync,
          IMMD reported healthcheck callback timeout and the active controller
          too went for reboot. Following is the syslog of SC-1:

          Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
          node id 2020f:
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
          NodeId = 131599 EE Name = , Reason: Received Node Down for peer
          controller, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
          Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
          link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
          contact with <1.1.2>
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
          the cluster
          Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
          in the absence of PLM is outside the scope of OpenSAF
          Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
          Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
          dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
          processed='center(queued)=2197', processed='center(received)=1172',
          processed='destination(messages)=1172',
          processed='destination(mailinfo)=0',
          processed='destination(mailwarn)=0',
          processed='destination(localmessages)=955',
          processed='destination(newserr)=0',
          processed='destination(mailerr)=0', processed='destination(netmgm)=0',
          processed='destination(warn)=44', processed='destination(console)=13',
          processed='destination(null)=0', processed='destination(mail)=0',
          processed='destination(xconsole)=13',
          processed='destination(firewall)=0', processed='destination(acpid)=0',
          processed='destination(newscrit)=0',
          processed='destination(newsnotice)=0', processed='source(src)=1172'
          Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
          on saImmOmSearchNext - aborting
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
          FAILED status:1
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
          IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
          IMM_NODE_FULLY_AVAILABLE (2484)
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
          in ImmModel
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
          broadcasting ABORT_SYNC, epoch:12
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
          committing with ccbId:100000054/4294967380
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
          probation timer started (timeout: 1200000000000 ns)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
          failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
          escalated from 'componentFailover' to 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
          'healthCheckcallbackTimeout' : Recovery is 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
          safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
          to:healthCheckcallbackTimeout Recovery is:suFailover
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
          NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
          node failfast, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
          timeout=60

          syslog, immnd and immd traces of SC-1 attached.


          Sent from sourceforge.net because you indicated interest in
          https://sourceforge.net/p/opensaf/tickets/1291/

          To unsubscribe from further messages, please visit
          https://sourceforge.net/auth/subscriptions/


          [tickets:#1291]
          http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/
          IMM: IMMD healthcheck callback timeout when standby controller
          rebooted in middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:49 AM UTC
          Owner: Neelakanta Reddy
          Attachments:

          • immlogs.tar.bz2http://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2
            (6.8 MB; application/x-bzip)

          The issue is observed with 4.6 FC changeset 6377. The system is up and
          running with single pbe and 50k objects. This issue is seen after
          http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM
          application is running on standby controller and immcfg command is run
          from payload to set CompRestartMax value to 1000. IMMND is killed
          twice on standby controller leading to #1290.

          As a result, standby controller left the cluster in middle of sync,
          IMMD reported healthcheck callback timeout and the active controller
          too went for reboot. Following is the syslog of SC-1:

          Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
          node id 2020f:
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
          NodeId = 131599 EE Name = , Reason: Received Node Down for peer
          controller, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
          Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
          link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
          contact with <1.1.2>
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
          the cluster
          Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
          in the absence of PLM is outside the scope of OpenSAF
          Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
          Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
          dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
          processed='center(queued)=2197', processed='center(received)=1172',
          processed='destination(messages)=1172',
          processed='destination(mailinfo)=0',
          processed='destination(mailwarn)=0',
          processed='destination(localmessages)=955',
          processed='destination(newserr)=0',
          processed='destination(mailerr)=0', processed='destination(netmgm)=0',
          processed='destination(warn)=44', processed='destination(console)=13',
          processed='destination(null)=0', processed='destination(mail)=0',
          processed='destination(xconsole)=13',
          processed='destination(firewall)=0', processed='destination(acpid)=0',
          processed='destination(newscrit)=0',
          processed='destination(newsnotice)=0', processed='source(src)=1172'
          Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
          on saImmOmSearchNext - aborting
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
          FAILED status:1
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
          IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
          IMM_NODE_FULLY_AVAILABLE (2484)
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
          in ImmModel
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
          broadcasting ABORT_SYNC, epoch:12
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
          committing with ccbId:100000054/4294967380
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
          probation timer started (timeout: 1200000000000 ns)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
          failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
          escalated from 'componentFailover' to 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
          'healthCheckcallbackTimeout' : Recovery is 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
          safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
          to:healthCheckcallbackTimeout Recovery is:suFailover
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
          NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
          node failfast, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
          timeout=60

          syslog, immnd and immd traces of SC-1 attached.


          Sent from sourceforge.net because you indicated interest in
          https://sourceforge.net/p/opensaf/tickets/1291/

          To unsubscribe from further messages, please visit
          https://sourceforge.net/auth/subscriptions/


          [tickets:#1291]
          http://sourceforge.net/p/opensaf/tickets/1291/http://sourceforge.net/p/opensaf/tickets/1291/
          IMM: IMMD healthcheck callback timeout when standby controller
          rebooted in middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:49 AM UTC
          Owner: Neelakanta Reddy
          Attachments:

          • immlogs.tar.bz2http://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2
            (6.8 MB; application/x-bzip)

          The issue is observed with 4.6 FC changeset 6377. The system is up and
          running with single pbe and 50k objects. This issue is seen after
          http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM
          application is running on standby controller and immcfg command is run
          from payload to set CompRestartMax value to 1000. IMMND is killed
          twice on standby controller leading to #1290.

          As a result, standby controller left the cluster in middle of sync,
          IMMD reported healthcheck callback timeout and the active controller
          too went for reboot. Following is the syslog of SC-1:

          Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
          node id 2020f:
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
          NodeId = 131599 EE Name = , Reason: Received Node Down for peer
          controller, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
          Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
          link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
          contact with <1.1.2>
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
          down. Not sending track callback for agents on that node
          Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
          the cluster
          Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
          in the absence of PLM is outside the scope of OpenSAF
          Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
          Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
          Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
          dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
          processed='center(queued)=2197', processed='center(received)=1172',
          processed='destination(messages)=1172',
          processed='destination(mailinfo)=0',
          processed='destination(mailwarn)=0',
          processed='destination(localmessages)=955',
          processed='destination(newserr)=0',
          processed='destination(mailerr)=0', processed='destination(netmgm)=0',
          processed='destination(warn)=44', processed='destination(console)=13',
          processed='destination(null)=0', processed='destination(mail)=0',
          processed='destination(xconsole)=13',
          processed='destination(firewall)=0', processed='destination(acpid)=0',
          processed='destination(newscrit)=0',
          processed='destination(newsnotice)=0', processed='source(src)=1172'
          Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
          on saImmOmSearchNext - aborting
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
          FAILED status:1
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
          IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
          IMM_NODE_FULLY_AVAILABLE (2484)
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
          in ImmModel
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
          broadcasting ABORT_SYNC, epoch:12
          Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
          committing with ccbId:100000054/4294967380
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
          probation timer started (timeout: 1200000000000 ns)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
          failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
          escalated from 'componentFailover' to 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
          'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
          'healthCheckcallbackTimeout' : Recovery is 'suFailover'
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
          safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
          to:healthCheckcallbackTimeout Recovery is:suFailover
          Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
          NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
          node failfast, OwnNodeId = 131343, SupervisionTime = 60
          Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
          timeout=60

          syslog, immnd and immd traces of SC-1 attached.


          Sent from sourceforge.net because
          opensaf-tickets@lists.sourceforge.netopensaf-tickets@lists.sourceforge.net
          opensaf-tickets@lists.sourceforge.net is subscribed to
          http://sourceforge.net/p/opensaf/tickets/

          To unsubscribe from further messages, a project admin can change
          settings at http://sourceforge.net/p/opensaf/admin/tickets/options.
          Or, if this is a mailing list, you can unsubscribe from the mailing list.

          ------------------------------------------------------------------------

          Opensaf-tickets mailing list

          Opensaf-tickets@lists.sourceforge.netOpensaf-tickets@lists.sourceforge.net
          Opensaf-tickets@lists.sourceforge.net

          https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


          [tickets:#1291] http://sourceforge.net/p/opensaf/tickets/1291/ IMM:
          IMMD healthcheck callback timeout when standby controller rebooted in
          middle of IMMND sync

          Status: not-reproducible
          Milestone: never
          Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
          Last Updated: Wed Aug 19, 2015 08:49 AM UTC
          Owner: Neelakanta Reddy
          Attachments:


  • Anders Bjornerstedt

    • status: not-reproducible --> unassigned
    • Component: mds --> imm
    • Part: - --> d
    • Milestone: never --> 4.5.2
     
  • Sirisha Alla

    Sirisha Alla - 2015-08-19

    IMMD trace on changeset 6477

     
  • Anders Bjornerstedt

    The IMMD sends one fevs message at a time for each poll cycle.
    Also in each poll cycle it checks the AMF descriptor for healthcheck callbacks.

    This means that the IMMD is blocked for more than 3 minutes on broadcasting one fevs message.

    IMMSV_DEFAULT_FEVS_MAX_PENDING affects the IMMND process, not the IMMD.

    FEVS_MAX_PENDING is there precisely so that the IMMD is not overloaded.
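
    To make the above concrete, here is a minimal sketch -- not the actual
    IMMD source; the descriptors and helpers (mbx_fd, amf_fd, broadcast_fevs,
    answer_healthcheck) are hypothetical -- of how a single-threaded poll
    loop that both forwards fevs messages and answers AMF healthchecks can
    miss the healthcheck deadline when the broadcast call blocks:

    #include <poll.h>
    #include <stdbool.h>

    extern int mbx_fd;                /* mailbox holding queued fevs requests  */
    extern int amf_fd;                /* AMF selection object                  */
    extern bool broadcast_fevs(void); /* may block while MDS delivers to nodes */
    extern void answer_healthcheck(void);

    void immd_main_loop(void)
    {
        struct pollfd fds[2] = {
            { .fd = mbx_fd, .events = POLLIN },
            { .fd = amf_fd, .events = POLLIN },
        };

        for (;;) {
            if (poll(fds, 2, -1) <= 0)
                continue;

            if (fds[0].revents & POLLIN) {
                /* One fevs message per poll cycle; if this call blocks for
                 * minutes, the AMF descriptor below is not serviced. */
                broadcast_fevs();
            }

            if (fds[1].revents & POLLIN) {
                /* Must run before the AMF healthcheck callback timeout. */
                answer_healthcheck();
            }
        }
    }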

     
  • Anders Bjornerstedt

    The IMMD is blocked on the (asynchronous) broadcast of one fevs message for more than 3 minutes. Changing component to MDS.

    A relevant question is what is otherwise special about this test.
    Is MDS over TCP used rather than TIPC? (MDS broadcast uses TIPC multicast, which is faster.)

    Clearly something is over- or under-dimensioned in this system.
    This test configuration probably needs special configuration for MDS and/or IMM (max sync batch size).

    Again, I don't see that this is a defect in IMM.
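
    As an aside on the TCP-versus-TIPC question above, the sketch below is
    purely illustrative (send_to_node and multicast_send are hypothetical
    helpers, not MDS APIs). It shows why a broadcast without true multicast
    support degenerates into one unicast send per destination, so the
    sender's latency grows with the number of nodes and with any
    per-destination backpressure:

    #include <stddef.h>

    extern int send_to_node(int node_id, const void *buf, size_t len);
    extern int multicast_send(const void *buf, size_t len);

    int broadcast_msg(const int *nodes, size_t n_nodes,
                      const void *buf, size_t len, int have_multicast)
    {
        if (have_multicast)                  /* e.g. TIPC multicast: one send */
            return multicast_send(buf, len);

        for (size_t i = 0; i < n_nodes; i++) {
            /* e.g. point-to-point transport: one send per peer; each call
             * can block on a slow or congested destination, serializing
             * the whole broadcast. */
            if (send_to_node(nodes[i], buf, len) != 0)
                return -1;
        }
        return 0;
    }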

     
  • Anders Bjornerstedt

    • Component: imm --> mds
     
  • Neelakanta Reddy

    • assigned_to: Neelakanta Reddy --> nobody
     
  • Neelakanta Reddy

    The present MDS_DIRECT_BUF_MAXSIZE = 65479

    A. IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = 90% of MDS_DIRECT_BUF_MAXSIZE = 58931:

    1. immnd traces (SC-1):

    Sep 15 14:24:12.205653 osafimmnd [3625:immnd_evt.c:3553] << immnd_fevs_local_checks
    Sep 15 14:24:12.205688 osafimmnd [3625:immnd_evt.c:3002] TR Replies pending:16
    Sep 15 14:24:12.205742 osafimmnd [3625:immnd_evt.c:3021] T2 SENDING FEVS TO IMMD
    Sep 15 14:24:12.205761 osafimmnd [3625:immsv_evt.c:5395] T8 Sending: IMMD_EVT_ND2D_FEVS_REQ_2 to 0
    Sep 15 14:24:12.206122 immsync [3993:imma_om_api.c:6395] << immsv_sync

    2. immd traces (SC-1):

    Sep 15 14:24:10.792610 osafimmd [3614:immd_mds.c:0750] >> immd_mds_bcast_send
    Sep 15 14:24:10.792624 osafimmd [3614:immsv_evt.c:5400] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0

    3. syslog (SC-1):

    Sep 15 14:24:06 SLES-SLOT-1 osafamfd[3748]: NO Node 'PL-3' left the cluster
    Sep 15 14:24:06 SLES-SLOT-1 osafclmd[3729]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 14:24:06 SLES-SLOT-1 osafclmd[3729]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 14:24:06 SLES-SLOT-1 osafimmnd[3625]: NO Global discard node received for nodeId:2030f pid:3192
    Sep 15 14:24:06 SLES-SLOT-1 osafimmnd[3625]: NO Implementer disconnected 10 <0, 2030f(down)> (MsgQueueService131855)
    Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629130] TIPC: Resetting link <1.1.1:eth1-1.1.3:eth2>, peer not responding
    Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629142] TIPC: Lost link <1.1.1:eth1-1.1.3:eth2> on network plane A
    Sep 15 14:24:12 SLES-SLOT-1 kernel: [ 521.629773] TIPC: Lost contact with <1.1.3>

    4. number of messages from 14:24:06 to 14:24:12:

    28 messages of approximately 90% of the maximum size are sent before the IMMD is blocked (131 messages sent in total).

    Sep 15 14:24:06.999676 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61636 size:56464
    Sep 15 14:24:07.231197 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61637 size:56414
    Sep 15 14:24:07.390262 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61638 size:56464
    Sep 15 14:24:07.541229 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61639 size:56444
    Sep 15 14:24:07.692762 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61640 size:56468
    Sep 15 14:24:07.845123 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61641 size:56430
    Sep 15 14:24:08.008833 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61642 size:56448
    Sep 15 14:24:08.149347 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61643 size:56464
    Sep 15 14:24:08.308973 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61644 size:56464
    Sep 15 14:24:08.450764 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61645 size:56414
    Sep 15 14:24:08.576393 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61646 size:56464
    Sep 15 14:24:08.715694 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61647 size:56468
    Sep 15 14:24:08.845637 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61648 size:56444
    Sep 15 14:24:08.987037 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61649 size:56468
    Sep 15 14:24:09.115771 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61650 size:56430
    Sep 15 14:24:09.262554 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61651 size:56444
    Sep 15 14:24:09.391070 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61652 size:56468
    Sep 15 14:24:09.528649 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61653 size:56444
    Sep 15 14:24:09.674827 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61654 size:56434
    Sep 15 14:24:09.798674 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61655 size:56464
    Sep 15 14:24:09.920046 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61656 size:56448
    Sep 15 14:24:10.062081 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61657 size:56464
    Sep 15 14:24:10.190174 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61658 size:56430
    Sep 15 14:24:10.306722 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61659 size:56448
    Sep 15 14:24:10.438324 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61660 size:56464
    Sep 15 14:24:10.554421 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61661 size:56468
    Sep 15 14:24:10.674673 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61662 size:56410
    Sep 15 14:24:10.791234 osafimmd [3614:immd_evt.c:0273] T5 immd_evt_proc_fevs_req send_count:61663 size:56464

    5. immnd traces on PL-4 (129 messages received in total):

    Sep 15 14:24:06.083682 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.217729 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.364716 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.487655 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.610578 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.754247 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.878103 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:06.994528 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:07.127492 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:24:07.242638 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:30:37.768066 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:30:37.768181 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0
    Sep 15 14:30:37.768418 osafimmnd [3580:immnd_evt.c:8709] T2 REMOTE FEVS received. Messages from me still pending:0

    B. IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = 50% of MDS_DIRECT_BUF_MAXSIZE = 32739:

    1. Loading without node restart -- 3 minutes and 36 seconds:
      Sep 15 18:25:28 SLES-SLOT-1 osafimmloadd: NO ** Loading from PBE file imm.db at /opt/imm_share **
      Sep 15 18:29:02 SLES-SLOT-1 osafimmnd[3760]: NO Ccb 1 COMMITTED (IMMLOADER)
      --

    Sep 15 18:29:04 SLES-SLOT-1 osafimmd[3749]: NO Ruling epoch changed to:12
    Sep 15 18:29:04 SLES-SLOT-1 osafimmd[3749]: NO ACT: SaImmRepositoryInitModeT changed and noted as being: SA_IMM_KEEP_REPOSITORY
    Sep 15 18:29:04 SLES-SLOT-1 osafimmloadd: NO Load ending normally

    2. Syncing without node restart -- 46 seconds:

    Sep 15 18:29:57 SLES-SLOT-1 osafimmloadd: NO Sync starting
    Sep 15 18:30:43 SLES-SLOT-1 osafimmloadd: IN Synced 60367 objects in total
    Sep 15 18:30:43 SLES-SLOT-1 osafimmnd[3760]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16818
    Sep 15 18:30:43 SLES-SLOT-1 osafimmloadd: NO Sync ending normally
    Sep 15 18:30:43 SLES-SLOT-1 osafimmnd[3760]: NO Epoch set to 14 in ImmModel

    3. For the #1291 test case, the sync takes 3 minutes and 42 seconds:

    Sep 15 16:58:59 SLES-SLOT-1 osafimmloadd: NO Sync starting
    Sep 15 16:59:01 SLES-SLOT-1 osafamfd[9581]: NO Node 'PL-3' left the cluster
    Sep 15 16:59:01 SLES-SLOT-1 osafclmd[9562]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 16:59:01 SLES-SLOT-1 osafclmd[9562]: NO Node 131855 went down. Not sending track callback for agents on that node
    Sep 15 16:59:01 SLES-SLOT-1 osafimmnd[9482]: NO Global discard node received for nodeId:2030f pid:8315
    Sep 15 16:59:01 SLES-SLOT-1 osafimmnd[9482]: NO Implementer disconnected 11 <0, 2030f(down)> (MsgQueueService131855)
    Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056187] TIPC: Resetting link <1.1.1:eth1-1.1.3:eth2>, peer not responding
    Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056198] TIPC: Lost link <1.1.1:eth1-1.1.3:eth2> on network plane A
    Sep 15 16:59:08 SLES-SLOT-1 kernel: [ 8893.056983] TIPC: Lost contact with <1.1.3>
    Sep 15 17:00:42 SLES-SLOT-1 kernel: [ 8987.444292] TIPC: Established link <1.1.1:eth1-1.1.3:eth2> on network plane A
    Sep 15 17:00:54 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:55 SLES-SLOT-1 sshd[9887]: Accepted publickey for root from 192.168.56.105 port 41897 ssh2
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:00:58 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:03 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:04 SLES-SLOT-1 osafimmd[9471]: NO Extended intro from node 2030f
    Sep 15 17:01:05 SLES-SLOT-1 osafimmd[9471]: NO Node 2030f request sync sync-pid:3228 epoch:0
    Sep 15 17:02:40 SLES-SLOT-1 osafimmloadd: IN Synced 60383 objects in total
    Sep 15 17:02:40 SLES-SLOT-1 osafimmnd[9482]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16818
    Sep 15 17:02:40 SLES-SLOT-1 osafimmloadd: NO Sync ending normally

    Observation:

    Even with the sync batch size set to 50% of MDS_DIRECT_BUF_MAXSIZE, the sync takes about 3 minutes and 40 seconds.
    Because of this, CLM and AMFD get TRY_AGAIN for more than 3 minutes and the cluster goes for reboot.
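
    For reference, the two batch sizes quoted in this comment follow from
    MDS_DIRECT_BUF_MAXSIZE by simple integer arithmetic; the helper below is
    only illustrative, not OpenSAF code:

    #include <stdio.h>

    #define MDS_DIRECT_BUF_MAXSIZE 65479 /* value quoted in this ticket */

    /* Batch size as a percentage of the MDS direct buffer size. */
    static unsigned sync_batch_size(unsigned percent)
    {
        return (MDS_DIRECT_BUF_MAXSIZE * percent) / 100;
    }

    int main(void)
    {
        printf("90%% of buffer: %u\n", sync_batch_size(90)); /* 58931 */
        printf("50%% of buffer: %u\n", sync_batch_size(50)); /* 32739 */
        return 0;
    }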

    System buffers:

    SLES-SLOT-1:~ # cat /proc/sys/net/ipv4/tcp_mem
    18165 24223 36330

    The default and maximum amount for the receive socket memory:
    SLES-SLOT-1:~ # cat /proc/sys/net/core/rmem_default
    126976
    SLES-SLOT-1:~ # cat /proc/sys/net/core/rmem_max
    131071
    The default and maximum amount for the send socket memory:

    SLES-SLOT-1:~ # cat /proc/sys/net/core/wmem_default
    126976
    SLES-SLOT-1:~ # cat /proc/sys/net/core/wmem_max
    131071

    The maximum amount of option memory buffers:

    SLES-SLOT-1:~ # cat /proc/sys/net/core/optmem_max
    20480
    SLES-SLOT-1:~ #

     
  • Anders Bjornerstedt

    • status: unassigned --> invalid
     
  • Anders Bjornerstedt

    If the problem is to be declared as a configuration error, i.e. solvable by adjusting one or more
    configuration values that are documented by OpenSAF, then the ticket should be closed as
    invalid.

     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.2 --> never
     