
#1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

Milestone: never
Status: invalid
Owner: nobody
Labels: None
Type: defect
Component: mds
Version: 4.6 FC
Severity: major
Updated: 2015-11-02
Created: 2015-03-30
Private: No

The issue is observed with 4.6 FC changeset 6377. The system is up and running with a single PBE and 50k objects. This issue is seen after http://sourceforge.net/p/opensaf/tickets/1290 occurs. An IMM application is running on the standby controller, and an immcfg command is run from a payload to set the CompRestartMax value to 1000. IMMND is killed twice on the standby controller, leading to #1290.

As a result, the standby controller left the cluster in the middle of sync, IMMD reported a healthcheck callback timeout, and the active controller also went for reboot. Following is the syslog of SC-1:

Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with <1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=2197', processed='center(received)=1172', processed='destination(messages)=1172', processed='destination(mailinfo)=0', processed='destination(mailwarn)=0', processed='destination(localmessages)=955', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=44', processed='destination(console)=13', processed='destination(null)=0', processed='destination(mail)=0', processed='destination(xconsole)=13', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60

syslog, immnd and immd traces of SC-1 attached.



Discussion

  • Sirisha Alla - 2015-03-30

    This issue is reproducible with the following steps:

    1) Reboot standby controller
    2) When IMMND coordinator on Active Starts synching with the standby controller, kill clmna/amfnd on one of the payloads

     
  • Neelakanta Reddy

    • summary: IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync --> MDS: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
     
  • Neelakanta Reddy

    IMMD is stuck in MDS:

    Mar 26 14:58:28.189765 osafimmd [9539:immd_mds.c:0750] >> immd_mds_bcast_send
    Mar 26 14:58:28.189777 osafimmd [9539:immsv_evt.c:5486] T8 Sending: IMMND_EVT_D2ND_GLOB_FEVS_REQ_2 to 0

     
  • Neelakanta Reddy

    • Component: imm --> mds
     
  • A V Mahesh (AVM)

    • Component: mds --> imm
     
  • A V Mahesh (AVM)

    Hi,

    When a node sync is in progress, the sync messages are sent at a very high rate. If, at that moment, any other node/IMMND in the cluster restarts (TIPC link down), the send buffers fill up to their maximum
    within the TIPC link tolerance of 1.5 sec, so sendto() at IMMD gets blocked and we hit 16 FEVS replies pending.

    So just increasing the SA_AIS_ERR_TRY_AGAIN sleep time of saImmOmSearchNext_2() in imm_loader above the TIPC link tolerance of 1.5 sec resolved the issue.

    I made this change in changeset 5578, but I am not able to see the change now in the latest code:

    changeset: 5578:80f9493def7e
    user: A V Mahesh mahesh.valla@oracle.com
    date: Thu Aug 14 14:45:05 2014 +0530
    summary: imm: tune imm macros according to mds mcast size [#851]

    diff --git a/osaf/services/saf/immsv/immloadd/imm_loader.cc b/osaf/services/saf/immsv/immloadd/imm_loader.cc
    --- a/osaf/services/saf/immsv/immloadd/imm_loader.cc
    +++ b/osaf/services/saf/immsv/immloadd/imm_loader.cc
    @@ -2452,7 +2452,13 @@ int syncObjectsOfClass(std::string class
         do
         {
             if(retries) {
    -            usleep(150000);
    +            /* If we receive TRY_AGAIN while sync is in progress, it means IMMD
    +               might have reached IMMSV_DEFAULT_FEVS_MAX_PENDING fevs_replies_pending.
    +               In general fevs_replies_pending is hit when messages have accumulated in the
    +               sender queue (most likely the receiver disconnected but the sender link is still
    +               within the TIPC link tolerance of 1.5 sec). So give enough time to recover, since
    +               sync is not priority messaging and this case can be hit because of multicast messaging. */
    +            sleep(2);
             }
             /* Synchronous for throttling sync */
             err = saImmOmSearchNext_2(searchHandle, &objectName, &attributes);

    Please check these two tickets for more details.

    https://sourceforge.net/p/opensaf/tickets/1172/
    https://sourceforge.net/p/opensaf/tickets/952/

    -AVM

     


  • Neelakanta Reddy

    Hi,

    Having sync always wait for two seconds causes a major performance reduction
    for sync on some configurations. Because of this, the patch was reverted in #1188.
    In #1188, the retry delay for each TRY_AGAIN is increased exponentially from 10 msec to
    0.5 sec, and the number of TRY_AGAIN retries is 320.
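    For illustration, a minimal sketch (not the actual #1188 patch) of such an exponential
    back-off around saImmOmSearchNext_2(), using the figures quoted above (10 msec start,
    0.5 sec cap, 320 retries); the helper name is hypothetical:

    #include <unistd.h>
    #include <saImmOm.h>

    /* Exponential back-off on SA_AIS_ERR_TRY_AGAIN: start at 10 ms, double on
     * every retry, cap at 500 ms, give up after 320 retries. */
    static SaAisErrorT search_next_with_backoff(SaImmSearchHandleT searchHandle,
                                                SaNameT *objectName,
                                                SaImmAttrValuesT_2 ***attributes)
    {
        SaAisErrorT err;
        useconds_t delay = 10000;          /* 10 ms initial delay */
        int retries = 0;

        do {
            if (retries) {
                usleep(delay);
                delay *= 2;
                if (delay > 500000)        /* cap at 0.5 sec */
                    delay = 500000;
            }
            err = saImmOmSearchNext_2(searchHandle, objectName, attributes);
        } while (err == SA_AIS_ERR_TRY_AGAIN && ++retries < 320);

        return err;
    }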

    In the present problem scenario there is no link loss, and the TIPC tolerance
    time may not be the cause; rather, the MDS send/receive buffer got filled because of
    multicast.

    The following adjustments were made as part of changeset 5578.

    +/* Adjust to 90% of MDS_DIRECT_BUF_MAXSIZE */
    +#define IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE ((MDS_DIRECT_BUF_MAXSIZE / 100) * 90)
    +#define IMMSV_MAX_OBJS_IN_SYNCBATCH (IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE / 10)

    Since MAX_SYNC_BATCH_SIZE is fixed at compile time, and some machines may or may not have larger
    send and receive buffer sizes, MAX_SYNC_BATCH_SIZE needs to be tuned more finely when
    multicast is used.
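    As a rough illustration of that point, a hypothetical sketch (not OpenSAF code) of a batch
    size that could be tuned at runtime while still being clamped to the MDS limit; the
    environment variable name and the helper are assumptions:

    #include <stdlib.h>

    /* Hypothetical: derive the sync batch size at runtime instead of fixing it
     * at compile time. 'mds_direct_buf_maxsize' stands for the MDS fragmentation
     * limit on this machine. */
    static unsigned int get_sync_batch_size(unsigned int mds_direct_buf_maxsize)
    {
        /* Default: 90% of the MDS direct buffer size, as in changeset 5578. */
        unsigned int batch = (mds_direct_buf_maxsize / 100) * 90;

        const char *env = getenv("IMMSV_SYNC_BATCH_SIZE");  /* assumed knob */
        if (env != NULL) {
            unsigned int requested = (unsigned int)strtoul(env, NULL, 10);
            /* Never exceed what MDS can carry in one direct/multicast message. */
            if (requested > 0 && requested < batch)
                batch = requested;
        }
        return batch;
    }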

    /Neel.

     
  • A V Mahesh (AVM)

    Neel,

    In the present problem scenario there is no link loss, and the TIPC tolerance
    time may not be the cause; rather, the MDS send/receive buffer got filled because of
    multicast.

    According to the tester's reproduction steps,
    link loss and the TIPC tolerance are involved:

    1) Reboot standby controller
    2) When IMMND coordinator on Active Starts synching with the standby controller, kill clmna/amfnd on one of the payloads

    Killing amfnd on one of the payloads will cause link loss, and the TIPC link tolerance comes into play.

    If the sender and receiver are in good condition, or the receiver's destination-down is detected by the sender, the buffer will never be filled to maximum because of multicast;
    we run out of buffers only if the receiver went down and the sender has not yet detected the down.

     
  • Mathi Naickan - 2015-03-31
    • Milestone: 4.6.RC1 --> 4.7-Tentative
     
  • Srikanth R - 2015-04-02

    IMMD health check timeout issue is also observed in #1301

     
  • Anders Bjornerstedt

    Defects must have the milestone set to the oldest supported branch where
    the problem can possibly occur.

    I believe the latest sync batch optimizations were done in 4.5.

     
  • Anders Bjornerstedt

    • Milestone: 4.7-Tentative --> 4.5.1
     
  • Anders Bjornerstedt

    A simple solution for this ticket could perhaps be to just increase the
    healthcheck callback timeout for IMMD.

     
  • Anders Bjornerstedt

    I think the problem is triggered by the IMM sync optimizations done in
    4.5 (?) by Oracle. The fix may lie in tweaking the sync optimization,
    in some adjustment in MDS, or in increasing the AMF healthcheck
    callback timeout for IMMD.

    Because AMF is involved, MDS is involved, the IMM sync optimizations done by Oracle are involved, and the problem has only been reproduced at Oracle, I
    think it is best that this ticket is handled at Oracle.

     
  • Anders Bjornerstedt

    • summary: MDS: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync --> IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
    • status: unassigned --> assigned
    • assigned_to: Neelakanta Reddy
     
  • Mathi Naickan - 2015-04-10
    • Milestone: 4.5.1 --> 4.5.2
     
  • Mathi Naickan - 2015-04-10

    Release management: the 4.5.1 tag is applied. Moving to 4.5.2.

     
  • Anders Bjornerstedt

    I suggest we close this defect ticket as "not reproducible".
    That is, unless someone can reproduce it.
    My best guess is that this is yet another side effect of testing an overloaded system.

    Since we have no load regulation mechanism and no overload protection mechanism,
    it is relatively easy to push the system until it starts to break down. That is what is happening here.
    The IMMND misses a timely response on a healthcheck heartbeat.
    Such a heartbeat timeout I expect (hope) is set to 2 or 3 minutes.

    The only reason for the healthcheck's existence is to detect and restart a hung/looping process.
    If a process is hung or looping, it will be so indefinitely. So we don't want false positives shooting
    down the service just because the system is temporarily overloaded. That just adds more load.

    If a process has not responded in a few minutes, then we really assume it is hung.
    The assumption here is also that a really hung process is a very rare kind of problem.
    This assumption is correct for the IMMND, because the IMMND only blocks
    on calls to MDS and (ironically) on some synchronous AMF calls.
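    To make the healthcheck mechanics concrete, a minimal sketch (handle initialization,
    component registration and error handling assumed to be done elsewhere) of how an AMF
    healthcheck callback is normally answered. If the process is hung or stuck in a blocking
    call, the dispatch loop is never reached, saAmfResponse() is never called, and AMF
    eventually reports healthCheckcallbackTimeout:

    #include <saAmf.h>

    static SaAmfHandleT amf_hdl;   /* assumed initialized via saAmfInitialize() */

    /* Healthcheck callback: a healthy process simply confirms it is alive. */
    static void healthcheck_cb(SaInvocationT invocation, const SaNameT *compName,
                               SaAmfHealthcheckKeyT *healthcheckKey)
    {
        (void)compName;
        (void)healthcheckKey;
        saAmfResponse(amf_hdl, invocation, SA_AIS_OK);
    }

    /* Called from the component's main poll loop. If the main loop is blocked
     * (for example in an MDS send), the pending callback is never dispatched,
     * no response reaches AMF, and the healthcheck times out. */
    static void dispatch_amf_events(void)
    {
        saAmfDispatch(amf_hdl, SA_DISPATCH_ALL);
    }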

     
  • Anders Bjornerstedt

    I propose that we increase the "saAmfHctDefMaxDuration" value from 3 minutes to 5 minutes in:

    safHealthcheckKey=Default,safVersion=4.0.0,safCompType=OpenSafCompTypeIMMND
    

    This is the only thing that can be done in the IMMSv.
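    For reference, a hedged sketch of how the proposed change could be applied at runtime
    through the IMM OM API (the equivalent of an immcfg modify). The CCB handle is assumed
    to be initialized and admin ownership already set; the 5-minute value is expressed in
    nanoseconds as an SaTimeT:

    #include <string.h>
    #include <saImmOm.h>

    /* Sketch only: raise saAmfHctDefMaxDuration to 5 minutes on the IMMND
     * default healthcheck type object. Error handling mostly omitted. */
    static SaAisErrorT set_immnd_hc_max_duration(SaImmCcbHandleT ccbHandle)
    {
        const char *dn = "safHealthcheckKey=Default,safVersion=4.0.0,"
                         "safCompType=OpenSafCompTypeIMMND";
        SaNameT objName;
        objName.length = strlen(dn);
        memcpy(objName.value, dn, objName.length);

        SaTimeT fiveMinutes = 300000000000LL;   /* 5 min in nanoseconds */
        SaImmAttrValueT values[] = { &fiveMinutes };

        SaImmAttrValuesT_2 attr = {
            .attrName = (char *)"saAmfHctDefMaxDuration",
            .attrValueType = SA_IMM_ATTR_SATIMET,
            .attrValuesNumber = 1,
            .attrValues = values
        };
        SaImmAttrModificationT_2 mod = {
            .modType = SA_IMM_ATTR_VALUES_REPLACE,
            .modAttr = attr
        };
        const SaImmAttrModificationT_2 *mods[] = { &mod, NULL };

        SaAisErrorT rc = saImmOmCcbObjectModify_2(ccbHandle, &objName, mods);
        if (rc == SA_AIS_OK)
            rc = saImmOmCcbApply(ccbHandle);
        return rc;
    }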

    The other alternatives are:
    (1)
    Place the ticket on MDS (since IMMND could only be blocked on an MDS asynchronous send).
    Yes, asynchronous send. It must be blocked in the MDS library processing of packing a huge
    message (sync buffer?) for asynchronous send. Stuck on MDS for 3 minutes.

    (2) Close the ticket.

     
  • Anders Bjornerstedt

    • status: assigned --> not-reproducible
    • Component: imm --> mds
    • Milestone: 4.5.2 --> never
     
  • Anders Bjornerstedt

    Closing this ticket as not reproducible.
    The original analysis for this ticket focused on imm-sync. But the problem is not with sync
    or with fevs flow control; that part is working and is actually slowing down the sync process in its
    push.

    The question is rather why the local IMMND hangs for more than 3 minutes.
    The only possibility that makes sense is that IMMND is blocked on one MDS request which
    is "asynchronous". This sounds preposterous, but it is the only possibility that I can see.
    This is also why I am changing the component to MDS.

    I am closing this major ticket as not reproducible since it has not been reproduced in the 5 months that the ticket has existed.

    If someone reproduces it then try to get a core-dump of the IMMND process. The backtrace
    should show where the process is blocked.

     
  • Sirisha Alla - 2015-08-19

    This issue is reproduced on changeset 6744. Syslog as follows:

    Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
    Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
    Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
    ......
    Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
    Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
    Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
    Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
    Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
    Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
    Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    .....
    Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
    Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
    Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

    In the above logs, is this the reason for IMMND hanging for 3 minutes?

    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
    Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...

     
    • Anders Bjornerstedt

      OK, but then the question simply becomes: why does the healthcheck callback not reach the IMMND, or why does the IMMND reply
      not reach the AMFND?

      /AndersBj


