Menu

#2101 AMF: Heartbeat timeout observed after ImmNd restart

5.2.RC1
not-reproducible
None
defect
amf
-
major
2017-03-10
2016-10-07
No

OS : Suse 64bit
Changeset : 8190
Setup : 4 physical nodee 1 PBE enabled with 1Lakh load

Step
1. Bringu opensaf on four nodes
2. Imm test cases running with ndrestart scenario on standby controller
3. After 1 of the ndrestart amfd heart beat timeout happened. Below is the backtrace.

Coredump:
0 0x00007fa7c680161c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
1 0x000000000044da7d in avd_imm_reinit_bg () at imm.cc:1949
2 0x0000000000453e33 in main_loop () at main.cc:737
3 0x00000000004541ff in main (argc=2, argv=0x7fff1cbfa268) at main.cc:848
(gdb) thread apply all bt

Thread 6 (Thread 0x7fa7c78fcb00 (LWP 3837)):
0 0x00007fa7c57124f6 in poll () from /lib64/libc.so.6
1 0x00007fa7c74f80d3 in osaf_ppoll (io_fds=0x7fa7c78fc1c0, i_nfds=1, i_timeout_ts=0x7fa7c78fc180, i_sigmask=0x0) at osaf_poll.c:105
2 0x00007fa7c74f7ff9 in osaf_poll (io_fds=0x7fa7c78fc1c0, i_nfds=1, i_timeout=30000) at osaf_poll.c:44
3 0x00007fa7c74f81c8 in osaf_poll_one_fd (i_fd=13, i_timeout=30000) at osaf_poll.c:128
4 0x00007fa7c65f3a0c in rda_read_msg (sockfd=13, msg=0x7fa7c78fc260 "10 2", size=64) at rda_papi.cc:673
5 0x00007fa7c65f31ec in rda_callback_task (rda_callback_cb=0x7963e0) at rda_papi.cc:150
6 0x00007fa7c67fd7b6 in start_thread () from /lib64/libpthread.so.0
7 0x00007fa7c571b9cd in clone () from /lib64/libc.so.6
8 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7fa7c4e0b700 (LWP 5557)):
0 0x00007fa7c57124f6 in poll () from /lib64/libc.so.6
1 0x00007fa7c74f80d3 in osaf_ppoll (io_fds=0x7fa7c4e09d40, i_nfds=1, i_timeout_ts=0x7fa7c4e09d00, i_sigmask=0x0) at osaf_poll.c:105
2 0x00007fa7c74f7ff9 in osaf_poll (io_fds=0x7fa7c4e09d40, i_nfds=1, i_timeout=10000) at osaf_poll.c:44
3 0x00007fa7c74f81c8 in osaf_poll_one_fd (i_fd=29, i_timeout=10000) at osaf_poll.c:128
4 0x00007fa7c7535605 in mds_mcm_time_wait (sel_obj=0x80c3d8, time_val=1000) at mds_c_sndrcv.c:2570
5 0x00007fa7c75351b1 in mcm_pvt_normal_svc_sndrsp (env_hdl=131071, fr_svc_id=26, msg=0x7fa7c4e0a0c0, to_dest=565214705385538, to_svc_id=25, req=0x7fa7c4e09e60, pri=MDS_SEND_PRIORITY_MEDIUM)
at mds_c_sndrcv.c:2457
6 0x00007fa7c7530d87 in mds_mcm_send (info=0x7fa7c4e09f90) at mds_c_sndrcv.c:690
7 0x00007fa7c7530170 in mds_send (info=0x7fa7c4e09f90) at mds_c_sndrcv.c:390
8 0x00007fa7c752fde1 in ncsmds_api (svc_to_mds_info=0x7fa7c4e09f90) at mds_papi.c:191
9 0x00007fa7c6e527a7 in imma_mds_msg_sync_send (imma_mds_hdl=131071, destination=0x7fa7c707a850 <imma_cb+144>, i_evt=0x7fa7c4e0a0c0, o_evt=0x7fa7c4e0a058, timeout=1000) at imma_mds.c:604
10 0x00007fa7c6e47bd6 in search_next_common (searchHandle=1475865130618188739, objectName=0x7fa7c4e0a298, attributes=0x7fa7c4e0a318, bUseString=false) at imma_om_api.c:7584
11 0x00007fa7c6e475f9 in saImmOmSearchNext_2 (searchHandle=1475865130618188739, objectName=0x7fa7c4e0a370, attributes=0x7fa7c4e0a318) at imma_om_api.c:7444
12 0x00000000004fb140 in immutil_saImmOmSearchNext_2 (searchHandle=1475865130618188739, objectName=0x7fa7c4e0a370, attributes=0x7fa7c4e0a318) at immutil.c:1818
13 0x0000000000431bc2 in avd_compcstype_config_get (name="safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF", comp=0x806e50) at compcstype.cc:306
14 0x0000000000429c5c in avd_comp_config_get (su_name="safSu=SC-2,safSg=2N,safApp=OpenSAF", su=0x7b08c0) at comp.cc:756
15 0x00000000004d9f63 in avd_su_config_get (sg_name="safSg=2N,safApp=OpenSAF", sg=0x7bb4c0) at su.cc:717
16 0x000000000048c7dc in avd_sg_config_get (app_dn="safApp=OpenSAF", app=0x7c08b0) at sg.cc:457
17 0x000000000040a88a in avd_app_config_get () at app.cc:460
18 0x000000000044c154 in avd_imm_config_get () at imm.cc:1574
19 0x000000000044d6e6 in avd_imm_reinit_bg_thread (_cb=0x75dba0 <_control_block>) at imm.cc:1891
20 0x00007fa7c67fd7b6 in start_thread () from /lib64/libpthread.so.0
21 0x00007fa7c571b9cd in clone () from /lib64/libc.so.6
22 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fa7c791cb00 (LWP 3836)):
0 0x00007fa7c57124f6 in poll () from /lib64/libc.so.6
1 0x00007fa7c7549696 in mdtm_process_recv_events () at mds_dt_tipc.c:665
2 0x00007fa7c67fd7b6 in start_thread () from /lib64/libpthread.so.0
3 0x00007fa7c571b9cd in clone () from /lib64/libc.so.6
4 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7fa7bffff700 (LWP 5586)):
0 0x00007fa7c6804294 in __lll_lock_wait () from /lib64/libpthread.so.0
1 0x00007fa7c67ff619 in _L_lock_1008 () from /lib64/libpthread.so.0
2 0x00007fa7c67ff42e in pthread_mutex_lock () from /lib64/libpthread.so.0
3 0x000000000044d358 in osaf_mutex_lock_ordie (io_mutex=0x75d580 <imm_reinit_mutex>) at ../../../../../osaf/libs/core/common/include/osaf_utility.h:72
4 avd_imm_reinit_bg_thread (_cb=0x75dba0 <_control_block>) at imm.cc:1838
5 0x00007fa7c67fd7b6 in start_thread () from /lib64/libpthread.so.0
6 0x00007fa7c571b9cd in clone () from /lib64/libc.so.6
7 0x0000000000000000 in ?? ()
---Type <return> to continue, or q <return> to quit---

Thread 2 (Thread 0x7fa7c798bb00 (LWP 3835)):
0 0x00007fa7c57124f6 in poll () from /lib64/libc.so.6
1 0x00007fa7c74f80d3 in osaf_ppoll (io_fds=0x7fa7c798b270, i_nfds=1, i_timeout_ts=0x7fa7c798b2a0, i_sigmask=0x0) at osaf_poll.c:105
2 0x00007fa7c7502627 in ncs_tmr_wait () at sysf_tmr.c:409
3 0x00007fa7c67fd7b6 in start_thread () from /lib64/libpthread.so.0
4 0x00007fa7c571b9cd in clone () from /lib64/libc.so.6
5 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7fa7c7954720 (LWP 3833)):
0 0x00007fa7c680161c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
1 0x000000000044da7d in avd_imm_reinit_bg () at imm.cc:1949
2 0x0000000000453e33 in main_loop () at main.cc:737
3 0x00000000004541ff in main (argc=2, argv=0x7fff1cbfa268) at main.cc:84

Syslog:

Oct 8 00:02:10 SLES-SLOT2 sudo: tet : TTY=unknown ; PWD=/tmp/05420aa ; USER=root ; COMMAND=/bin/kill -9 5691
Oct 8 00:02:10 SLES-SLOT2 osafimmd[3774]: NO MDS event from svc_id 25 (change:4, dest:565214705385538)
Oct 8 00:02:10 SLES-SLOT2 osafamfnd[3843]: NO Restarting a component of 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' (comp restart count: 2)
Oct 8 00:02:10 SLES-SLOT2 osafimmd[3774]: IN Standby IMMD recording IMMND DOWN for node 2020f
Oct 8 00:02:10 SLES-SLOT2 osafamfnd[3843]: NO 'safComp=IMMND,safSu=SC-2,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: Started
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
Oct 8 00:02:10 SLES-SLOT2 osafimmd[3774]: NO MDS event from svc_id 25 (change:3, dest:565214705262608)
Oct 8 00:02:10 SLES-SLOT2 osafimmd[3774]: IN Added IMMND node with dest 565214705262608
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Oct 8 00:02:10 SLES-SLOT2 osafimmnd[5717]: NO NODE STATE-> IMM_NODE_ISOLATED
Oct 8 00:02:12 SLES-SLOT2 osafimmd[3774]: NO SBY: Ruling epoch noted as:44
Oct 8 00:02:12 SLES-SLOT2 osafimmd[3774]: NO IMMND coord at 2010f
Oct 8 00:02:12 SLES-SLOT2 osafimmnd[5717]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
Oct 8 00:02:12 SLES-SLOT2 osafimmnd[5717]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Oct 8 00:02:18 SLES-SLOT2 osafamfnd[3843]: ER AMF director heart beat timeout, generating core for amfd
Oct 8 00:02:19 SLES-SLOT2 osafamfnd[3843]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131599, SupervisionTime = 60

Oct 8 00:02:19 SLES-SLOT2 opensaf_reboot: Rebooting local node; timeout=60
Oct 8 00:02:19 SLES-SLOT2 osafimmd[3774]: NO SBY: New Epoch for IMMND process at node 2010f old epoch: 43 new epoch:44

Syslog and amd traces attached. Will share immnd traces and other controller syslog on request as they are huge

2 Attachments

Discussion

  • Chani Srivastava

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -2,6 +2,10 @@
     Changeset : 8190 ( 5.1.FC)
     Setup : 4 physical nodee 1 PBE enabled with 1Lakh load
    
    +Step 
    +1. Bringu opensaf on four nodes
    +2. Imm test cases running with ndrestart scenario on standby controller
    +3. After 1 of the ndrestart amfd heart beat timeout happened. Below is the backtrace.
    
     Coredump:
     0 0x00007fa7c680161c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    
     
  • Chani Srivastava

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,5 +1,5 @@
     OS : Suse 64bit
    -Changeset : 8190 ( 5.1.FC)
    +Changeset : 8190
     Setup : 4 physical nodee 1 PBE enabled with 1Lakh load
    
     Step 
    
     
  • Anders Widell

    Anders Widell - 2017-02-28
    • Milestone: 5.2.FC --> 5.2.RC1
     
  • Nagendra Kumar

    Nagendra Kumar - 2017-03-01

    This might have been fixed by:
    changeset: 8365:1108027c16f0
    branch: opensaf-5.1.x
    parent: 8362:13742b479d92
    user: Praveen Malviya praveen.malviya@oracle.com
    date: Fri Nov 25 15:56:05 2016 +0530
    summary: amfd: do not spawn multiple threads for imm init[#2188] V2

    Please retest on 5.2 FC release.

     

    Related

    Tickets: #2188

  • Nagendra Kumar

    Nagendra Kumar - 2017-03-01
    • status: unassigned --> accepted
    • assigned_to: Nagendra Kumar
     
  • Chani Srivastava

    Not able to reproduce it with latest changeset 8634 ( 5.2.FC)

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-03-10

    Thanks for the information Chani.
    Please repoen the ticket with Amf and Imm traces if it gets reproduced later on also.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-03-10
    • status: accepted --> not-reproducible
     

Log in to post a comment.