#1025 Amf : osafamfnd crashed during restart failure

4.3.3
fixed
None
defect
amf
nd
4.4.0
major
2015-02-19
2014-08-27
KANG-SEN LU
No

We are running opensaf 4.4.0.

Here is a gdb stack trace of osafamfnd crash:

==========================
(gdb) bt
#0  0x00007f457067f425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f4570682b8b in __GI_abort () at abort.c:91
#2  0x00007f4572105f21 in __osafassert_fail (
    __file=0x448498 "/home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/di.cc", line=569,
    func=0x4488b0 <avnd_di_susi_resp_send(avnd_cb_tag*, avnd_su_tag*, avnd_su_si_rec*)::__FUNCTION__> "avnd_di_susi_resp_send",
    __assertion=0x44837a "m_AVND_SU_IS_ASSIGN_PEND(su)")
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/libs/core/leap/sysf_def.c:278
#3  0x0000000000427a42 in avnd_di_susi_resp_send (cb=cb@entry=0x65e4a0 <_avnd_cb>, su=su@entry=0x2444980, si=si@entry=0x2439720)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/di.cc:569
#4  0x0000000000438c21 in avnd_su_pres_st_chng_prc (final_st=SA_AMF_PRESENCE_INSTANTIATION_FAILED, prv_st=SA_AMF_PRESENCE_INSTANTIATED, su=0x2444980, cb=0x65e4a0 <_avnd_cb>)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/susm.cc:1608
#5  avnd_su_pres_fsm_run (cb=cb@entry=0x65e4a0 <_avnd_cb>, su=0x2444980, comp=comp@entry=0x2444bb0, ev=<optimized out>)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/susm.cc:1394
#6  0x00000000004188b3 in avnd_comp_clc_st_chng_prc (cb=cb@entry=0x65e4a0 <_avnd_cb>, comp=comp@entry=0x2444bb0, prv_st=prv_st@entry=SA_AMF_PRESENCE_RESTARTING, final_st=final_st@entry=SA_AMF_PRESENCE_INSTANTIATION_FAILED)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/clc.cc:1298
#7  0x000000000041a512 in avnd_comp_clc_fsm_run (cb=cb@entry=0x65e4a0 <_avnd_cb>, comp=comp@entry=0x2444bb0, ev=<optimized out>)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/clc.cc:862
#8  0x000000000041aa39 in avnd_evt_clc_resp_evh (cb=0x65e4a0 <_avnd_cb>, evt=0x7f45640008c0)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/clc.cc:416
#9  0x000000000042c23c in avnd_evt_process (evt=0x7f45640008c0)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/main.cc:678
#10 avnd_main_process ()
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/main.cc:619
#11 0x0000000000405328 in main (argc=1, argv=0x7fff7bcfc988)
    at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/main.cc:178
(gdb)
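
The trace shows the assert firing in avnd_di_susi_resp_send (frame 3) after the SU presence state moved from INSTANTIATED to INSTANTIATION_FAILED (frame 4), with the macro m_AVND_SU_IS_ASSIGN_PEND(su) evaluating false. A minimal stand-alone sketch of that precondition follows. The types `PresenceState` and `SuModel` and the function `can_send_susi_response` are hypothetical stand-ins for illustration; they are not the real OpenSAF definitions, and the `assign_pending` flag is only an assumed model of what the macro checks.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; NOT the real amfnd types.
enum class PresenceState { Instantiated, Restarting, InstantiationFailed };

struct SuModel {
  PresenceState presence;
  bool assign_pending;  // assumed model of what m_AVND_SU_IS_ASSIGN_PEND(su) tests
};

// Models the precondition of the response path: a response to the director
// only makes sense while an assignment is pending on the SU. Returns false
// in the situation where the real code hit the assertion.
bool can_send_susi_response(const SuModel& su) {
  return su.assign_pending;
}

// Models the state seen in the trace: the SU was INSTANTIATED with no
// assignment in progress when the component restart failed.
SuModel su_after_failed_restart() {
  SuModel su{PresenceState::Instantiated, false};
  su.presence = PresenceState::InstantiationFailed;
  assert(!su.assign_pending);  // nothing pending in this model
  return su;
}
```

Calling `can_send_susi_response(su_after_failed_restart())` returns false in this model, which corresponds to the path that aborted in the trace.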

Related

Tickets: #1025
Wiki: ChangeLog-4.4.1

Discussion

  • Nagendra Kumar

    Nagendra Kumar - 2014-08-28

    Can you please add the test steps to reproduce the problem?

     
  • KANG-SEN LU

    KANG-SEN LU - 2014-08-28

    This osafamfnd crash has been observed in our lab several times, and it is easy to trigger. We have some applications started by opensaf. If one application fails, opensaf tries to restart it. If the application then fails to restart, osafamfnd always crashes on the same assert at di.cc line 569.

    Here is the syslog:

    Aug 27 13:34:00 slot4-MW984 osafamfnd[22493]: NO 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' faulted due to 'passiveMonitorFailed' : Recovery is 'componentRestart'
    Aug 27 13:34:00 slot4-MW984 zookeeper_sector_clean: Cleanup for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
    Aug 27 13:34:00 slot4-MW984 charon: 02[KNL] 169.254.91.248 disappeared from bond0
    Aug 27 13:34:01 slot4-MW984 CRON[26217]: (root) CMD (/usr/share/platform-config/atca/update-ssh-keys)
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_clean: Cleanup Complete for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: copying zkCleanup.movik.sector.sh to /usr/share/zookeeper/bin
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: zkId=2
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: using myId 2
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: executing script for type sector interface bond0:2
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: performing zkCleanup of /var/lib/zookeeper/movik.sector/
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_IP_BASE = 169.254.91.247
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_IP_MASK = 255.255.255.0
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_IP_CNT = 3
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_EXT_PORT = 2889
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_INT_PORT = 3889
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MY_IP = 169.254.91.248
    Aug 27 13:34:05 slot4-MW984 charon: 02[KNL] 169.254.91.248 appeared on bond0
    Aug 27 13:34:05 slot4-MW984 charon: 02[KNL] 169.254.91.248 disappeared from bond0
    Aug 27 13:34:05 slot4-MW984 charon: 02[KNL] 169.254.91.248 appeared on bond0
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: copying zookeeper_environment.movik.sector to /etc/zookeeper/conf
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: copying zkServer.sh to /usr/share/zookeeper/bin
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: overwriting /etc/zookeeper/conf/conf.movik.sector/zoo.movik.sector.cfg
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: touching file /var/run/zookeeper.movik.sector/zookeeper_server.pid
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: \nserver.1=169.254.91.247:2889:3889\nserver.2=169.254.91.248:2889:3889\nserver.3=169.254.91.249:2889:3889\n
    Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: Instantiating CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
    Aug 27 13:34:06 slot4-MW984 zookeeper_sector_inst: COMP_PID_MAP_FILE=/var/run/zookeeper.movik.sector/zookeeper_server.pid, PID=26336
    Aug 27 13:34:06 slot4-MW984 amfpm: saAmfPmStart FAILED 12
    Aug 27 13:34:06 slot4-MW984 osafamfnd[22493]: NO Instantiation of 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' failed
    Aug 27 13:34:06 slot4-MW984 osafamfnd[22493]: NO Reason:'Exec of script success, but script exits with non-zero status'
    Aug 27 13:34:06 slot4-MW984 osafamfnd[22493]: NO Exit code: 1
    Aug 27 13:34:06 slot4-MW984 zookeeper_sector_clean: Cleanup for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
    Aug 27 13:34:06 slot4-MW984 charon: 02[KNL] 169.254.91.248 disappeared from bond0
    Aug 27 13:34:11 slot4-MW984 zookeeper_sector_clean: Cleanup Complete for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp

    Aug 27 13:34:17 slot4-MW984 zookeeper_sector_clean: Cleanup Complete for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
    Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: WA 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' Presence State RESTARTING => INSTANTIATION_FAILED
    Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: NO Component Failover trigerred for 'safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp': Failed component: 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp'
    Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: NO 'safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' Presence State INSTANTIATED => INSTANTIATION_FAILED
    Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/di.cc:569: avnd_di_susi_resp_send: Assertion 'm_AVND_SU_IS_ASSIGN_PEND(su)' failed.
    Aug 27 13:34:17 slot4-MW984 compress-core.sh: Running /etc/compressed-coredump.d/001_kdp_bypass_for_gtppx_crash
    Aug 27 13:34:17 slot4-MW984 osafamfwd[22526]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: AMF unexpectedly crashed, OwnNodeId = 66561, SupervisionTime = 60
    Aug 27 13:34:17 slot4-MW984 osafimmnd[22172]: AL AMF Node Director is down, terminate this process
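
The two transitions that precede the assert (component RESTARTING => INSTANTIATION_FAILED, then SU INSTANTIATED => INSTANTIATION_FAILED) can be pulled out of a syslog capture mechanically. The following is a minimal sketch, assuming the message layout seen in the osafamfnd lines above; `PresenceTransition` and `parse_presence_transition` are hypothetical helper names and are not part of OpenSAF.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper type: one "Presence State A => B" record from syslog.
struct PresenceTransition {
  std::string dn;    // quoted DN of the component or SU
  std::string from;  // previous presence state
  std::string to;    // new presence state
};

// Extracts a presence state transition from a single syslog line.
// Returns false if the line does not contain one; 'out' is then untouched.
bool parse_presence_transition(const std::string& line, PresenceTransition& out) {
  // Assumed layout: '<DN>' Presence State <FROM> => <TO>
  static const std::regex re(R"('([^']+)' Presence State (\w+) => (\w+))");
  std::smatch m;
  if (!std::regex_search(line, m, re)) {
    return false;
  }
  out.dn = m[1].str();
  out.from = m[2].str();
  out.to = m[3].str();
  return true;
}
```

Feeding each line of the capture through `parse_presence_transition` and keeping the hits gives the ordered list of state changes leading up to the crash.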

     
  • Nagendra Kumar

    Nagendra Kumar - 2014-08-28

    Thanks for the information. Is this SU a PI or an NPI SU?

     
  • KANG-SEN LU

    KANG-SEN LU - 2014-08-28

    I am sure what is "PI" vs "NPI". How do I find that info for you?

     
  • KANG-SEN LU

    KANG-SEN LU - 2014-08-28

    I meant to say "not sure".

     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-01
    • status: unassigned --> assigned
    • assigned_to: Nagendra Kumar
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-01
    • summary: osafamfd crashed during restart failure --> Amf : osafamfnd crashed during restart failure
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,6 +1,6 @@
     We are running opensaf 4.4.0.
    
    -Here is a gdb stack trace of osafamfd crash:
    +Here is a gdb stack trace of osafamfnd crash:
    
         ==========================
         (gdb) bt
    
    • Component: unknown --> amf
    • Part: - --> nd
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-01

    Can you please provide the exact 4.4 release or changeset number against which the ticket was filed?

    Thanks
    -Nagu

     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-02
    • status: assigned --> review
     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-02

    Patch floated for review.

     
  • KANG-SEN LU

    KANG-SEN LU - 2014-09-02

    We downloaded opensaf-4.4.0.tar.gz from SourceForge.

     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-03

    The floated patch will not apply directly on the 4.4 GA release.

    You need to take the following patches first:
    1. #820 (changeset: 5092:90d97fea11dd)
    2. #885 (changeset: 5292:dae9b0a66445)
    3. #358 (changeset: 5617:80d69568d9f7)
    And then you can apply the patch floated in the community for this ticket.

     

    Last edit: Nagendra Kumar 2014-09-03
  • KANG-SEN LU

    KANG-SEN LU - 2014-09-11

    We ran the patch provided by Nagu, and the test result is positive.

    “On analysing the syslog it was observed that Zookeeper was restarted and “RESTARTING => TERMINATION_FAILED”, and opensaf did not crash.”

     
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-11

    Can I push it now? Does anybody else have comments? Otherwise I will push.

    Thanks
    -Nagu

     

    Last edit: Nagendra Kumar 2014-09-11
  • Nagendra Kumar

    Nagendra Kumar - 2014-09-15

    changeset: 5779:6c62a01ef630
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Mon Sep 15 13:34:49 2014 +0530
    summary: amfnd: perform su failover if npi su translates into inst fail state [#1025]

    changeset: 5780:9ac53ee22ac2
    branch: opensaf-4.5.x
    parent: 5777:7b76c9933b05
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Mon Sep 15 13:35:18 2014 +0530
    summary: amfnd: perform su failover if npi su translates into inst fail state [#1025]

    changeset: 5781:b62f09e680af
    branch: opensaf-4.4.x
    tag: tip
    parent: 5771:ca844aed9b16
    user: Nagendra Kumar <nagendra.k@oracle.com>
    date: Mon Sep 15 13:35:38 2014 +0530
    summary: amfnd: perform su failover if npi su translates into inst fail state [#1025]

    [staging:6c62a0]
    [staging:9ac53e]
    [staging:b62f09]

     

    Related

    Tickets: #1025
    Commit: [6c62a0]
    Commit: [9ac53e]
    Commit: [b62f09]
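
The commit summary above ("perform su failover if npi su translates into inst fail state") describes a decision rule that can be sketched in isolation. This is only an illustration of that one sentence under stated assumptions: `SuInfo`, `PresenceState`, and `needs_su_failover` are hypothetical names, and the sketch does not reproduce the actual changes in the changesets.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; NOT the real amfnd types.
enum class PresenceState {
  Instantiated,
  Restarting,
  InstantiationFailed,
  TerminationFailed
};

struct SuInfo {
  bool pre_instantiable;   // false models an NPI SU
  PresenceState presence;  // presence state the SU has just moved to
};

// Mirrors only the rule stated in the commit summary: an NPI SU that ends up
// in the instantiation-failed state is handled by SU failover. What happens
// in every other case is deliberately left out of this sketch.
bool needs_su_failover(const SuInfo& su) {
  return !su.pre_instantiable &&
         su.presence == PresenceState::InstantiationFailed;
}
```

In this model, `needs_su_failover(SuInfo{false, PresenceState::InstantiationFailed})` is the only combination that returns true.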

  • Nagendra Kumar

    Nagendra Kumar - 2014-09-15
    • status: review --> fixed
     