We are running opensaf 4.4.0.
Here is a gdb stack trace of the osafamfnd crash:
==========================
(gdb) bt
#0  0x00007f457067f425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f4570682b8b in __GI_abort () at abort.c:91
#2  0x00007f4572105f21 in __osafassert_fail (__file=0x448498 "/home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/di.cc", line=569, func=0x4488b0 <avnd_di_susi_resp_send(avnd_cb_tag*, avnd_su_tag*, avnd_su_si_rec*)::__FUNCTION__> "avnd_di_susi_resp_send", __assertion=0x44837a "m_AVND_SU_IS_ASSIGN_PEND(su)") at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/libs/core/leap/sysf_def.c:278
#3  0x0000000000427a42 in avnd_di_susi_resp_send (cb=cb@entry=0x65e4a0 <_avnd_cb>, su=su@entry=0x2444980, si=si@entry=0x2439720) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/di.cc:569
#4  0x0000000000438c21 in avnd_su_pres_st_chng_prc (final_st=SA_AMF_PRESENCE_INSTANTIATION_FAILED, prv_st=SA_AMF_PRESENCE_INSTANTIATED, su=0x2444980, cb=0x65e4a0 <_avnd_cb>) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/susm.cc:1608
#5  avnd_su_pres_fsm_run (cb=cb@entry=0x65e4a0 <_avnd_cb>, su=0x2444980, comp=comp@entry=0x2444bb0, ev=<optimized out>) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/susm.cc:1394
#6  0x00000000004188b3 in avnd_comp_clc_st_chng_prc (cb=cb@entry=0x65e4a0 <_avnd_cb>, comp=comp@entry=0x2444bb0, prv_st=prv_st@entry=SA_AMF_PRESENCE_RESTARTING, final_st=final_st@entry=SA_AMF_PRESENCE_INSTANTIATION_FAILED) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/clc.cc:1298
#7  0x000000000041a512 in avnd_comp_clc_fsm_run (cb=cb@entry=0x65e4a0 <_avnd_cb>, comp=comp@entry=0x2444bb0, ev=<optimized out>) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/clc.cc:862
#8  0x000000000041aa39 in avnd_evt_clc_resp_evh (cb=0x65e4a0 <_avnd_cb>, evt=0x7f45640008c0) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/clc.cc:416
#9  0x000000000042c23c in avnd_evt_process (evt=0x7f45640008c0) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/main.cc:678
#10 avnd_main_process () at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/main.cc:619
#11 0x0000000000405328 in main (argc=1, argv=0x7fff7bcfc988) at /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/main.cc:178
(gdb)
Can you please update the ticket with the test steps to reproduce the problem?
This osafamfnd crash has been observed in our lab several times and is easy to trigger. We have some applications started by OpenSAF. If one application fails, OpenSAF tries to restart it. If the application then fails to restart, osafamfnd always crashes on the same assert at di.cc line 569.
Here is the syslog:
Aug 27 13:34:00 slot4-MW984 osafamfnd[22493]: NO 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' faulted due to 'passiveMonitorFailed' : Recovery is 'componentRestart'
Aug 27 13:34:00 slot4-MW984 zookeeper_sector_clean: Cleanup for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
Aug 27 13:34:00 slot4-MW984 charon: 02[KNL] 169.254.91.248 disappeared from bond0
Aug 27 13:34:01 slot4-MW984 CRON[26217]: (root) CMD (/usr/share/platform-config/atca/update-ssh-keys)
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_clean: Cleanup Complete for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: copying zkCleanup.movik.sector.sh to /usr/share/zookeeper/bin
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: zkId=2
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: using myId 2
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: executing script for type sector interface bond0:2
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: performing zkCleanup of /var/lib/zookeeper/movik.sector/
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_IP_BASE = 169.254.91.247
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_IP_MASK = 255.255.255.0
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_IP_CNT = 3
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_EXT_PORT = 2889
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MVK_ZK_INT_PORT = 3889
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: MY_IP = 169.254.91.248
Aug 27 13:34:05 slot4-MW984 charon: 02[KNL] 169.254.91.248 appeared on bond0
Aug 27 13:34:05 slot4-MW984 charon: 02[KNL] 169.254.91.248 disappeared from bond0
Aug 27 13:34:05 slot4-MW984 charon: 02[KNL] 169.254.91.248 appeared on bond0
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: copying zookeeper_environment.movik.sector to /etc/zookeeper/conf
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: copying zkServer.sh to /usr/share/zookeeper/bin
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: overwriting /etc/zookeeper/conf/conf.movik.sector/zoo.movik.sector.cfg
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: touching file /var/run/zookeeper.movik.sector/zookeeper_server.pid
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: \nserver.1=169.254.91.247:2889:3889\nserver.2=169.254.91.248:2889:3889\nserver.3=169.254.91.249:2889:3889\n
Aug 27 13:34:05 slot4-MW984 zookeeper_sector_inst: Instantiating CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
Aug 27 13:34:06 slot4-MW984 zookeeper_sector_inst: COMP_PID_MAP_FILE=/var/run/zookeeper.movik.sector/zookeeper_server.pid, PID=26336
Aug 27 13:34:06 slot4-MW984 amfpm: saAmfPmStart FAILED 12
Aug 27 13:34:06 slot4-MW984 osafamfnd[22493]: NO Instantiation of 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' failed
Aug 27 13:34:06 slot4-MW984 osafamfnd[22493]: NO Reason:'Exec of script success, but script exits with non-zero status'
Aug 27 13:34:06 slot4-MW984 osafamfnd[22493]: NO Exit code: 1
Aug 27 13:34:06 slot4-MW984 zookeeper_sector_clean: Cleanup for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
Aug 27 13:34:06 slot4-MW984 charon: 02[KNL] 169.254.91.248 disappeared from bond0
Aug 27 13:34:11 slot4-MW984 zookeeper_sector_clean: Cleanup Complete for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
Aug 27 13:34:17 slot4-MW984 zookeeper_sector_clean: Cleanup Complete for CompName: safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp
Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: WA 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' Presence State RESTARTING => INSTANTIATION_FAILED
Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: NO Component Failover trigerred for 'safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp': Failed component: 'safComp=ZookeeperSector_PL-4,safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp'
Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: NO 'safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,safApp=ZookeeperSectorApp' Presence State INSTANTIATED => INSTANTIATION_FAILED
Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: /home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/osaf/services/saf/amf/amfnd/di.cc:569: avnd_di_susi_resp_send: Assertion 'm_AVND_SU_IS_ASSIGN_PEND(su)' failed.
Aug 27 13:34:17 slot4-MW984 compress-core.sh: Running /etc/compressed-coredump.d/001_kdp_bypass_for_gtppx_crash
Aug 27 13:34:17 slot4-MW984 osafamfwd[22526]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: AMF unexpectedly crashed, OwnNodeId = 66561, SupervisionTime = 60
Aug 27 13:34:17 slot4-MW984 osafimmnd[22172]: AL AMF Node Director is down, terminate this process
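For anyone checking whether their own node hits the same sequence, here is a rough Python sketch that pulls the presence-state transitions and the assert out of a syslog. The regular expressions are only modelled on the osafamfnd lines in the excerpt above; the function name scan_syslog and the SAMPLE list are made up for illustration and are not part of OpenSAF.

```python
import re

# Patterns modelled on the osafamfnd lines quoted in this ticket (an
# assumption: other OpenSAF versions may format these messages differently).
STATE_RE = re.compile(
    r"osafamfnd\[\d+\]: \w+ '(?P<dn>[^']+)' Presence State "
    r"(?P<old>\w+) => (?P<new>\w+)"
)
ASSERT_RE = re.compile(
    r"osafamfnd\[\d+\]: (?P<file>\S+):(?P<line>\d+): (?P<func>\w+): "
    r"Assertion '(?P<expr>[^']+)' failed"
)


def scan_syslog(lines):
    """Return (transitions, asserts) found in an iterable of syslog lines.

    transitions: list of (dn, old_state, new_state)
    asserts:     list of (source_file, line_number, function, expression)
    """
    transitions, asserts = [], []
    for text in lines:
        m = STATE_RE.search(text)
        if m:
            transitions.append((m["dn"], m["old"], m["new"]))
            continue
        m = ASSERT_RE.search(text)
        if m:
            asserts.append((m["file"], int(m["line"]), m["func"], m["expr"]))
    return transitions, asserts


# Two lines copied from the syslog in this ticket, used as sample input.
SAMPLE = [
    "Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: NO "
    "'safSu=ZookeeperSectorSU_PL-4,safSg=ZookeeperSectorSG,"
    "safApp=ZookeeperSectorApp' Presence State INSTANTIATED => "
    "INSTANTIATION_FAILED",
    "Aug 27 13:34:17 slot4-MW984 osafamfnd[22493]: "
    "/home/ksenlu/sandbox/klu_main/cae/extern/opensaf4/opensaf-4.4.0/"
    "osaf/services/saf/amf/amfnd/di.cc:569: avnd_di_susi_resp_send: "
    "Assertion 'm_AVND_SU_IS_ASSIGN_PEND(su)' failed.",
]

if __name__ == "__main__":
    found_transitions, found_asserts = scan_syslog(SAMPLE)
    for dn, old, new in found_transitions:
        print(f"{dn}: {old} => {new}")
    for src, line_no, func, expr in found_asserts:
        print(f"assert {expr} failed in {func} at {src}:{line_no}")
```

To run it against a real log, pass an open file object instead of SAMPLE, for example scan_syslog(open("/var/log/syslog")); the log path is an assumption and depends on the distribution.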
Thanks for the information. Is this SU a PI or an NPI SU?
I am not sure what "PI" vs "NPI" means. How do I find that information for you?
Can you please provide the exact 4.4 release or changeset number against which the ticket was filed?
Thanks
-Nagu
Patch floated for review.
We have downloaded opensaf-4.4.0.tar.gz from sourceforge.
The floated patch will not apply directly to the 4.4 GA release.
You need to apply the following patches first:
1. #820 (changeset: 5092:90d97fea11dd)
2. #885 (changeset: 5292:dae9b0a66445)
3. #358 (changeset: 5617:80d69568d9f7)
Then you can apply the patch floated in the community for this ticket.
Last edit: Nagendra Kumar 2014-09-03
We ran the patch provided by Nagu, and the test result is positive.
"On analysing the syslog it was observed that Zookeeper was restarted and "RESTARTING => TERMINATION_FAILED", and opensaf did not crash."
Can I push it now? If nobody else has comments, I will push.
Thanks
-Nagu
Last edit: Nagendra Kumar 2014-09-11
changeset: 5779:6c62a01ef630
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Mon Sep 15 13:34:49 2014 +0530
summary: amfnd: perform su failover if npi su translates into inst fail state [#1025]
changeset: 5780:9ac53ee22ac2
branch: opensaf-4.5.x
parent: 5777:7b76c9933b05
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Mon Sep 15 13:35:18 2014 +0530
summary: amfnd: perform su failover if npi su translates into inst fail state [#1025]
changeset: 5781:b62f09e680af
branch: opensaf-4.4.x
tag: tip
parent: 5771:ca844aed9b16
user: Nagendra Kumar <nagendra.k@oracle.com>
date: Mon Sep 15 13:35:38 2014 +0530
summary: amfnd: perform su failover if npi su translates into inst fail state [#1025]
[staging:6c62a0]
[staging:9ac53e]
[staging:b62f09]