
#2110 AMF : amfd aborted on both controllers after opensafd stopped on payload

future
assigned
None
defect
amf
-
major
2022-11-24
2016-10-11
Srikanth R
No

Changeset : 5.1GA 8190
Setup : 4-node setup with PBE enabled (100,000 objects) and the headless feature enabled.

Steps performed :
-> Brought up opensaf on the 4-node setup.
-> Ran the IMM test application on Oct 8 and also performed middleware failovers.
-> Left the setup idle for two days.
-> On Oct 10 at 14:07:38, stopped opensaf on PL-4, after which amfd on both controllers aborted.

Oct 10 14:07:38 SLES-SLOT1 osafimmnd[2748]: NO Global discard node received for nodeId:2040f pid:3261
Oct 10 14:07:38 SLES-SLOT1 osafamfd[2788]: NO Node 'PL-4' left the cluster
Oct 10 14:07:38 SLES-SLOT1 osafamfd[2788]: su.cc:2006: dec_curr_act_si: Assertion 'saAmfSUNumCurrActiveSIs > 0' failed.
Oct 10 14:07:38 SLES-SLOT1 osafamfnd[2798]: WA AMF director unexpectedly crashed

Below is the backtrace :

2 0x00007f7426025197 in osafassert_fail (file=0x51b4ed "su.cc", line=2006,
func=0x51ce30 <AVD_SU::dec_curr_act_si()::__FUNCTION__> "dec_curr_act_si", __assertion=0x51c884 "saAmfSUNumCurrActiveSIs > 0") at sysf_def.c:281
3 0x00000000004de88c in AVD_SU::dec_curr_act_si (this=0x7bde40) at su.cc:2006
4 0x00000000004c504e in avd_susi_delete (cb=0x75dba0 <_control_block>, susi=0x7eb940, ckpt=false) at siass.cc:554
5 0x000000000049a326 in SG_NORED::node_fail (this=0x7bc210, cb=0x75dba0 <_control_block>, su=0x7bde40) at sg_nored_fsm.cc:781
6 0x00000000004bd4d7 in avd_node_down_mw_susi_failover (cb=0x75dba0 <_control_block>, avnd=0x7b04d0) at sgproc.cc:1983
7 0x0000000000461a77 in avd_node_failover (node=0x7b04d0) at ndproc.cc:1142
8 0x0000000000459d63 in avd_mds_avnd_down_evh (cb=0x75dba0 <_control_block>, evt=0x7f741c002270) at ndfsm.cc:684
9 0x0000000000453f60 in process_event (cb_now=0x75dba0 <_control_block>, evt=0x7f741c002270) at main.cc:775
10 0x0000000000453c83 in main_loop () at main.cc:696
11 0x00000000004541ff in main (argc=2, argv=0x7fffedc7f828) at main.cc:848
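
The backtrace above shows the abort coming from the guard in AVD_SU::dec_curr_act_si, reached through avd_susi_delete during node failover: the SU's active-SI counter was already zero when it was decremented. The following is a minimal sketch of that invariant only. SuModel and delete_assignment_twice are hypothetical names, the counter is assumed to be an unsigned 32-bit value, and the double delete is just one illustrative way to reach zero early, not a diagnosis of this ticket.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-SU counter; this is not the real AVD_SU class.
struct SuModel {
  std::uint32_t num_curr_active_sis = 0;  // models saAmfSUNumCurrActiveSIs (assumed unsigned)

  void inc_curr_act_si() { ++num_curr_active_sis; }

  // Returns false when the counter is already zero. This is the condition
  // that the assertion 'saAmfSUNumCurrActiveSIs > 0' rejects; a real
  // decrement here would wrap the unsigned value around.
  bool dec_curr_act_si() {
    if (num_curr_active_sis == 0) return false;
    --num_curr_active_sis;
    return true;
  }
};

// Illustrative only: one active assignment is accounted for, then removed
// twice. The second removal violates the invariant.
bool delete_assignment_twice(SuModel& su) {
  su.inc_curr_act_si();
  bool first = su.dec_curr_act_si();
  bool second = su.dec_curr_act_si();
  return first && second;
}
```

With a fresh SuModel, delete_assignment_twice returns false and leaves the counter at zero, which corresponds to the state in which the real code asserts.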

Below is the amfd trace :

Oct 10 14:07:38.712919 osafamfd [2788:imm.cc:1751] << avd_saImmOiRtObjectDelete
Oct 10 14:07:38.712922 osafamfd [2788:csi.cc:1292] << avd_compcsi_delete
Oct 10 14:07:38.712925 osafamfd [2788:mbcsv_api.c:0773] >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
Oct 10 14:07:38.712928 osafamfd [2788:mbcsv_api.c:0803] TR svc_id:10, pwe_hdl:65537
Oct 10 14:07:38.712931 osafamfd [2788:mbcsv_util.c:0343] >> mbcsv_send_ckpt_data_to_all_peers
Oct 10 14:07:38.712934 osafamfd [2788:mbcsv_util.c:0387] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Oct 10 14:07:38.712936 osafamfd [2788:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 10, pwe_hdl: 65537
Oct 10 14:07:38.712939 osafamfd [2788:mbcsv_util.c:0399] TR calling encode callback
Oct 10 14:07:38.712942 osafamfd [2788:chkop.cc:0228] TR Async update
Oct 10 14:07:38.712945 osafamfd [2788:ckpt_enc.cc:0681] >> enc_siass: io_action '2'
Oct 10 14:07:38.712998 osafamfd [2788:ckpt_enc.cc:0704] << enc_siass
Oct 10 14:07:38.713001 osafamfd [2788:mbcsv_util.c:0438] TR send the encoded message to any other peer with same s/w version
Oct 10 14:07:38.713004 osafamfd [2788:mbcsv_util.c:0441] TR dispatching FSM for NCSMBCSV_SEND_ASYNC_UPDATE
Oct 10 14:07:38.713006 osafamfd [2788:mbcsv_act.c:0101] TR ASYNC update to be sent. role: 1, svc_id: 10, pwe_hdl: 65537
Oct 10 14:07:38.713009 osafamfd [2788:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg: sending to vdest:1
Oct 10 14:07:38.713012 osafamfd [2788:mbcsv_mds.c:0201] TR send type MDS_SENDTYPE_RED
Oct 10 14:07:38.713023 osafamfd [2788:mbcsv_mds.c:0244] << mbcsv_mds_send_msg: success
Oct 10 14:07:38.713027 osafamfd [2788:mbcsv_util.c:0492] << mbcsv_send_ckpt_data_to_all_peers
Oct 10 14:07:38.713030 osafamfd [2788:mbcsv_api.c:0868] << mbcsv_process_snd_ckpt_request: retval: 1
Oct 10 14:07:38.713033 osafamfd [2788:siass.cc:0496] >> avd_susi_delete: safSu=PL-4,safSg=NoRed,safApp=OpenSAF safSi=NoRed4,safApp=OpenSAF
Oct 10 14:09:23.708873 osafamfd [2802:main.cc:0500] >> initialize
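
In the trace above, the last line from pid 2788 is an entry into avd_susi_delete with no matching exit, and the next line comes from a new osafamfd process (pid 2802) about 105 seconds later. A small helper along these lines could flag such unfinished calls in a longer trace file. It is a sketch that assumes the line layout seen in this excerpt ("[pid:file:line] >> func" for entry and "<< func" for exit); open_calls is a hypothetical name, not an OpenSAF tool.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Returns, per pid, the functions that were entered (">>") but never left ("<<").
std::map<std::string, std::vector<std::string>> open_calls(const std::string& trace) {
  std::map<std::string, std::vector<std::string>> stacks;
  std::istringstream in(trace);
  std::string line;
  while (std::getline(in, line)) {
    const auto lb = line.find('[');
    if (lb == std::string::npos) continue;
    const auto colon = line.find(':', lb);  // first ':' after '[' ends the pid
    const auto rb = line.find(']', lb);
    if (colon == std::string::npos || rb == std::string::npos || colon > rb) continue;
    const std::string pid = line.substr(lb + 1, colon - lb - 1);

    std::istringstream rest(line.substr(rb + 1));
    std::string marker, func;
    if (!(rest >> marker >> func)) continue;
    if (func.back() == ':') func.pop_back();  // "avd_susi_delete:" -> "avd_susi_delete"

    if (marker == ">>") {
      stacks[pid].push_back(func);
    } else if (marker == "<<") {
      auto it = stacks.find(pid);
      // Exits whose entry lies outside the excerpt are ignored.
      if (it != stacks.end() && !it->second.empty() && it->second.back() == func) {
        it->second.pop_back();
      }
    }
  }
  return stacks;
}
```

Fed the excerpt above, the result for pid 2788 would contain only avd_susi_delete, consistent with the abort in frame 4 of the backtrace.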

Discussion

  • Anders Widell

    Anders Widell - 2017-02-28
    • Milestone: 5.2.FC --> 5.2.RC1
     
  • Minh Hon Chau

    Minh Hon Chau - 2017-02-28
    • status: unassigned --> assigned
    • assigned_to: Minh Hon Chau
     
  • Minh Hon Chau

    Minh Hon Chau - 2017-03-01

    Hi Srikanth,

    Could you please upload the syslog and trace files? I have been trying to reproduce the problem but have not seen it so far. I am using a normal 2N application with 2 SUs hosted on PLs.

    Thanks,
    Minh

     
  • Anders Widell

    Anders Widell - 2017-03-14
    • Milestone: 5.2.RC1 --> 5.2.RC2
     
  • Minh Hon Chau

    Minh Hon Chau - 2017-03-24

    I could not work out a scenario that reproduces this problem, and no log or trace is available, so I am going to mark this as not-reproducible.

     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.2.RC2 --> future
     
  • Mohan  Kanakam

    Mohan Kanakam - 2022-11-24

    Hi Guys,
    I tried to reproduce the issue on the latest opensaf (5.22.11 - 7089987e9f2d7e5b2f039c14dfb942d2830a27cc) but did not succeed.
    I have a setup of 2 controllers and 2 payloads with the headless feature enabled and 1 PBE with 100k objects.
    I stopped opensaf on one of the payloads, but the issue did not reproduce.
    So, can I close this ticket?
    Thanks
    Mohan

     
