Menu

#1759 amf: Command based HealthCheck invoked during termination phase (NPI comp).

4.6.2
fixed
Praveen
defect
amf
nd
default
major
2016-06-01
2016-04-14
Praveen
No

Attached are amfnd traces and a configuration to reproduce.

steps to reproduce.
1)Bring the configuration up on a single controller.
2)Command based health check for the component is invoked by AMFND periodically.
3)Lock the SU.
4)As a part of quiesced assignment, AMFND stops all the PM monitoring (if configured) or Command based health check and starts terminating the component by invoking terminating script.
5)During this termination phase, health check command completes and AMFND restarts the timer which should not be started. (AMFND has stopped already before starting termination of comp).
6)Before termination of comp completes(script has 2 seconds sleep), health check timer expires and amfnd invokes health check command.

From traces:
1) AMFND invokes health check command:
Apr 13 14:25:06.698503 osafamfnd [7387:chc.cc:0914] >> avnd_comp_hc_rec_tmr_exp: safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 - osafHealthCheck, sts: 0
Apr 13 14:25:06.698516 osafamfnd [7387:clc.cc:2757] >> avnd_comp_clc_cmd_execute: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1':CLC CLI command type:'AVND_COMP_CLC_CMD_TYPE_HC(6)'
Apr 13 14:25:06.698532 osafamfnd [7387:clc.cc:2868] T1 Component is NPI, 1
Apr 13 14:25:06.698547 osafamfnd [7387:clc.cc:2920] T1 CLC CLI script:'/opt/amf_demo/npi/pm/amf_demo_monitor.sh'
Apr 13 14:25:06.698557 osafamfnd [7387:clc.cc:2925] T1 CLC CLI command timeout: In nano secs:180000000000 In milli secs: 180000
Apr 13 14:25:06.698568 osafamfnd [7387:clc.cc:2929] T1 CLC CLI command env variable name = 'SA_AMF_COMPONENT_NAME': value ='safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Apr 13 14:25:06.698578 osafamfnd [7387:clc.cc:2929] T1 CLC CLI command env variable name = 'SA_AMF_COMPONENT_NAME': value ='safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Apr 13 14:25:06.699016 osafamfnd [7387:clc.cc:2961] T2 The CLC CLI command execution success

2)AMFND gets quiesced assignments as a part of lock operation. It stops health check timer and starts terminating comp:
Apr 13 14:25:07.397472 osafamfnd [7387:su.cc:0376] >> avnd_evt_avd_info_su_si_assign_evh: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Apr 13 14:25:07.397484 osafamfnd [7387:susm.cc:0189] >> avnd_su_siq_rec_buf: 'safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
........
Apr 13 14:25:07.397731 osafamfnd [7387:cpm.cc:0634] >> avnd_comp_pm_finalize: Comp 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Apr 13 14:25:07.397739 osafamfnd [7387:cpm.cc:0650] << avnd_comp_pm_finalize
Apr 13 14:25:07.397748 osafamfnd [7387:chc.cc:0761] >> find_hc_rec: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Apr 13 14:25:07.397758 osafamfnd [7387:tmr.cc:0126] TR health check timer stopped
Apr 13 14:25:07.397767 osafamfnd [7387:clc.cc:0854] T1 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1':Entering CLC FSM: presence state:'SA_AMF_PRESENCE_INSTANTIATED(3)', Event:'AVND_COMP_CLC_PRES_FSM_EV_TERM'
Apr 13 14:25:07.397776 osafamfnd [7387:clc.cc:1852] >> avnd_comp_clc_inst_term_hdler: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1': Terminate event in the Instantiated state
Apr 13 14:25:07.397797 osafamfnd [7387:clc.cc:1876] NO Terminate comann executing
Apr 13 14:25:07.397807 osafamfnd [7387:clc.cc:2757] >> avnd_comp_clc_cmd_execute: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1':CLC CLI command type:'AVND_COMP_CLC_CMD_TYPE_TERMINATE(2)'
Apr 13 14:25:07.397824 osafamfnd [7387:clc.cc:2868] T1 Component is NPI, 1
Apr 13 14:25:07.397836 osafamfnd [7387:clc.cc:2920] T1 CLC CLI script:'/opt/amf_demo/npi/pm/amf_comp_npi_term'
....
Apr 13 14:25:07.398409 osafamfnd [7387:comp.cc:2774] IN 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1' Presence State INSTANTIATED => TERMINATING

3) Health check command completes and AMFND starts timer again in termination phase:
Apr 13 14:25:07.720671 osafamfnd [7387:clc.cc:0314] >> avnd_evt_clc_resp_evh
Apr 13 14:25:07.720686 osafamfnd [7387:clc.cc:0325] TR 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1', command type:AVND_COMP_CLC_CMD_TYPE_HC(6)
Apr 13 14:25:07.720699 osafamfnd [7387:chc.cc:0761] >> find_hc_rec: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1'
Apr 13 14:25:07.720758 osafamfnd [7387:tmr.cc:0088] TR health check timer started
Apr 13 14:25:07.720801 osafamfnd [7387:clc.cc:0417] << avnd_evt_clc_resp_evh: 1

4)Health Check timer expires and command is invoked in termination phase:
Apr 13 14:25:08.022211 osafamfnd [7387:chc.cc:0260] >> avnd_evt_tmr_hc_evh
Apr 13 14:25:08.022225 osafamfnd [7387:chc.cc:0914] >> avnd_comp_hc_rec_tmr_exp: safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1 - osafHealthCheck, sts: 0
Apr 13 14:25:08.022263 osafamfnd [7387:clc.cc:2757] >> avnd_comp_clc_cmd_execute: 'safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1':CLC CLI command type:'AVND_COMP_CLC_CMD_TYPE_HC(6)'
Apr 13 14:25:08.022327 osafamfnd [7387:clc.cc:2868] T1 Component is NPI, 1
Apr 13 14:25:08.022347 osafamfnd [7387:clc.cc:2920] T1 CLC CLI script:'/opt/amf_demo/npi/pm/amf_demo_monitor.sh'
Apr 13 14:25:08.022359 osafamfnd [7387:clc.cc:2925] T1 CLC CLI command timeout: In nano secs:180000000000 In milli secs: 180000

1 Attachments

Related

Tickets: #1759
Wiki: ChangeLog-4.6.2

Discussion

  • Praveen

    Praveen - 2016-04-14
    • status: accepted --> review
     
  • Praveen

    Praveen - 2016-04-22
    • status: review --> fixed
     
  • Praveen

    Praveen - 2016-04-22

    changeset: 7524:6c724a0bce3e
    parent: 7520:4c44fe2abdee
    user: praveen.malviya@oracle.com
    date: Fri Apr 22 09:46:45 2016 +0530
    summary: amfnd: do not invoke health check command when npi comp is TERMINATING [#1759]

    changeset: 7523:6b88d64a8472
    branch: opensaf-5.0.x
    parent: 7518:325844b1a197
    user: praveen.malviya@oracle.com
    date: Fri Apr 22 09:46:37 2016 +0530
    summary: amfnd: do not invoke health check command when npi comp is TERMINATING [#1759]

    changeset: 7522:e2c01073de3f
    branch: opensaf-4.7.x
    parent: 7513:48ee7b1a9048
    user: praveen.malviya@oracle.com
    date: Fri Apr 22 09:46:16 2016 +0530
    summary: amfnd: do not invoke health check command when npi comp is TERMINATING [#1759]

    changeset: 7521:3317f6c45513
    branch: opensaf-4.6.x
    parent: 7512:5cfe8ef6e9a0
    user: praveen.malviya@oracle.com
    date: Fri Apr 22 09:45:51 2016 +0530
    summary: amfnd: do not invoke health check command when npi comp is TERMINATING [#1759]

     

    Related

    Tickets: #1759

    • Minh Hon Chau

      Minh Hon Chau - 2016-05-25

      Hi Praveen,

      The problem of this ticket has been seen again, which healthcheckCmd is following a terminateCmd. Can you upload the model that was used to reproduce this problem in this ticket, I would like to verify it in scenario that we are seeing the problem.

      Thanks,
      Minh

       
  • Praveen

    Praveen - 2016-05-25

    Hi Minh,
    Attached is the model. Is it the case when AMFND invokes the Health check Command and its gets the termination request and launches the termination scrpit also?
    In that case, both health checking command/script and termination script are executing simulantaneously. This situation needs to be handled by the application.

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-05-25

    Thanks. From application's log, it gets terminateCmd first then followed by healthCheckCmd. /Minh

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-05-30

    Hi Praveen

    uint32_t avnd_comp_hc_rec_tmr_exp(AVND_CB cb, AVND_COMP comp, AVND_COMP_HC_REC rec) {
    ...
    if (comp->is_hc_cmd_configured &&
    (strncmp((char
    )rec->key.key, hc_cmd_name, sizeof(rec->key.key)) == 0)) {
    rc = avnd_comp_clc_cmd_execute(cb, comp, AVND_COMP_CLC_CMD_TYPE_HC);
    if (rc == NCSCC_RC_SUCCESS) {
    rec->status = AVND_COMP_HC_STATUS_WAIT_FOR_RESP;
    rc = avnd_start_tmr(cb, &rec->tmr, AVND_TMR_HC, rec->period, rec->opq_hdl);
    }
    }
    ...
    }

    I saw your patch that does not let hc timer restarted after hc cmd response while comp has been terminating, but do you think the following case would happen:

    If hc timer expires in ncs timer thread, avnd_tmr_exp() will send timer_hc_event to amfnd mainthread. After that locking active SU will start npi comp TERMINATING sequence. Then now amfnd processes the timer_hc_event. Will app receive clc hc cmd request after it receives clc terminate cmd request?

    Thanks,
    Minh

     
  • Praveen

    Praveen - 2016-05-31

    Hi Minh,

    Yes that can happen but that is a very remote possiblity timing wise. Remote possibility because AMFND stops any running HC timer when it launches terminate command amd HC command is not invoked always after each timer expiry. This is in this sense different from AMF spec based health checks. After each timer expiry if command is in execution then Health Check will not be invoked and AMFND waits for the completion of script. When HC commands status comes to AMNFD and it sees that is already expired then it restarts timer instead of invking the HC command.
    So the fix provided in this ticket and the way HC work, it becomes a very remote case.
    Are you seeing invocation of HC command in the context of this case ?If possible please share amfnd traces.

    Thanks,
    Praveen

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-05-31

    Hi Praveen,

    This problem has actually being seen in scale-in scenario, which is a npi su is getting TERMINATE. I have tried many times to reproduce but I was not able to see it.
    Most of times I see is HC command response comes after timer expiry (which is already fixed by patch of #1759), or the timer is stopped when amfnd run component termination.
    What I can think for now is the case described above, it happens in very small timing window, the timer has expired in timer thread just after amfnd starts component termination, provided that hc command response also finishes just in time (or before) timer expiry.
    I don't think I'm lucky to see it, but it can happen and it's a bug, because component termination happens before timer expiry anyway. To fix it, I think we can do similiar check as patch of #1759 before run clc hc command in avnd_comp_hc_rec_tmr_exp().
    I will ask for amfnd trace if this problem happens again in scale-in scenario. But this case should be fixed anyway?

    Thanks,
    Minh

     

Log in to post a comment.