
#2971 amf: standby amfd crash during failover to become active

5.19.01
fixed
Gary Lee
None
defect
amf
d
major
False
2019-01-09
2018-11-23
Thuan Tran
No

PL-9 was deleted from the cluster, but the standby amfd somehow still kept the node.
When a failover then happened, the standby amfd crashed as follows:

Nov 20 04:09:14 SC-2 osafamfd[5079]: NO FAILOVER StandBy --> Active
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO Node 'SC-1' left the cluster
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO FAILOVER StandBy --> Active DONE!
Nov 20 04:09:14 SC-2 osafamfd[5079]: NO Node 'PL-9' left the cluster
Nov 20 04:09:14 SC-2 osafamfd[5079]: src/amf/amfd/sgproc.cc:2187: avd_node_down_mw_susi_failover: Assertion 'avnd->list_of_ncs_su.empty() != true' failed.

The root cause is the timing of the amfnd down event on SC-2 versus the node checkpoint from SC-1:

<143>1 2018-11-24T14:43:17.870243+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7238"] 261:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x563eacfe3b50
<143>1 2018-11-24T14:43:17.870254+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7239"] 261:amf/amfd/ndfsm.cc:853 << avd_mds_avnd_down_evh 

<143>1 2018-11-24T14:43:17.874433+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22818"] 285:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x5601d0d9cb90
<143>1 2018-11-24T14:43:17.874439+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22819"] 285:amf/amfd/ndproc.cc:1235 >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
<143>1 2018-11-24T14:43:17.874443+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22820"] 285:amf/amfd/ndfsm.cc:1149 >> avd_node_mark_absent 

<141>1 2018-11-24T14:43:17.88228+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22908"] 285:amf/amfd/ndfsm.cc:1154 NO Node 'PL-5' left the cluster
<143>1 2018-11-24T14:43:17.882284+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22909"] 285:mbc/mbcsv_api.c:798 >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified

<143>1 2018-11-24T14:43:17.882637+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22943"] 285:amf/amfd/ndfsm.cc:1168 << avd_node_mark_absent 

<143>1 2018-11-24T14:43:17.900529+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7564"] 261:amf/amfd/ckpt_updt.cc:49 >> avd_ckpt_node: update - 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
<143>1 2018-11-24T14:43:17.900575+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7577"] 261:amf/amfd/ckpt_updt.cc:78 << avd_ckpt_node: 1

<143>1 2018-11-24T14:43:39.417927+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8716"] 261:amf/amfd/node.cc:500 >> node_ccb_completed_delete_hdlr: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
<143>1 2018-11-24T14:43:39.417932+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8717"] 261:amf/amfd/imm.cc:2306 TR Node 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' is still cluster member
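The sequence in the traces can be modelled in a few lines. This is a simplified sketch, not OpenSAF code: `NodeRecord`, `apply_checkpoint`, `standby_still_member` and the `carries_member` switch are hypothetical names. It assumes that the node update received on SC-2 did not carry the active's change to `node_info.member`, so the standby kept treating the node as a cluster member.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for amfd's per-node state.
struct NodeRecord {
  std::string name;
  bool member;   // models node_info.member
  bool present;  // stands for any other checkpointed field
};

// Hypothetical checkpoint update applied on the standby.
// carries_member == false models the assumed behaviour before the fix.
void apply_checkpoint(NodeRecord& standby, const NodeRecord& active,
                      bool carries_member) {
  assert(standby.name == active.name);
  standby.present = active.present;
  if (carries_member) {
    standby.member = active.member;
  }
}

// Replays the traced order of events and reports whether the standby
// still believes the node is a cluster member afterwards.
bool standby_still_member(bool carries_member) {
  NodeRecord active{"PL-5", true, true};
  NodeRecord standby = active;

  // Active (SC-1): node failover marks the node absent.
  active.member = false;
  active.present = false;

  // Standby (SC-2): receives the node update from the active.
  apply_checkpoint(standby, active, carries_member);
  return standby.member;
}
```

In this model, `standby_still_member(false)` leaves the flag set on the standby, which matches the later "is still cluster member" trace on SC-2; `standby_still_member(true)` clears it.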

Related

Wiki: ChangeLog-5.19.01

Discussion

  • Thuan Tran

    Thuan Tran - 2018-11-23
    • status: unassigned --> review
     
  • Thuan Tran

    Thuan Tran - 2018-11-23

    Full backtrace of the core dump

     
  • Gary Lee

    Gary Lee - 2018-11-24

    This might help?

     
  • Gary Lee

    Gary Lee - 2018-11-24
    • status: review --> accepted
    • assigned_to: Thuan --> Gary Lee
     
  • Thuan Tran

    Thuan Tran - 2018-11-26
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,5 +1,4 @@
     PL-9 was deleted from cluster, but somehow standby amfd still keep the node.
    -The most possible reason is that standby amfd miss node delete apply callback by somehow.
     Then when failover happen, standby amfd crash as following:
     ~~~
     Nov 20 04:09:14 SC-2 osafamfd[5079]: NO FAILOVER StandBy --> Active
    @@ -8,3 +7,23 @@
     Nov 20 04:09:14 SC-2 osafamfd[5079]: NO Node 'PL-9' left the cluster
     Nov 20 04:09:14 SC-2 osafamfd[5079]: src/amf/amfd/sgproc.cc:2187: avd_node_down_mw_susi_failover: Assertion 'avnd->list_of_ncs_su.empty() != true' failed.
     ~~~
    +The root cause is amfnd down on SC-2 vs checkpoint from SC-1
    +~~~
    +<143>1 2018-11-24T14:43:17.870243+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7238"] 261:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x563eacfe3b50
    +<143>1 2018-11-24T14:43:17.870254+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7239"] 261:amf/amfd/ndfsm.cc:853 << avd_mds_avnd_down_evh 
    +
    +<143>1 2018-11-24T14:43:17.874433+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22818"] 285:amf/amfd/ndfsm.cc:779 >> avd_mds_avnd_down_evh: 2050f, 0x5601d0d9cb90
    +<143>1 2018-11-24T14:43:17.874439+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22819"] 285:amf/amfd/ndproc.cc:1235 >> avd_node_failover: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
    +<143>1 2018-11-24T14:43:17.874443+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22820"] 285:amf/amfd/ndfsm.cc:1149 >> avd_node_mark_absent 
    +
    +<141>1 2018-11-24T14:43:17.88228+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22908"] 285:amf/amfd/ndfsm.cc:1154 NO Node 'PL-5' left the cluster
    +<143>1 2018-11-24T14:43:17.882284+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22909"] 285:mbc/mbcsv_api.c:798 >> mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers, as per the send-type specified
    +
    +<143>1 2018-11-24T14:43:17.882637+07:00 SC-1 osafamfd 285 osafamfd [meta sequenceId="22943"] 285:amf/amfd/ndfsm.cc:1168 << avd_node_mark_absent 
    +
    +<143>1 2018-11-24T14:43:17.900529+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7564"] 261:amf/amfd/ckpt_updt.cc:49 >> avd_ckpt_node: update - 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
    +<143>1 2018-11-24T14:43:17.900575+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="7577"] 261:amf/amfd/ckpt_updt.cc:78 << avd_ckpt_node: 1
    +
    +<143>1 2018-11-24T14:43:39.417927+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8716"] 261:amf/amfd/node.cc:500 >> node_ccb_completed_delete_hdlr: 'safAmfNode=PL-5,safAmfCluster=myAmfCluster'
    +<143>1 2018-11-24T14:43:39.417932+07:00 SC-2 osafamfd 261 osafamfd [meta sequenceId="8717"] 261:amf/amfd/imm.cc:2306 TR Node 'safAmfNode=PL-5,safAmfCluster=myAmfCluster' is still cluster member
    +~~~
    
     
  • Gary Lee

    Gary Lee - 2018-11-26
    • status: accepted --> review
     
  • Gary Lee

    Gary Lee - 2018-11-28
    • status: review --> fixed
     
  • Gary Lee

    Gary Lee - 2018-11-28

    develop:

    commit 1a6954900477f4eaddda768b895018d0f71dbeb8
    Author: Gary Lee <gary.lee@dektech.com.au>
    Date: Wed Nov 28 22:37:26 2018 +1100

    amfd: set userData [#2971]
    
    Depending on timing, it's possible for node_info.member to be set
    after this ccb callback. We should populate userData anyway, in case
    the active validates this callback and then a SC failover to the
    standby occurs.
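    The effect described in this commit message can be illustrated with a small model. It is a sketch under stated assumptions, not the amfd implementation: `Node`, `CcbOp`, `completed_delete`, `apply_delete` and `standby_keeps_node` are hypothetical names, and the sketch assumes that the apply step only deletes a node recorded in `user_data` by the completed step, and that the standby cannot veto a change the active has already validated.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model of a node and a pending delete operation.
struct Node {
  std::string dn;
  bool member;  // models node_info.member
};

struct CcbOp {
  std::string dn;
  const Node* user_data = nullptr;  // models opdata->userData
};

// Models the "completed" (validation) step for a node delete.
// Returns false when validation reports "is still cluster member".
bool completed_delete(const Node& node, CcbOp& op, bool always_set_user_data) {
  if (always_set_user_data) {
    op.user_data = &node;  // assumed behaviour after the fix
  }
  if (node.member) {
    return false;  // validation error; old behaviour leaves user_data unset
  }
  op.user_data = &node;
  return true;
}

// Models the "apply" step: deletes only what the completed step recorded.
void apply_delete(std::map<std::string, Node>& db, const CcbOp& op) {
  if (op.user_data == nullptr) {
    return;  // nothing recorded, so the node silently stays
  }
  assert(db.count(op.dn) == 1);
  db.erase(op.dn);
}

// Replays a delete on a standby whose member flag is stale (still true).
bool standby_keeps_node(bool always_set_user_data) {
  std::map<std::string, Node> db;
  db.emplace("PL-5", Node{"PL-5", true});

  CcbOp op;
  op.dn = "PL-5";
  // The standby's own validation result is ignored in this model,
  // because the active has already accepted the change.
  (void)completed_delete(db.at("PL-5"), op, always_set_user_data);
  apply_delete(db, op);
  return db.count("PL-5") == 1;
}
```

    With `always_set_user_data` set to false the modelled standby keeps the deleted node, which is the state that precedes the assertion failure in the description; with it set to true the node is removed.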
    

    commit 64f11f19b493115df62c6723af438cd355794456
    Author: Gary Lee <gary.lee@dektech.com.au>
    Date: Wed Nov 28 22:37:26 2018 +1100

    amfd: checkpoint node state to standby [#2971]
    
    we need to checkpoint change to node_info.member to the
    standby
    
     
