Menu

#3040 Amfnd: reboot if mismatch msg id b/w amfd and amfnd

5.23.03
fixed
None
defect
amf
nd
minor
False
2023-03-27
2019-05-16
No

During SC failover, message received on ACTIVE AMFD can not be checked point to AMFD on STANDBY SC. But the AMFND still process the message ack for that message then it remove from queue.
STANDBY SC takes ACTIVE and mismatch message id b/w AMFD and AMFND on new ACTIVE. As consequence, clm track start can not invoked to update cluster member nodes if these node was rebooted.

Need to reboot recovery if received message id of amfd mismatch with sent message id of amfnd.

Related

Wiki: ChangeLog-5.19.07
Wiki: ChangeLog-5.23.03

Discussion

  • Thang Duc Nguyen

    snippest AMFD trace
    2019-05-08T09:16:19.140+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 411, msg type 8, from 2010f should be 410
    ...
    2019-05-08T09:19:03.137+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 445, msg type 8, from 2010f should be 410
    ...
    2019-05-08T09:22:00.409+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 472, msg type 5, from 2010f should be 410
    ...
    2019-05-08T09:23:25.348+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 518, msg type 5, from 2010f should be 410
    ...
    2019-05-08T09:50:14.478+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 720, msg type 8, from 2010f should be 410
    ...
    2019-05-08T10:06:19.624+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 836, msg type 8, from 2010f should be 410
    ...
    2019-05-08T10:22:24.479+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 946, msg type 8, from 2010f should be 410
    ...
    2019-05-08T10:28:02.188+02:00 SC-1 osafamfd[10055]: WA avd_msg_sanity_chk: invalid msg id 984, msg type 8, from 2010f should be 410

    Snippest of AMFND trace
    <143>1 2019-05-08T09:16:12.495634+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31197"] 10069:amf/amfnd/di.cc:386 >> avnd_evt_avd_ack_evh
    <143>1 2019-05-08T09:16:12.495638+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31198"] 10069:amf/amfnd/di.cc:1294 >> avnd_di_msg_ack_process: 410
    <143>1 2019-05-08T09:16:12.495643+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31199"] 10069:amf/amfnd/di.cc:1311 TR remove msg 410 from queue
    <143>1 2019-05-08T09:16:12.495647+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31200"] 10069:amf/amfnd/di.cc:1450 >> avnd_diq_rec_del
    <143>1 2019-05-08T09:16:12.495652+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31201"] 10069:amf/amfnd/di.cc:1466 << avnd_diq_rec_del
    <143>1 2019-05-08T09:16:12.495656+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31202"] 10069:amf/amfnd/di.cc:1317 << avnd_di_msg_ack_process:
    <143>1 2019-05-08T09:16:12.49566+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31203"] 10069:amf/amfnd/di.cc:395 << avnd_evt_avd_ack_evh
    ...
    <143>1 2019-05-08T09:16:18.942177+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31945"] 10069:amf/amfnd/verify.cc:56 >> avnd_evt_avd_verify_evh: Data Verify message received from newly ACTIVE AVD
    <143>1 2019-05-08T09:16:18.942181+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31946"] 10069:amf/amfnd/verify.cc:69 T1 AVD send ID count: 76
    <143>1 2019-05-08T09:16:18.942184+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31947"] 10069:amf/amfnd/verify.cc:70 T1 AVND receive ID count: 76
    <143>1 2019-05-08T09:16:18.942188+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31948"] 10069:amf/amfnd/di.cc:1182 >> avnd_di_ack_nack_msg_send: Receive id = 76
    <143>1 2019-05-08T09:16:18.942193+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31949"] 10069:amf/amfnd/di.cc:1199 T1 MsgId=417,ACK=1
    <143>1 2019-05-08T09:16:18.942197+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31950"] 10069:amf/amfnd/di.cc:1125 >> avnd_di_msg_send
    <143>1 2019-05-08T09:16:18.942201+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31951"] 10069:amf/amfnd/di.cc:1129 TR Msg type '10'
    <143>1 2019-05-08T09:16:18.942205+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31952"] 10069:amf/amfnd/di.cc:1136 T1 avnd_di_msg_send, Active AVD Adest: 564117399592176
    <143>1 2019-05-08T09:16:18.942209+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31953"] 10069:amf/amfnd/mds.cc:1577 >> avnd_mds_red_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.94223+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31954"] 10069:amf/amfnd/mds.cc:1615 << avnd_mds_red_send: rc '1'
    <143>1 2019-05-08T09:16:18.942239+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31955"] 10069:amf/amfnd/di.cc:1160 << avnd_di_msg_send: 1
    <143>1 2019-05-08T09:16:18.942243+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31956"] 10069:amf/amfnd/di.cc:1207 << avnd_di_ack_nack_msg_send: retval=1
    <143>1 2019-05-08T09:16:18.942247+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31957"] 10069:amf/amfnd/verify.cc:93 T1 AVD receive ID count: 409
    <143>1 2019-05-08T09:16:18.942253+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31958"] 10069:amf/amfnd/verify.cc:94 T1 AVND send ID count: 416
    <143>1 2019-05-08T09:16:18.942256+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31959"] 10069:amf/amfnd/di.cc:1578 >> avnd_diq_rec_send
    <143>1 2019-05-08T09:16:18.942264+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31960"] 10069:amf/amfnd/mds.cc:1486 >> avnd_mds_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.942384+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31961"] 10069:amf/amfnd/mds.cc:1545 << avnd_mds_send: 1
    <143>1 2019-05-08T09:16:18.94239+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31962"] 10069:amf/amfnd/di.cc:1600 << avnd_diq_rec_send: 1
    <143>1 2019-05-08T09:16:18.942394+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31963"] 10069:amf/amfnd/verify.cc:122 T1 AVND record 411 sent, upon fail-over
    <143>1 2019-05-08T09:16:18.942399+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31964"] 10069:amf/amfnd/di.cc:1578 >> avnd_diq_rec_send
    <143>1 2019-05-08T09:16:18.942404+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31965"] 10069:amf/amfnd/mds.cc:1486 >> avnd_mds_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.942518+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31966"] 10069:amf/amfnd/mds.cc:1545 << avnd_mds_send: 1
    <143>1 2019-05-08T09:16:18.942524+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31967"] 10069:amf/amfnd/di.cc:1600 << avnd_diq_rec_send: 1
    <143>1 2019-05-08T09:16:18.942528+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31968"] 10069:amf/amfnd/verify.cc:122 T1 AVND record 412 sent, upon fail-over
    <143>1 2019-05-08T09:16:18.942532+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31969"] 10069:amf/amfnd/di.cc:1578 >> avnd_diq_rec_send
    <143>1 2019-05-08T09:16:18.942536+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31970"] 10069:amf/amfnd/mds.cc:1486 >> avnd_mds_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.942651+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31971"] 10069:amf/amfnd/mds.cc:1545 << avnd_mds_send: 1
    <143>1 2019-05-08T09:16:18.942658+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31972"] 10069:amf/amfnd/di.cc:1600 << avnd_diq_rec_send: 1
    <143>1 2019-05-08T09:16:18.942662+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31973"] 10069:amf/amfnd/verify.cc:122 T1 AVND record 413 sent, upon fail-over
    <143>1 2019-05-08T09:16:18.942666+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31974"] 10069:amf/amfnd/di.cc:1578 >> avnd_diq_rec_send
    <143>1 2019-05-08T09:16:18.94267+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31975"] 10069:amf/amfnd/mds.cc:1486 >> avnd_mds_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.942788+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31976"] 10069:amf/amfnd/mds.cc:1545 << avnd_mds_send: 1
    <143>1 2019-05-08T09:16:18.942795+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31977"] 10069:amf/amfnd/di.cc:1600 << avnd_diq_rec_send: 1
    <143>1 2019-05-08T09:16:18.942799+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31978"] 10069:amf/amfnd/verify.cc:122 T1 AVND record 414 sent, upon fail-over
    <143>1 2019-05-08T09:16:18.942804+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31979"] 10069:amf/amfnd/di.cc:1578 >> avnd_diq_rec_send
    <143>1 2019-05-08T09:16:18.942808+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31980"] 10069:amf/amfnd/mds.cc:1486 >> avnd_mds_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.942926+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31981"] 10069:amf/amfnd/mds.cc:1545 << avnd_mds_send: 1
    <143>1 2019-05-08T09:16:18.942934+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31982"] 10069:amf/amfnd/di.cc:1600 << avnd_diq_rec_send: 1
    <143>1 2019-05-08T09:16:18.942937+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31983"] 10069:amf/amfnd/verify.cc:122 T1 AVND record 415 sent, upon fail-over
    <143>1 2019-05-08T09:16:18.942941+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31984"] 10069:amf/amfnd/di.cc:1578 >> avnd_diq_rec_send
    <143>1 2019-05-08T09:16:18.942949+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31985"] 10069:amf/amfnd/mds.cc:1486 >> avnd_mds_send: Msg type '1'
    <143>1 2019-05-08T09:16:18.943073+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31986"] 10069:amf/amfnd/mds.cc:1545 << avnd_mds_send: 1
    <143>1 2019-05-08T09:16:18.943077+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31987"] 10069:amf/amfnd/di.cc:1600 << avnd_diq_rec_send: 1
    <143>1 2019-05-08T09:16:18.943081+02:00 SC-1 osafamfnd 10069 osafamfnd [meta sequenceId="31988"] 10069:amf/amfnd/verify.cc:122 T1 AVND record 416 sent, upon fail-over****

     
  • Thang Duc Nguyen

    Trace of AMFD, AMFND attached

     
  • Thang Duc Nguyen

    • status: assigned --> review
     
  • Thang Duc Nguyen

    • summary: Amfnd: clm track start not called if mismatch received message id b/w amfd and amfnd when node failover --> Amfnd: reboot if mismatch msg id b/w amfd and amfnd
     
  • Thang Duc Nguyen

    • status: review --> fixed
     
  • Thang Duc Nguyen

    commit 2df0aa97a523bb547c065a1d612fe59ed4800401
    Author: thang.d.nguyen thang.d.nguyen@dektech.com.au
    Date: Tue May 21 10:46:48 2019 +0700

    amf: reboot to recover from lost msgs [#3040]
    
    During SC failover, message received on ACTIVE AMFD can not be
    checked point to AMFD on STANDBY SC. But the AMFND still process
    the message ack for that message then it remove from queue.
    STANDBY SC takes ACTIVE and mismatch message id b/w AMFD and
    AMFND on new ACTIVE. As consequence, clm track start can not
    invoked to update cluster member nodes if these node was rebooted.
    
    Reboot to recovery this issue.
    
     
  • Thang Duc Nguyen

    • status: fixed --> review
     
  • Thang Duc Nguyen

    Need another reboot to recover the system.

     
  • Thang Duc Nguyen

    Step to reproduce partial assigned
    - Apply the attached patch
    - Deploy demo app
    immcfg -f AppConfig-2N.xml
    - Unlock-in/unlock app to make active assignment on PL-4, standby assignment on PL-3
    amf-adm unlock-in safSu=SU4,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock-in safSu=SU3,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock-in safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock-in safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock safSu=SU4,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock safSu=SU3,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock safSu=SU2,safSg=AmfDemo,safApp=AmfDemo1
    amf-adm unlock safSu=SU1,safSg=AmfDemo,safApp=AmfDemo1
    - On PL-4, reboot it to make PL-3 take over active
    init 6
    - Partial assign happen. System unhealthy. System did not recover automatically.
    amf-state si | grep -i demo -A4
    safSi=AmfDemo,safApp=AmfDemo1
    saAmfSIAdminState=UNLOCKED(1)
    saAmfSIAssignmentState=PARTIALLYASSIGNED(3)

     

    Last edit: Thang Duc Nguyen 2023-01-18
  • Thang Duc Nguyen

    Hi @mohan-hasoln,
    Do you have any comment on this ?

    B.R/Thang

     
  • Mohan  Kanakam

    Mohan Kanakam - 2023-01-27

    Hi Thang Duc Nguyen ,
    I acked the patch.
    Thanks
    Mohan(https://www.gethighavailability.com/)

     
    👍
    1
  • Thang Duc Nguyen

    • status: review --> fixed
    • Milestone: 5.19.07 --> 5.22.06
     
  • Thang Duc Nguyen

    commit e4fdf7487e7a3fb01d7c79c6d604c435ee8c9652 (HEAD -> develop, origin/develop, ticket-3040)
    Author: thang.d.nguyen thang.d.nguyen@dektech.com.au
    Date: Wed Jan 18 09:35:52 2023 +0700

    amf: reboot to recover from mismatch msg id [#3040]
    
    There is a case that mismatch messge id between
    AMFND and active AMFD. It causes the system unhealthy
    (partial assigned).
    Reboot the node that message mismatched to recover
    the system.
    
     
  • Gary Lee

    Gary Lee - 2023-03-27
    • Milestone: 5.22.06 --> 5.23.03
     

Log in to post a comment.