
#2436 amfnd: Buffered messages are unexpectedly deleted during SC Absence period

5.17.07
fixed
nobody
None
defect
amf
nd
major
False
2017-07-27
2017-04-24
No

Stop both SCs so that the cluster goes headless. Trigger an SU failover, so that the su_oper message is buffered and is supposed to be sent to the active amfd when an SC comes back. However, if the cluster stays headless for 3 minutes, which is exactly the MDS_AWAIT_ACTIVE_TMR_VAL timeout, amfnd receives another NCSMDS_DOWN. At that point amfnd deletes all pending messages, which makes headless recovery impossible.
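The sequence above can be modelled with a small sketch. This is not the amfnd code: `NodeDirectorModel`, `on_director_down`, `send_or_buffer` and `pending_after_two_downs` are hypothetical names chosen for illustration, and the queue is a plain `std::deque`. It only shows why clearing the buffer on every down event loses the su_oper message, and how ignoring a repeated down event while already headless would keep it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for amfnd state; not the real OpenSAF types.
struct NodeDirectorModel {
  bool director_up = true;
  std::deque<std::string> pending;  // messages buffered for amfd

  // Buffer a message while the director is unreachable.
  // Sending to a live director is out of scope for this sketch.
  void send_or_buffer(const std::string& msg) {
    if (!director_up) pending.push_back(msg);
  }

  // Handle a "director down" event.
  // ignore_repeat == false: every down event clears the buffer,
  // which mimics the behaviour reported in this ticket.
  // ignore_repeat == true: a down event received while already
  // headless is ignored, so the buffer survives.
  void on_director_down(bool ignore_repeat) {
    if (ignore_repeat && !director_up) return;
    director_up = false;
    pending.clear();
  }
};

// Replays the scenario: first down, buffer su_oper, second down.
// Returns how many buffered messages are left afterwards.
std::size_t pending_after_two_downs(bool ignore_repeat) {
  NodeDirectorModel m;
  m.on_director_down(ignore_repeat);  // both SCs stopped
  assert(!m.director_up);
  m.send_or_buffer("su_oper");        // SU failover while headless
  m.on_director_down(ignore_repeat);  // second down after the timeout
  return m.pending.size();
}
```

In this model, `pending_after_two_downs(false)` leaves nothing to resend when an SC returns, while `pending_after_two_downs(true)` keeps the one buffered message.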

Log excerpt:

Apr 18 16:49:09.749428 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0603] >> avnd_evt_mds_avd_dn_evh 
Apr 18 16:49:09.750094 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0618] WA AMF director unexpectedly crashed
Apr 18 16:49:09.750103 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0662] TR Delete all pending messages to be sent to AMFD

Apr 18 16:49:09.796138 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0756] NO avnd_di_oper_send() deferred as AMF director is offline(1), or sync is required(1)

Apr 18 16:49:09.797440 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0756] NO avnd_di_oper_send() deferred as AMF director is offline(1), or sync is required(1)

Apr 18 16:52:09.825457 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0603] >> avnd_evt_mds_avd_dn_evh 
Apr 18 16:52:09.825489 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0618] WA AMF director unexpectedly crashed
Apr 18 16:52:09.825495 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:0662] TR Delete all pending messages to be sent to AMFD
Apr 18 16:52:09.825498 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:1273] >> avnd_diq_rec_del 
Apr 18 16:52:09.825505 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:1290] << avnd_diq_rec_del 
Apr 18 16:52:09.825508 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:1273] >> avnd_diq_rec_del 
Apr 18 16:52:09.825512 osafamfnd [10775:10775:../../opensaf/src/amf/amfnd/di.cc:1290] << avnd_diq_rec_del 
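As a quick check that the second down event lines up with a 3 minute timeout, the gap between the two `avnd_evt_mds_avd_dn_evh` entries in the log can be computed from their timestamps. The helpers below (`to_micros`, `gap_micros`) are illustrative names, not part of OpenSAF, and they assume both timestamps fall on the same day and carry exactly six fractional digits, as in this log.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Convert "HH:MM:SS.uuuuuu" to microseconds since midnight.
// Assumes exactly six fractional digits, as in the osafamfnd trace.
std::int64_t to_micros(const std::string& t) {
  int h = 0, m = 0, s = 0, us = 0;
  int fields = std::sscanf(t.c_str(), "%d:%d:%d.%d", &h, &m, &s, &us);
  assert(fields == 4);
  (void)fields;  // silence unused warning when NDEBUG is defined
  return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

// Gap in microseconds between two same-day timestamps.
std::int64_t gap_micros(const std::string& first, const std::string& second) {
  return to_micros(second) - to_micros(first);
}
```

For the two entries at 16:49:09.749428 and 16:52:09.825457, `gap_micros` gives 180076029 microseconds, that is about 180.08 seconds, consistent with the 3 minute timeout described in this ticket.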

Related

Tickets: #2436
Wiki: ChangeLog-5.17.07

Discussion

  • Minh Hon Chau - 2017-04-26
    • status: assigned --> review
     
  • Gary Lee - 2017-05-04

    develop:

    commit 4cb4351920a16284ac3dfb40f055bab455e760dc
    Author: Minh Chau <minh.chau@dektech.com.au>
    Date: Wed Apr 26 15:02:48 2017 +1000

    amfnd: Ignore second NCSMDS_DOWN [#2436]
    
    If cluster goes into headless stage and wait up to 3 mins
    which is currently the timeout of MDS_AWAIT_ACTIVE_TMR_VAL,
    amfnd will receive another NCSMDS_DOWN, and then delete
    all buffered messages. As a result, the headless recovery
    is impossible because these buffered messages are deleted.
    

    release:

    commit ee0ae69f29bfd3672a4bfa3a55154d07948962ea
    Author: Minh Chau <minh.chau@dektech.com.au>
    Date: Wed Apr 26 15:02:48 2017 +1000

    amfnd: Ignore second NCSMDS_DOWN [#2436]
    
    If cluster goes into headless stage and wait up to 3 mins
    which is currently the timeout of MDS_AWAIT_ACTIVE_TMR_VAL,
    amfnd will receive another NCSMDS_DOWN, and then delete
    all buffered messages. As a result, the headless recovery
    is impossible because these buffered messages are deleted.
    
    Patch ignores the second NCSMDS_DOWN.
    

    changeset: 8790:c95a64cc4940
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Thu May 04 15:05:26 2017 +1000
    summary: amfnd: Ignore second NCSMDS_DOWN [#2436]

     


  • Minh Hon Chau - 2017-05-05
    • status: review --> fixed
    • assigned_to: Minh Hon Chau --> nobody
     
  • Anders Widell - 2017-07-01
    • Milestone: 5.17.06 --> 5.17.08
     
