
#3243 imm: amfd crash when multi partitioned clusters rejoin

5.21.03
fixed
None
defect
imm
-
minor
False
2020-12-16
2020-12-14
Thuan Tran
No

When multiple partitioned clusters rejoin, the quick reboot of SC-7 does not take effect immediately, which impacts SC-1:

2020-12-02 14:50:03.453 SC-7 osafrded[156]: NO Got peer info response from node 0x2040f with role ACTIVE
2020-12-02 14:50:03.454 SC-7 osafrded[156]: Quick local node rebooting, Reason: Split-brain detected
2020-12-02 14:50:03.582 SC-7 opensaf_reboot: Do quick local node reboot
...
2020-12-02 14:50:09.838 SC-7 osafrded[156]: NO Got peer info response from node 0x2010f with role ACTIVE
2020-12-02 14:50:09.838 SC-7 osafrded[156]: Quick local node rebooting, Reason: Split-brain detected

This makes SC-1 reboot, and SC-3 then becomes Active while its IMMND has just restarted to sync with the Coordinator on SC-1.
Consequently, IMM loads from imm.xml (which is out of date), which leads to an unexpected AMFD crash.

2020-12-02 14:50:04.485 SC-3 osafimmnd[195]: NO IMMD(2010f) service is UP ... ScAbsenseAllowed?:900 introduced?:2
2020-12-02 14:50:04.485 SC-3 osafimmnd[195]: NO Re-introduce-me highestProcessed:4184 highestReceived:4184 ex_immd_node_id=2010f
2020-12-02 14:50:04.489 SC-3 osafimmnd[195]: WA Restart to sync with Coord! Exit
...
2020-12-02 14:50:09.980 SC-3 osafamfnd[260]: WA AMF director unexpectedly crashed
2020-12-02 14:50:18.079 SC-3 osafimmnd[718]: NO IMMD(2030f) service is UP ... ScAbsenseAllowed?:0 introduced?:0
2020-12-02 14:50:22.070 SC-3 osafimmloadd: NO Load starting
2020-12-02 14:50:22.070 SC-3 osafimmloadd: NO IMMSV_PBE_FILE is defined (imm.db) check it for existence and SaImmRepositoryInitModeT
2020-12-02 14:50:22.070 SC-3 osafimmloadd: IN File '/etc/opensaf/imm.db' is not accessible for read/write, cause:No such file or directory
2020-12-02 14:50:22.070 SC-3 osafimmloadd: WA Could not open repository:imm.db
2020-12-02 14:50:22.070 SC-3 osafimmloadd: NO ***** Loading from XML file imm.xml at /etc/opensaf *****
...
2020-12-02 14:50:22.607 SC-3 osafamfd[247]: src/amf/amfd/siass.cc:1233: avd_susi_recreate: Assertion 'su' failed.
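
The delay visible in the SC-7 excerpt above can be measured with a short script. This is only a sketch for reading the logs, not part of OpenSAF: the function names `parse` and `seconds_alive_after_reboot` are made up for illustration, and the sample lines are copied from the excerpt in this ticket.

```python
from datetime import datetime

# Lines copied from the SC-7 excerpt in this ticket.
LOG = """\
2020-12-02 14:50:03.454 SC-7 osafrded[156]: Quick local node rebooting, Reason: Split-brain detected
2020-12-02 14:50:03.582 SC-7 opensaf_reboot: Do quick local node reboot
2020-12-02 14:50:09.838 SC-7 osafrded[156]: NO Got peer info response from node 0x2010f with role ACTIVE
2020-12-02 14:50:09.838 SC-7 osafrded[156]: Quick local node rebooting, Reason: Split-brain detected
"""


def parse(line):
    """Split one log line into (timestamp, node name, rest of the message)."""
    date, time, node, message = line.split(" ", 3)
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")
    return stamp, node, message


def seconds_alive_after_reboot(log_text, node):
    """Seconds between the first reboot request and the last line logged by `node`.

    Returns None if `node` never logged a reboot request.
    """
    requested = None
    last_seen = None
    for line in log_text.splitlines():
        if not line.strip() or line.strip() == "...":
            continue  # skip blank lines and elision markers
        stamp, name, message = parse(line)
        if name != node:
            continue
        if requested is None and "Do quick local node reboot" in message:
            requested = stamp
        last_seen = stamp
    if requested is None:
        return None
    return (last_seen - requested).total_seconds()


delay = seconds_alive_after_reboot(LOG, "SC-7")
print(f"SC-7 kept logging for {delay:.3f} s after the reboot request")
```

For the excerpt above this reports a gap of roughly six seconds, during which RDE on SC-7 was still answering peers and detecting split-brain again.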

Related

Wiki: ChangeLog-5.21.03

Discussion

  • Thuan Tran

    Thuan Tran - 2020-12-14
    • status: assigned --> review
     
  • Thuan Tran

    Thuan Tran - 2020-12-16
    • status: review --> fixed
     
  • Thuan Tran

    Thuan Tran - 2020-12-16

    commit 109cb75c68399af613f4c5b9684e5c0d97222483
    Author: thuan.tran thuan.tran@dektech.com.au
    Date: Tue Dec 8 10:31:17 2020 +0700

    imm: fix amfd crash when multi partitioned clusters rejoin [#3243]
    
    - Quick reboot is sometimes not quick cause RDE continue cause
    split-brain detection for another SC. Need kill director services
    to avoid impact other SCs.
    
    - Active IMMD pause itself if see another active IMMD. Node will
    reboot by RDE or split-brain timer of local IMMND.
    
    - Improve log messages to avoid confusion about intro/re-intro
    accept or just epoch update.
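
    The second point of the commit message (an active IMMD pauses itself when it sees another active IMMD) can be pictured with the toy model below. It is a hedged sketch of the idea only, not the actual OpenSAF code: the `Director` class, its methods, and the `Role` enum are hypothetical, and the node IDs are reused from the logs purely as sample values.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "ACTIVE"
    STANDBY = "STANDBY"


@dataclass
class Director:
    """Hypothetical stand-in for an IMMD instance; not an OpenSAF type."""

    node_id: int
    role: Role
    paused: bool = False

    def on_peer_up(self, peer_node_id, peer_role):
        """Pause this director if it is active and another active peer appears."""
        if (
            self.role is Role.ACTIVE
            and peer_role is Role.ACTIVE
            and peer_node_id != self.node_id
        ):
            self.paused = True
        return self.paused

    def handle_request(self, request):
        """Ignore requests while paused; the node is expected to be rebooted externally."""
        if self.paused:
            return None
        return f"handled {request}"


immd = Director(node_id=0x2030F, role=Role.ACTIVE)
before = immd.handle_request("intro")   # served normally
immd.on_peer_up(0x2010F, Role.ACTIVE)   # a second active director shows up
after = immd.handle_request("intro")    # ignored while paused
```

    In this model the paused director never resumes on its own, which matches the commit's statement that the node is then rebooted by RDE or by the split-brain timer of the local IMMND.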
    
     
