
#3323 imm: PL sync failed after reconnected with SC

future
unassigned
nobody
None
defect
imm
-
minor
False
2026-02-27
2022-10-05
No

Active SC1 and PL4 suddenly lost connection (possibly for environmental reasons). They re-established contact, but PL4 failed to sync because PL4 did not update the active SC1 information and discarded messages from IMMD on SC1.
PL4 sync failed log
2022-09-22 04:07:05.230 DEBUG: Syncing node PL-4 (timeout=120)
2022-09-22 04:08:06.325 WARNING: waiting more than 60 sec for node PL-4 to sync

PL4 discard message from SC1 log
2022-09-22 04:07:08.406 PL-4 osafimmnd[354]: WA DISCARD message from IMMD 2010f as ACT:0 SBY:2020f
2022-09-22 04:07:09.013 PL-4 osafimmnd[354]: message repeated 243 times: [ WA DISCARD message from IMMD 2010f as ACT:0 SBY:2020f]
Steps to reproduce:
1. Start SCs and PLs.
2. Block traffic between SC1 and PL4 (make sure to block traffic after IMM state: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT).
3. Unblock traffic between SC1 and PL4.
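The block/unblock steps can be sketched with iptables, as later described in the discussion below (a minimal sketch; the peer address is a placeholder and the helper names are illustrative, not part of OpenSAF; set DRY_RUN=1 to only print the commands instead of running them, since the real rules need root):

```shell
#!/bin/sh
# Sketch of steps 2 and 3: drop and restore traffic from the peer node.
# PEER address and function names are illustrative; DRY_RUN=1 prints the
# commands instead of executing them (the real commands require root).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 2: drop all incoming traffic from the peer node.
block_peer() {
    run iptables -I INPUT -s "$1" -j DROP
}

# Step 3: remove the same DROP rule again (equivalent to deleting it by
# rule number, but safer because it names the rule explicitly).
unblock_peer() {
    run iptables -D INPUT -s "$1" -j DROP
}
```

Deleting by rule specification (`-D INPUT -s … -j DROP`) rather than by rule number avoids removing the wrong rule if other rules were inserted in the meantime.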

Discussion

  • Son Tran Ngoc

    Son Tran Ngoc - 2022-10-05
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,9 +1,9 @@
     Active SC1 and PL4 lost connect suddenly ( may be due to the environment reason),they re-established contact but PL4 sync failed due to PL4 did not update active SC1 information and discard message from IMMD SC1
    -PL4 
    +PL4 sync failed log
     2022-09-22 04:07:05.230 DEBUG: Syncing node PL-4 (timeout=120)
     2022-09-22 04:08:06.325 WARNING: waiting more than 60 sec for node PL-4 to sync
    
    -PL4 discard message from SC1
    +PL4 discard message from SC1 log
     2022-09-22 04:07:08.406 PL-4 osafimmnd[354]: WA DISCARD message from IMMD 2010f as ACT:0 SBY:2020f
     2022-09-22 04:07:09.013 PL-4 osafimmnd[354]: message repeated 243 times: [ WA DISCARD message from IMMD 2010f as ACT:0 SBY:2020f]
     step to reproduce.
    
     
Mohan Kanakam

    Mohan Kanakam - 2022-10-07

    Hi Son Tran Ngoc,
I am trying to reproduce the issue using iptables by removing and adding IP rules, but I am not able to reproduce it.
I tried this iptables rule to block the traffic:
iptables -I INPUT -s 192.168.56.122 -j DROP
I tried this iptables rule to allow the traffic again:
iptables --delete INPUT 1

    syslog of pl-4:
    Oct 7 12:39:47 pl3-VirtualBox osafimmnd[23011]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
    Oct 7 12:39:50 pl3-VirtualBox osafdtmd[22985]: ER recv() from node 0x2010f failed, errno=110
    Oct 7 12:39:50 pl3-VirtualBox osafdtmd[22985]: NO Lost contact with 'SC-1'
    Oct 7 12:39:50 pl3-VirtualBox osafimmnd[23011]: ER No IMMD service => cluster restart, exiting
    Oct 7 12:39:50 pl3-VirtualBox opensafd[22972]: ER Failed #012 DESC:IMMND
    Oct 7 12:39:50 pl3-VirtualBox opensafd[22972]: ER Going for recovery
    Oct 7 12:39:50 pl3-VirtualBox opensafd[22972]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-immnd attempt #1
    Oct 7 12:39:50 pl3-VirtualBox osafdtmd[22985]: NO Established contact with 'SC-1'

Can you please let me know how to block traffic between SC1 and PL4 at IMM state IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT?
Thanks,
Mohan

     
  • Son Tran Ngoc

    Son Tran Ngoc - 2022-10-12

    Hi Mohan Kanakam,
These are the detailed steps:
1. Start SCs.
2. Start PLs.
3. Enable network delay between SC1 and PL4 by running these commands on PL4:
tc qdisc add dev eth0 root handle 1: htb default 0
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit prio 2
tc filter add dev eth0 parent 1: protocol ip handle 800::1 prio 1 u32 flowid 1:1 match ip dst 10.0.3.101
tc qdisc add dev eth0 parent 1:1 handle 100: netem delay 1000ms
4. Block traffic between SC1 and PL4:
- Wait for IMM state IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT by running a command on PL4, e.g.:
timeout 120 tail -n 0 -F /var/log/syslog | (grep -m 1 "SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT"; pkill -P $$ timeout)
- Then block traffic between SC1 and PL4.
5. Disable network delay between SC1 and PL4 by running this command on PL4:
tc filter del dev eth0 parent 1: prio 1 handle 800::1 u32
6. Unblock traffic between SC1 and PL4.
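The "wait for IMM state" part of step 4 can be wrapped into a small reusable helper (a sketch based on the one-liner above; the function name and parameters are illustrative, not an OpenSAF interface):

```shell
#!/bin/sh
# Sketch of step 4's wait: follow a log file until a given state-transition
# line appears, or give up after a timeout. The /var/log/syslog default and
# the tail/grep/pkill pattern come from the command in the steps above.

wait_for_state() {
    pattern=$1
    log=${2:-/var/log/syslog}
    limit=${3:-120}
    # tail -n 0 -F follows only new lines of the log, grep -m 1 prints the
    # first matching line and exits, and pkill then stops the still-running
    # timeout/tail so the pipeline returns as soon as the state is seen.
    timeout "$limit" tail -n 0 -F "$log" | \
        { grep -m 1 "$pattern"; pkill -P $$ timeout; }
}
```

For example, `wait_for_state "SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT"` blocks until the transition is logged (or 120 s pass), which is the moment at which the traffic between SC1 and PL4 should be blocked.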

Note: SC1 and PL4 sync very quickly, so the command waiting for IMM state IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT does not work on its own; that is why we need step 3 (enable network delay between SC1 and PL4).
I hope this helps you reproduce the case.
    Regards,
    Tran Ngoc Son

     

    Last edit: Son Tran Ngoc 2022-10-12
  • Son Tran Ngoc

    Son Tran Ngoc - 2022-10-18
    • status: assigned --> review
     
Mohan Kanakam

    Mohan Kanakam - 2022-11-17

    Hi Son Tran Ngoc,
    Thanks for the steps.
I tried to reproduce the issue, but I did not succeed.
These are the steps I performed to reproduce the issue:
1) I started SC-1, SC-2 and PL-6 with 100k objects.
2) I enabled the network delay between SC-1 and PL-6 by running the following commands on PL-6:
tc qdisc add dev enp0s9 root handle 1: htb default 0
tc class add dev enp0s9 parent 1: classid 1:1 htb rate 100mbit prio 2
tc filter add dev enp0s9 parent 1: protocol ip handle 800::1 prio 1 u32 flowid 1:1 match ip dst 192.168.56.122
tc qdisc add dev enp0s9 parent 1:1 handle 100: netem delay 1000ms
3) I restarted PL-6 and ran the following command on PL-6:
timeout 180 tail -n 0 -F /var/log/syslog | (grep -m 1 "SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT"; pkill -P $$ timeout) ;
4) I disabled the network delay between SC-1 and PL-6 by running the following command on PL-6:
tc filter del dev enp0s9 parent 1: prio 1 handle 800::1 u32

    SC-1 log:
    Nov 17 18:46:23 vem1-VirtualBox osafimmd[28119]: NO MDS event from svc_id 25 (change:3, dest:569611447702886)
    Nov 17 18:46:25 vem1-VirtualBox osafimmd[28119]: NO Node 2060f request sync sync-pid:5478 epoch:0
    Nov 17 18:46:25 vem1-VirtualBox osafimmnd[28131]: NO Announce sync, epoch:10
    Nov 17 18:46:25 vem1-VirtualBox osafimmnd[28131]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
    Nov 17 18:46:25 vem1-VirtualBox osafimmnd[28131]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
    Nov 17 18:46:25 vem1-VirtualBox osafimmd[28119]: NO Successfully announced sync. New ruling epoch:10
    Nov 17 18:46:25 vem1-VirtualBox osafimmloadd: logtrace: trace enabled to file 'osafimmnd', mask=0x0
    Nov 17 18:46:25 vem1-VirtualBox osafimmloadd: NO Sync starting
    Nov 17 18:46:30 vem1-VirtualBox dhclient[884]: DHCPREQUEST of 192.168.56.122 on enp0s9 to 192.168.56.100 port 67 (xid=0x3b4cc293)
    Nov 17 18:46:30 vem1-VirtualBox dhclient[884]: DHCPACK of 192.168.56.122 from 192.168.56.100
    Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3675] address 192.168.56.122
    Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3678] plen 24 (255.255.255.0)
    Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3681] server identifier 192.168.56.100
    Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3684] lease time 600
    Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3686] dhcp4 (enp0s9): state changed bound -> bound
    Nov 17 18:46:30 vem1-VirtualBox dbus[707]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
    Nov 17 18:46:30 vem1-VirtualBox systemd[1]: Starting Network Manager Script Dispatcher Service...
    Nov 17 18:46:30 vem1-VirtualBox dbus[707]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
    Nov 17 18:46:30 vem1-VirtualBox systemd[1]: Started Network Manager Script Dispatcher Service.
    Nov 17 18:46:30 vem1-VirtualBox dhclient[884]: bound to 192.168.56.122 -- renewal in 289 seconds.
    Nov 17 18:46:30 vem1-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: new request (1 scripts)
    Nov 17 18:46:30 vem1-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: start running ordered scripts...
    Nov 17 18:48:15 vem1-VirtualBox osafimmloadd: IN Synced 100627 objects in total
Nov 17 18:48:15 vem1-VirtualBox osafimmnd[28131]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 19360

    PL-6 log:
    

    Nov 17 18:46:25 pl3-VirtualBox osafimmnd[5478]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
    Nov 17 18:47:51 pl3-VirtualBox dhclient[849]: DHCPREQUEST of 192.168.56.107 on enp0s9 to 192.168.56.100 port 67 (xid=0x3015643a)
    Nov 17 18:47:51 pl3-VirtualBox dhclient[849]: DHCPACK of 192.168.56.107 from 192.168.56.100
    Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6233] address 192.168.56.107
    Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6237] plen 24 (255.255.255.0)
    Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6240] server identifier 192.168.56.100
    Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6242] lease time 600
    Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6245] dhcp4 (enp0s9): state changed bound -> bound
    Nov 17 18:47:51 pl3-VirtualBox dbus[635]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
    Nov 17 18:47:51 pl3-VirtualBox systemd[1]: Starting Network Manager Script Dispatcher Service...
    Nov 17 18:47:51 pl3-VirtualBox dbus[635]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
    Nov 17 18:47:51 pl3-VirtualBox dhclient[849]: bound to 192.168.56.107 -- renewal in 260 seconds.
    Nov 17 18:47:51 pl3-VirtualBox systemd[1]: Started Network Manager Script Dispatcher Service.
    Nov 17 18:47:51 pl3-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: new request (1 scripts)
    Nov 17 18:47:51 pl3-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: start running ordered scripts...
    Nov 17 18:48:16 pl3-VirtualBox osafimmnd[5478]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2846
Nov 17 18:48:16 pl3-VirtualBox osafimmnd[5478]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE

    Thanks
    Mohan

     
  • Gary Lee

    Gary Lee - 2022-11-18
    • Milestone: 5.22.11 --> 5.23.03
     
  • Son Tran Ngoc

    Son Tran Ngoc - 2022-11-21

    Hi Mohan Kanakam,
I see you didn't drop the connection between the SC and PL.
Please try again to reproduce it.
    Regards,
    Tran Ngoc Son

     
  • Thang Duc Nguyen

Hi Son,
Another scenario also does not work, e.g. swapping the OpenSAF SI (2N) between steps 2 and 3.
I think the current issue is one of the split-brain use cases, and a solution that could be considered is to reboot the PL.
B.R/Thang

     
  • Thang Duc Nguyen

    • status: review --> unassigned
    • assigned_to: Son Tran Ngoc --> nobody
     
  • Gary Lee

    Gary Lee - 2023-03-27
    • Milestone: 5.23.03 --> 5.23.07
     
  • Gary Lee

    Gary Lee - 2023-07-30
    • Milestone: 5.23.07 --> 5.23.12
     
  • Gary Lee

    Gary Lee - 2024-02-27
    • Milestone: 5.24.02 --> 5.24.09
     
  • Gary Lee

    Gary Lee - 2025-03-29
    • Milestone: 5.25.03 --> 5.25.09
     
  • Gary Lee

    Gary Lee - 2026-02-27
    • Milestone: 5.26.02 --> future
     
