Active SC1 and PL4 lost connect suddenly ( may be due to the environment reason),they re-established contact but PL4 sync failed due to PL4 did not update active SC1 information and discard message from IMMD SC1
PL4 sync failed log
2022-09-22 04:07:05.230 DEBUG: Syncing node PL-4 (timeout=120)
2022-09-22 04:08:06.325 WARNING: waiting more than 60 sec for node PL-4 to sync
PL4 discard message from SC1 log
2022-09-22 04:07:08.406 PL-4 osafimmnd[354]: WA DISCARD message from IMMD 2010f as ACT:0 SBY:2020f
2022-09-22 04:07:09.013 PL-4 osafimmnd[354]: message repeated 243 times: [ WA DISCARD message from IMMD 2010f as ACT:0 SBY:2020f]
step to reproduce.
1.start SCs, PLs.
2.Block traffic SC1 and PL4 ( make sure block traffic after IMM State : IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT ).
3.Unblock traffic SC1 and PL4.
Diff:
Hi Son Tran Ngoc,
I am trying to reproduce the issue using iptable by removing and adding ip rules. I am not able to reproduce the issue.
I tried this iptable rule to block the traffic:
iptables -I INPUT -s 192.168.56.122 -j DROP
I tried this iptable rule to allow the traffic again:
iptables --delete INPUT 1
syslog of pl-4:
Oct 7 12:39:47 pl3-VirtualBox osafimmnd[23011]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Oct 7 12:39:50 pl3-VirtualBox osafdtmd[22985]: ER recv() from node 0x2010f failed, errno=110
Oct 7 12:39:50 pl3-VirtualBox osafdtmd[22985]: NO Lost contact with 'SC-1'
Oct 7 12:39:50 pl3-VirtualBox osafimmnd[23011]: ER No IMMD service => cluster restart, exiting
Oct 7 12:39:50 pl3-VirtualBox opensafd[22972]: ER Failed #012 DESC:IMMND
Oct 7 12:39:50 pl3-VirtualBox opensafd[22972]: ER Going for recovery
Oct 7 12:39:50 pl3-VirtualBox opensafd[22972]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-immnd attempt #1
Oct 7 12:39:50 pl3-VirtualBox osafdtmd[22985]: NO Established contact with 'SC-1'
Can you please let me know how to block traffic between SC1 and PL4 at IMM State : IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Thanks
Mohan
Hi Mohan Kanakam,
This is detail steps
1.start SCs
2.start PLs.
3.enable network delay betwen SC1 and PL4
run cmd in PL4
tc qdisc add dev eth0 root handle 1: htb default 0
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit prio 2
tc filter add dev eth0 parent 1: protocol ip handle 800::1 prio 1 u32 flowid 1:1 match ip dst 10.0.3.101
tc qdisc add dev eth0 parent 1:1 handle 100: netem delay 1000ms
4.block traffic SC1 and PL4
-waiting for IMM State : IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT by using cmd on PL4
eg: timeout 120 tail -n 0 -F /var/log/syslog | (grep -m 1 "SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT"; pkill -P $$ timeout)
-block traffic SC1 and PL4
5.disable network delay between SC1 and PL4
run cmd in PL4
tc filter del dev eth0 parent 1: prio 1 handle 800::1 u32
6.unblock traffic SC1 and PL4.
Note :SC1 and PL4 sync very quickly cause cmd waiting for IMM State : IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT does not work so
we need to step 4.enable network delay between SC1 and PL4
I hope it could help you to reproduce this case.
Regards,
Tran Ngoc Son
Last edit: Son Tran Ngoc 2022-10-12
Hi Son Tran Ngoc,
Thanks for the steps.
I tried to reproduce the issue but i did not succeed.
these are the steps i performed to reproduce the issue:
1) I started the SC-1,SC-2 and PL-6 with 100k objects.
2) I enabled the network delay between SC-1 and PL-6 by running following command on PL-6:
tc qdisc add dev enp0s9 root handle 1: htb default 0
tc class add dev enp0s9 parent 1: classid 1:1 htb rate 100mbit prio 2
tc filter add dev enp0s9 parent 1: protocol ip handle 800::1 prio 1 u32 flowid 1:1 match ip dst 192.168.56.122
tc qdisc add dev enp0s9 parent 1:1 handle 100: netem delay 1000ms
3) I restarted the PL-6 and run the following command on PL-6:
timeout 180 tail -n 0 -F /var/log/syslog | (grep -m 1 "SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT"; pkill -P $$ timeout) ;
4) I disabled the network delay between SC-1 and PL-6 by running following command on PL-6:
tc filter del dev enp0s9 parent 1: prio 1 handle 800::1 u32
SC-1 log:
Nov 17 18:46:23 vem1-VirtualBox osafimmd[28119]: NO MDS event from svc_id 25 (change:3, dest:569611447702886)
Nov 17 18:46:25 vem1-VirtualBox osafimmd[28119]: NO Node 2060f request sync sync-pid:5478 epoch:0
Nov 17 18:46:25 vem1-VirtualBox osafimmnd[28131]: NO Announce sync, epoch:10
Nov 17 18:46:25 vem1-VirtualBox osafimmnd[28131]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Nov 17 18:46:25 vem1-VirtualBox osafimmnd[28131]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Nov 17 18:46:25 vem1-VirtualBox osafimmd[28119]: NO Successfully announced sync. New ruling epoch:10
Nov 17 18:46:25 vem1-VirtualBox osafimmloadd: logtrace: trace enabled to file 'osafimmnd', mask=0x0
Nov 17 18:46:25 vem1-VirtualBox osafimmloadd: NO Sync starting
Nov 17 18:46:30 vem1-VirtualBox dhclient[884]: DHCPREQUEST of 192.168.56.122 on enp0s9 to 192.168.56.100 port 67 (xid=0x3b4cc293)
Nov 17 18:46:30 vem1-VirtualBox dhclient[884]: DHCPACK of 192.168.56.122 from 192.168.56.100
Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3675] address 192.168.56.122
Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3678] plen 24 (255.255.255.0)
Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3681] server identifier 192.168.56.100
Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3684] lease time 600
Nov 17 18:46:30 vem1-VirtualBox NetworkManager[726]: <info> [1668690990.3686] dhcp4 (enp0s9): state changed bound -> bound
Nov 17 18:46:30 vem1-VirtualBox dbus[707]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Nov 17 18:46:30 vem1-VirtualBox systemd[1]: Starting Network Manager Script Dispatcher Service...
Nov 17 18:46:30 vem1-VirtualBox dbus[707]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Nov 17 18:46:30 vem1-VirtualBox systemd[1]: Started Network Manager Script Dispatcher Service.
Nov 17 18:46:30 vem1-VirtualBox dhclient[884]: bound to 192.168.56.122 -- renewal in 289 seconds.
Nov 17 18:46:30 vem1-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: new request (1 scripts)
Nov 17 18:46:30 vem1-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: start running ordered scripts...
Nov 17 18:48:15 vem1-VirtualBox osafimmloadd: IN Synced 100627 objects in total
Nov 17 18:48:15 vem1-VirtualBox osafimmnd[28131]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 19360</info></info></info></info></info>
Nov 17 18:46:25 pl3-VirtualBox osafimmnd[5478]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Nov 17 18:47:51 pl3-VirtualBox dhclient[849]: DHCPREQUEST of 192.168.56.107 on enp0s9 to 192.168.56.100 port 67 (xid=0x3015643a)
Nov 17 18:47:51 pl3-VirtualBox dhclient[849]: DHCPACK of 192.168.56.107 from 192.168.56.100
Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6233] address 192.168.56.107
Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6237] plen 24 (255.255.255.0)
Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6240] server identifier 192.168.56.100
Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6242] lease time 600
Nov 17 18:47:51 pl3-VirtualBox NetworkManager[650]: <info> [1668691071.6245] dhcp4 (enp0s9): state changed bound -> bound
Nov 17 18:47:51 pl3-VirtualBox dbus[635]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Nov 17 18:47:51 pl3-VirtualBox systemd[1]: Starting Network Manager Script Dispatcher Service...
Nov 17 18:47:51 pl3-VirtualBox dbus[635]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Nov 17 18:47:51 pl3-VirtualBox dhclient[849]: bound to 192.168.56.107 -- renewal in 260 seconds.
Nov 17 18:47:51 pl3-VirtualBox systemd[1]: Started Network Manager Script Dispatcher Service.
Nov 17 18:47:51 pl3-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: new request (1 scripts)
Nov 17 18:47:51 pl3-VirtualBox nm-dispatcher: req:1 'dhcp4-change' [enp0s9]: start running ordered scripts...
Nov 17 18:48:16 pl3-VirtualBox osafimmnd[5478]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2846
Nov 17 18:48:16 pl3-VirtualBox osafimmnd[5478]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE</info></info></info></info></info>
Thanks
Mohan
Hi Mohan Kanakam,
I see you didn't drop the connect between SC and PL.
Please try to reproduce it.
Regards,
Tran Ngoc Son
Hi Son,
Another scenario does not work, e.g, swap OpenSAF SI (2N) b/w step 2 and 3.
I think the current issue is the one of split-brain use case and solution could be considered is to reboot the PL.
B.R/Thang