[Keepalived-announce] Weird stuff on transition from master to backup or from backup to master, some weirdness with iptables


Hello, 

We are running a clustered firewall on Linux with iptables (latest
kernel 2.6.17.4, iptables 1.3.5). Until now it has been working as a
simple router without any kind of packet filtering.

We have 7 networks and 2 interfaces on each firewall. On eth0 we have a
small subnet with 16 addresses, and on eth1 we have 3 class C networks
divided into smaller subnets with separate virtual interfaces, using
keepalived (version 1.1.12).

The problem described below has been there since the very beginning,
and we now have to solve it because we cannot seem to use iptables
properly with this setup.

After days of searching Google we decided to ask for help here.
These are the errors on the secondary node when we force a failover (by
stopping keepalived on the primary node). It switches over but logs a
few errors:

Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_5) Transition to MASTER STATE
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Group(DG1) Syncing instances to MASTER state
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_6) Transition to MASTER STATE
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_6) Entering MASTER STATE
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912322, pid=0
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912323, pid=0
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912324, pid=0
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912325, pid=0
Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912326, pid=0
Jul 14 23:44:53 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_5) Entering MASTER STATE
Jul 14 23:44:53 dg-fw2-n2 Keepalived_vrrp: Netlink: error: File exists, type=(24), seq=1152912328, pid=0
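
As a reading aid for the log above, here is a minimal sketch that decodes
the two Netlink error lines. It assumes the type=(N) field is the
rtnetlink message type number from <linux/rtnetlink.h>; the helper name
decode() and the lookup tables are made up for this illustration and are
not part of keepalived.

```python
import re

# A few rtnetlink message types from <linux/rtnetlink.h>.
RTM_TYPES = {
    16: "RTM_NEWLINK",
    17: "RTM_DELLINK",
    20: "RTM_NEWADDR",
    21: "RTM_DELADDR",
    24: "RTM_NEWROUTE",
    25: "RTM_DELROUTE",
}

# The two error strings seen in the log and their errno names.
ERRNO_NAMES = {"No such device": "ENODEV", "File exists": "EEXIST"}

NETLINK_RE = re.compile(
    r"Netlink: error: (?P<msg>[^,]+), type=\((?P<type>\d+)\)"
)


def decode(line):
    """Return (message type, errno name) for a Netlink error line, else None."""
    match = NETLINK_RE.search(line)
    if match is None:
        return None
    rtm = RTM_TYPES.get(int(match.group("type")), "unknown")
    err = ERRNO_NAMES.get(match.group("msg").strip(), "unknown")
    return rtm, err


samples = [
    "Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912322, pid=0",
    "Keepalived_vrrp: Netlink: error: File exists, type=(24), seq=1152912328, pid=0",
    "Keepalived_vrrp: VRRP_Instance(VI_5) Entering MASTER STATE",
]
for line in samples:
    print(decode(line))
```

Under that assumption both kinds of error come from route-add requests
(type 24), not from adding the addresses themselves.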

-- 
I am a bit confused, since the IP information shown here seems to be
correct; all netmasks and broadcast addresses are fine too.
For confidentiality reasons I have substituted fake addresses for the
real ones. We use only public addresses on this firewall, with no NAT
or masquerading.

dg-fw2-n1 ~ # ip addr show
1: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:30:48:88:25:ba brd ff:ff:ff:ff:ff:ff
    inet 185.113.152.7/28 brd 185.113.152.15 scope global eth0
    inet 185.113.152.3/28 brd 185.113.152.15 scope global secondary eth0
2: eth1: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:30:48:88:25:bb brd ff:ff:ff:ff:ff:ff
    inet 185.113.148.130/25 brd 185.113.148.255 scope global eth1
    inet 185.113.149.129/25 brd 185.113.149.255 scope global eth1
    inet 185.113.149.1/26 brd 185.113.149.63 scope global eth1
    inet 185.113.149.65/27 brd 185.113.149.95 scope global eth1
    inet 185.113.149.97/27 brd 185.113.149.127 scope global eth1
    inet 185.113.150.1/26 brd 185.113.150.63 scope global eth1
    inet 185.113.148.129/25 brd 185.113.148.255 scope global secondary eth1
3: lo: <LOOPBACK,UP,10000> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
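
The claim that the netmasks and broadcast addresses are consistent can be
checked mechanically. The sketch below is only an illustration: it copies
the (substituted) address/broadcast pairs from the ip addr output above
and compares each broadcast with the one implied by the prefix length.
The helper name broadcast_ok() is hypothetical.

```python
import ipaddress

# (address/prefix, broadcast) pairs copied from the `ip addr show` output.
ADDRS = [
    ("185.113.152.7/28", "185.113.152.15"),
    ("185.113.152.3/28", "185.113.152.15"),
    ("185.113.148.130/25", "185.113.148.255"),
    ("185.113.149.129/25", "185.113.149.255"),
    ("185.113.149.1/26", "185.113.149.63"),
    ("185.113.149.65/27", "185.113.149.95"),
    ("185.113.149.97/27", "185.113.149.127"),
    ("185.113.150.1/26", "185.113.150.63"),
    ("185.113.148.129/25", "185.113.148.255"),
]


def broadcast_ok(cidr, brd):
    """True if brd is the broadcast address implied by cidr's prefix length."""
    iface = ipaddress.ip_interface(cidr)
    return iface.network.broadcast_address == ipaddress.ip_address(brd)


mismatches = [(cidr, brd) for cidr, brd in ADDRS if not broadcast_ok(cidr, brd)]
print(mismatches if mismatches else "all broadcast addresses match their prefixes")
```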

Real interface data from the primary node:
eth0      Link encap:Ethernet  HWaddr 00:30:48:88:25:BA
          inet addr:185.113.152.7  Bcast:185.113.152.15  Mask:255.255.255.240
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:462317963 errors:0 dropped:0 overruns:0 frame:0
          TX packets:506140499 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2195883065 (2094.1 Mb)  TX bytes:2199081298 (2097.2 Mb)
          Interrupt:177

eth1      Link encap:Ethernet  HWaddr 00:30:48:88:25:BB
          inet addr:185.113.148.130  Bcast:185.113.148.255  Mask:255.255.255.128
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:573884689 errors:0 dropped:0 overruns:0 frame:0
          TX packets:529048273 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3900549887 (3719.8 Mb)  TX bytes:3875052168 (3695.5 Mb)
          Interrupt:185

The secondary node has the same configuration, except for .8 on eth0
and .131 on eth1. At this point the primary node is the active one.
Whether we switch from primary to secondary or back again, we get the
same errors.

Here is the routing info from the active node:
-- 
dg-fw2-n1 ~ # ip route list 
185.113.152.0/28 dev eth0 proto kernel scope link src 185.113.152.7 
185.113.149.64/27 dev eth1 proto kernel scope link src 185.113.149.65 
185.113.149.96/27 dev eth1 proto kernel scope link src 185.113.149.97 
185.113.150.0/26 dev eth1 proto kernel scope link src 185.113.150.1 
185.113.149.0/26 dev eth1 proto kernel scope link src 185.113.149.1 
185.113.149.128/25 dev eth1 proto kernel scope link src 185.113.149.129 
185.113.148.128/25 dev eth1 proto kernel scope link src 185.113.148.130 
127.0.0.0/8 dev lo scope link 
default via 185.113.152.1 dev eth0
--

And here is keepalived.conf from the primary node:

vrrp_sync_group DG1 {
    group {
        VI_5
        VI_6
    }
}

vrrp_instance VI_5 {
    interface eth0
    state MASTER
    virtual_router_id 55
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass something
    }
    virtual_ipaddress {
        185.113.152.3/28 brd 185.113.152.15 dev eth0
    }
    virtual_routes {
        0.0.0.0/0 via 185.113.152.1
    }
}

vrrp_instance VI_6 {
    interface eth1
    state MASTER
    virtual_router_id 56
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass something
    }
    virtual_ipaddress {
        185.113.148.129/25 brd 185.113.148.255 dev eth1
        185.113.149.129/25 brd 185.113.149.255 dev eth1
        185.113.149.1/26 brd 185.113.149.63 dev eth1
        185.113.149.65/27 brd 185.113.149.95 dev eth1
        185.113.149.97/27 brd 185.113.149.127 dev eth1
        185.113.150.1/26 brd 185.113.150.63 dev eth1
    }
    virtual_routes {

        185.113.149.128/25 via dev eth1
        185.113.149.0/26 via dev eth1
        185.113.149.64/27 via dev eth1
        185.113.149.96/27 via dev eth1
        185.113 150.0/26 via dev eth1
    }
}
-------------------------- 
The secondary node has the same setup, only with a lower priority
setting.
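
One thing that may be worth checking in the virtual_routes blocks is
whether every "via" keyword is followed by a gateway address. The sketch
below is a rough lint, not keepalived's own parser: it assumes the
documented form "via <gateway address>" and simply tests whether the
token after "via" parses as an IP address. The helper name
gateway_problem() is hypothetical, and the route lines are copied
verbatim from the configuration above.

```python
import ipaddress

# Route lines copied verbatim from the two virtual_routes blocks.
ROUTES = [
    "0.0.0.0/0 via 185.113.152.1",
    "185.113.149.128/25 via dev eth1",
    "185.113.149.0/26 via dev eth1",
    "185.113.149.64/27 via dev eth1",
    "185.113.149.96/27 via dev eth1",
    "185.113 150.0/26 via dev eth1",
]


def gateway_problem(route_line):
    """Return a description if 'via' lacks a valid gateway address, else None."""
    tokens = route_line.split()
    if "via" not in tokens:
        return None
    idx = tokens.index("via") + 1
    if idx >= len(tokens):
        return "nothing after 'via'"
    try:
        ipaddress.ip_address(tokens[idx])
    except ValueError:
        return "'%s' after 'via' is not an IP address" % tokens[idx]
    return None


flagged = [(line, gateway_problem(line)) for line in ROUTES if gateway_problem(line)]
for line, problem in flagged:
    print("%-35s %s" % (line, problem))
print(len(flagged), "route line(s) flagged")
```

The lint flags the five eth1 routes, the same number as the "No such
device" errors in the log. That may be a coincidence, so treat it as a
lead to verify and not as a diagnosis.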

2. Problem no. 2: if we enable iptables with a basic stateful setup
(ESTABLISHED, RELATED, NEW), then when we connect from 185.113.148.252
we see INVALID state entries in the log files for the return traffic
from the web server at 185.113.149.69, meaning that something is
happening to the packets traversing the firewall.
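
To show which path this traffic takes, the sketch below looks up both
hosts in the routing table listed earlier, using a simple longest-prefix
match. The route list is copied from the ip route output (substituted
addresses); egress_dev() is a hypothetical helper, not a real tool.

```python
import ipaddress

# Connected routes from `ip route list` on the active node.
ROUTES = [
    ("185.113.152.0/28", "eth0"),
    ("185.113.149.64/27", "eth1"),
    ("185.113.149.96/27", "eth1"),
    ("185.113.150.0/26", "eth1"),
    ("185.113.149.0/26", "eth1"),
    ("185.113.149.128/25", "eth1"),
    ("185.113.148.128/25", "eth1"),
]
DEFAULT_DEV = "eth0"  # default via 185.113.152.1 dev eth0


def egress_dev(host):
    """Return (network, device) of the longest-prefix route matching host."""
    addr = ipaddress.ip_address(host)
    best = None
    for cidr, dev in ROUTES:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    if best is None:
        return ("default", DEFAULT_DEV)
    return (str(best[0]), best[1])


for host in ("185.113.148.252", "185.113.149.69"):
    print(host, "->", egress_dev(host))
```

Both hosts resolve to subnets on eth1, so this traffic enters and leaves
the firewall on the same interface. Whether that has anything to do with
the INVALID states is an open question.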

However, if we switch iptables off completely, communication works fine
with no sign of problems and no invalid packets.

Could this be related to the errors we are getting on failover?

Of course, we have IP forwarding enabled.
Are there any additional sysctl / proc parameters that we need to
modify?

We have also tried fwbuilder with just one any/any/any accept rule. It
generates only a few lines of code for ESTABLISHED,RELATED and NEW
connections and drops everything else. With that we can reach
everything out on the internet, but at the same time traffic between
our own subnets suffers enormous delays. Interestingly, some of the
servers respond extremely fast, which confuses me.

I would appreciate any hints or tips on where we should look. We will
provide any debug information you need, since we have run out of ideas
about what could be wrong with our setup.

Best regards, 

Mike.
