[Keepalived-announce] Weird stuff on transition from master to backup or from backup to master, som
From: Mike J. <mi...@fr...> - 2006-07-17 08:42:46
Hello,

We're running a clustered firewall on Linux with iptables (latest kernel 2.6.17.4 with iptables 1.3.5), and until now it has been working as a simple router without any kind of packet filtering. We have 7 networks and 2 interfaces on each firewall: on eth0 we have a small subnet with 16 addresses, and on eth1 we have divided 3 class C networks into smaller subnets with separate virtual interfaces, using keepalived (version 1.1.12).

The problem I am describing has been there since the very beginning, and now we have to solve it, since we can't seem to use iptables properly with this setup. After days of searching Google we decided to ask for help here.

Here are the errors on the secondary node when we force a failover (take down keepalived on the primary node). It switches over but gives a few errors:

    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_5) Transition to MASTER STATE
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Group(DG1) Syncing instances to MASTER state
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_6) Transition to MASTER STATE
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_6) Entering MASTER STATE
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912322, pid=0
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912323, pid=0
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912324, pid=0
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912325, pid=0
    Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: No such device, type=(24), seq=1152912326, pid=0
    Jul 14 23:44:53 dg-fw2-n2 Keepalived_vrrp: VRRP_Instance(VI_5) Entering MASTER STATE
    Jul 14 23:44:53 dg-fw2-n2 Keepalived_vrrp: Netlink: error: File exists, type=(24), seq=1152912328, pid=0

--

I am a bit confused, since the IP information shown here seems to be correct; all netmasks and
broadcast addresses are fine too. For confidentiality reasons I am going to substitute the real addresses with fake ones; we are using only public addresses on this firewall, no NAT or masquerading.

    dg-fw2-n1 ~ # ip addr show
    1: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
        link/ether 00:30:48:88:25:ba brd ff:ff:ff:ff:ff:ff
        inet 185.113.152.7/28 brd 185.113.152.15 scope global eth0
        inet 185.113.152.3/28 brd 185.113.152.15 scope global secondary eth0
    2: eth1: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
        link/ether 00:30:48:88:25:bb brd ff:ff:ff:ff:ff:ff
        inet 185.113.148.130/25 brd 185.113.148.255 scope global eth1
        inet 185.113.149.129/25 brd 185.113.149.255 scope global eth1
        inet 185.113.149.1/26 brd 185.113.149.63 scope global eth1
        inet 185.113.149.65/27 brd 185.113.149.95 scope global eth1
        inet 185.113.149.97/27 brd 185.113.149.127 scope global eth1
        inet 185.113.150.1/26 brd 185.113.150.63 scope global eth1
        inet 185.113.148.129/25 brd 185.113.148.255 scope global secondary eth1
    3: lo: <LOOPBACK,UP,10000> mtu 16436 qdisc noqueue
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 brd 127.255.255.255 scope host lo

Real interface data from the primary node:

    eth0      Link encap:Ethernet  HWaddr 00:30:48:88:25:BA
              inet addr:185.113.152.7  Bcast:185.113.152.15  Mask:255.255.255.240
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:462317963 errors:0 dropped:0 overruns:0 frame:0
              TX packets:506140499 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:2195883065 (2094.1 Mb)  TX bytes:2199081298 (2097.2 Mb)
              Interrupt:177

    eth1      Link encap:Ethernet  HWaddr 00:30:48:88:25:BB
              inet addr:185.113.148.130  Bcast:185.113.148.255  Mask:255.255.255.128
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:573884689 errors:0 dropped:0 overruns:0 frame:0
              TX packets:529048273 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:3900549887 (3719.8 Mb)  TX bytes:3875052168 (3695.5 Mb)
              Interrupt:185

The secondary node has the same, except .8 on eth0 and .131 on eth1.

At this point the primary node is the active one. When we switch either from primary to secondary or back, we get the same errors.

Here is the routing info from the active node:

--

    dg-fw2-n1 ~ # ip route list
    185.113.152.0/28 dev eth0  proto kernel  scope link  src 185.113.152.7
    185.113.149.64/27 dev eth1  proto kernel  scope link  src 185.113.149.65
    185.113.149.96/27 dev eth1  proto kernel  scope link  src 185.113.149.97
    185.113.150.0/26 dev eth1  proto kernel  scope link  src 185.113.150.1
    185.113.149.0/26 dev eth1  proto kernel  scope link  src 185.113.149.1
    185.113.149.128/25 dev eth1  proto kernel  scope link  src 185.113.149.129
    185.113.148.128/25 dev eth1  proto kernel  scope link  src 185.113.148.130
    127.0.0.0/8 dev lo  scope link
    default via 185.113.152.1 dev eth0

--

And keepalived.conf from the primary node:

    vrrp_sync_group DG1 {
        group {
            VI_5
            VI_6
        }
    }

    vrrp_instance VI_5 {
        interface eth0
        state MASTER
        virtual_router_id 55
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass something
        }
        virtual_ipaddress {
            185.113.152.3/28 brd 185.113.152.15 dev eth0
        }
        virtual_routes {
            0.0.0.0/0 via 185.113.152.1
        }
    }

    vrrp_instance VI_6 {
        interface eth1
        state MASTER
        virtual_router_id 56
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass something
        }
        virtual_ipaddress {
            185.113.148.129/25 brd 185.113.148.255 dev eth1
            185.113.149.129/25 brd 185.113.149.255 dev eth1
            185.113.149.1/26 brd 185.113.149.63 dev eth1
            185.113.149.65/27 brd 185.113.149.95 dev eth1
            185.113.149.97/27 brd 185.113.149.127 dev eth1
            185.113.150.1/26 brd 185.113.150.63 dev eth1
        }
        virtual_routes {
            185.113.149.128/25 via dev eth1
            185.113.149.0/26 via dev eth1
            185.113.149.64/27 via dev eth1
            185.113.149.96/27 via dev eth1
            185.113 150.0/26 via dev eth1
        }
    }

--------------------------

The same setup is on the secondary node, just with lower priority settings.

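The statement that all netmasks and broadcast addresses are fine can be double-checked mechanically. The following is a minimal sketch, assuming Python 3 and its standard `ipaddress` module; the address/broadcast pairs are copied from the virtual_ipaddress blocks of the posted (anonymized) keepalived.conf, and the function name `check_broadcasts` is made up for this illustration.

```python
import ipaddress

# (address/prefix, configured broadcast) pairs copied from the
# virtual_ipaddress blocks of the posted, anonymized keepalived.conf.
VIRTUAL_IPS = [
    ("185.113.152.3/28", "185.113.152.15"),
    ("185.113.148.129/25", "185.113.148.255"),
    ("185.113.149.129/25", "185.113.149.255"),
    ("185.113.149.1/26", "185.113.149.63"),
    ("185.113.149.65/27", "185.113.149.95"),
    ("185.113.149.97/27", "185.113.149.127"),
    ("185.113.150.1/26", "185.113.150.63"),
]


def check_broadcasts(entries):
    """Return (cidr, configured, computed) for every entry whose configured
    broadcast differs from the one derived from the address and prefix.

    Hypothetical helper, not part of keepalived or iproute2.
    """
    mismatches = []
    for cidr, configured in entries:
        # ip_interface keeps the host address and exposes its network.
        computed = str(ipaddress.ip_interface(cidr).network.broadcast_address)
        if computed != configured:
            mismatches.append((cidr, configured, computed))
    return mismatches


if __name__ == "__main__":
    for cidr, configured, computed in check_broadcasts(VIRTUAL_IPS):
        print(f"{cidr}: configured brd {configured}, computed {computed}")
```

With the values as posted, the mismatch list comes out empty, which agrees with the observation that the address data itself looks consistent. The check says nothing about the virtual_routes entries, which are not covered here.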
Problem no. 2: if we enable iptables and use a basic setup with states like ESTABLISHED, RELATED, NEW, then when we access from IP 185.113.148.252 we get INVALID state entries in the log files on the way back from the web server at 185.113.149.69, meaning that something is happening to the packets traversing the firewall. However, if we switch off iptables completely, communication works fine without any sign of problems and no invalid packets. I am wondering, can it be related to those errors we're getting on failover?

Of course, we have IP forwarding enabled. Are there any specific additional parameters for sysctl / proc that we need to modify?

We have tried fwbuilder with just one open any/any/any accept rule; it generates only a few lines of code for ESTABLISHED, RELATED and NEW connections and drops everything else. We are able to access everything out on the internet, but at the same time traffic between our subnets goes with enormous delays. What is interesting is that some of the servers respond extremely fast, and that is something that confuses me.

I would appreciate any hints or tips on which direction we should look in. We will provide any debug information you need, since we have no more ideas about what can be wrong with our setup.

Best regards,
Mike.
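
As a starting point for reading the Netlink errors in the failover log, here is a minimal sketch, assuming Python 3, that decodes the `type=(N)` field and the error text of each logged line. The numeric message types are the ones defined in the Linux header linux/rtnetlink.h (only a subset is listed); the regular expression is written against the log lines quoted in this message, and the names `RTM_TYPES`, `ERRNO_BY_TEXT` and `decode` are made up for this illustration.

```python
import re

# Subset of rtnetlink message types as defined in linux/rtnetlink.h.
RTM_TYPES = {
    16: "RTM_NEWLINK",
    17: "RTM_DELLINK",
    20: "RTM_NEWADDR",
    21: "RTM_DELADDR",
    24: "RTM_NEWROUTE",
    25: "RTM_DELROUTE",
}

# The two error strings that appear in the quoted log, with their
# usual errno symbols on Linux.
ERRNO_BY_TEXT = {
    "No such device": "ENODEV",
    "File exists": "EEXIST",
}

# Pattern assumed from the log lines quoted in the message, e.g.
# "Netlink: error: No such device, type=(24), seq=..., pid=0".
LINE_RE = re.compile(r"Netlink: error: (?P<text>.+?), type=\((?P<type>\d+)\)")


def decode(line):
    """Return (message_type_name, errno_name) for a Netlink error line,
    or None if the line does not match. Hypothetical helper."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    type_name = RTM_TYPES.get(int(match.group("type")), "unknown")
    errno_name = ERRNO_BY_TEXT.get(match.group("text"), "unknown")
    return type_name, errno_name


if __name__ == "__main__":
    sample = ("Jul 14 23:44:52 dg-fw2-n2 Keepalived_vrrp: Netlink: error: "
              "No such device, type=(24), seq=1152912322, pid=0")
    print(decode(sample))
```

Type 24 decodes to RTM_NEWROUTE, so the failing requests appear to be route additions rather than address additions. One possible reading, which is an inference and not something the log states, is that the five "No such device" errors correspond to the five virtual_routes entries in VI_6 and the "File exists" error to the default route in VI_5 that is already present in the routing table.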