From: Vicente A. <bi...@bi...> - 2009-03-17 17:04:26
Hi,

I have an issue on my servers related to both ucarp and the e1000 drivers, thus the crossposting. :-)

I think that during system boot the e1000 driver (e1000e too) reports to the OS that the link is up some seconds before it really is.

Server & module info:

Red Hat Enterprise Linux ES release 4 (Nahant)

filename: /lib/modules/2.6.9-5.ELsmp/kernel/drivers/net/e1000/e1000.ko
parm: copybreak:Maximum size of packet that is copied to a new buffer on receive
author: Intel Corporation, <lin...@in...>
description: Intel(R) PRO/1000 Network Driver
license: GPL
version: 7.5.5-NAPI

ucarp 1.2

Networking is configured on rc2.d/S10network and ucarp on S98ucarp.

This is what happens: after a reboot of the master server, which is configured with preemption so that it becomes master again once it is back online, the virtual IP was unresponsive. We took some tcpdumps and found that the gratuitous ARP that ucarp sends when going to master state wasn't reaching the router, so in the router's ARP table the virtual IP still pointed to the secondary server's MAC address.

On syslog on the primary server we have:

Mar 17 13:45:48 server1 network: Bringing up interface eth2: succeeded
Mar 17 13:45:54 server1 ucarp[2489]: [INFO] Local advertised ethernet address is [00:15:17:58:19:08]
Mar 17 13:45:54 server1 ucarp[2489]: [WARNING] Spawning [/opt/VIP/servicioVIP_add.sh eth2]
Mar 17 13:45:54 server1 ucarp[2489]: [WARNING] Switching to state: MASTER
Mar 17 13:46:12 server1 kernel: e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
Mar 17 13:46:13 server1 kernel: e1000: eth2: e1000_watchdog_task: 10/100 speed: disabling TSO
Mar 17 13:46:13 server1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 100 Mbps Half Duplex, Flow Control: None

So it seems that, while the network is configured before ucarp is launched (S10 vs S98), the cards (or the driver?) don't have link until some 25 seconds after running the network startup script.
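The gap between these events can be read straight off the syslog timestamps. Here is a minimal sketch in Python, assuming the classic syslog timestamp format shown above (which carries no year, so 2009 is supplied); the helper names `timestamp` and `first_match` are purely illustrative:

```python
from datetime import datetime

# Three lines copied from the syslog excerpt above.
LOG = """\
Mar 17 13:45:48 server1 network: Bringing up interface eth2: succeeded
Mar 17 13:45:54 server1 ucarp[2489]: [WARNING] Switching to state: MASTER
Mar 17 13:46:13 server1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 100 Mbps Half Duplex, Flow Control: None
"""

def timestamp(line, year=2009):
    # Classic syslog lines start with "Mon DD HH:MM:SS" (15 characters)
    # and omit the year, so the year is an assumption passed in here.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def first_match(log, needle):
    # Return the timestamp of the first line containing `needle`.
    for line in log.splitlines():
        if needle in line:
            return timestamp(line)
    raise ValueError(f"no log line contains {needle!r}")

net_up = first_match(LOG, "Bringing up interface eth2")
master = first_match(LOG, "Switching to state: MASTER")
link_up = first_match(LOG, "NIC Link is Up")

# Seconds between the network script finishing and the driver reporting link.
link_delay = (link_up - net_up).total_seconds()
# Seconds between ucarp becoming master and the driver reporting link.
master_lead = (link_up - master).total_seconds()

print(f"link reported up {link_delay:.0f} s after the network script")
print(f"ucarp became master {master_lead:.0f} s before the link was reported up")
```

The same two helpers could be pointed at a full /var/log/messages to check whether the delay is consistent across reboots.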
So when ucarp runs, the network still isn't really working. ucarp sends the gratuitous ARP but it gets lost. After some seconds the link comes up and the heartbeats reach the secondary server, which goes into backup state and releases the VIP. But, as the router hasn't received the gratuitous ARP, in its table the VIP still belongs to the secondary server. All traffic to the VIP gets routed to the secondary server, which drops it as it no longer recognizes the VIP. This last point was verified with dumps on both the router and the secondary server, and by looking at the ARP table on the router.

There are two things that make me think the driver has to do with this issue:

- I've talked with the people in charge of all the networking systems and there has been no flapping on the port the server is plugged into. In other words, according to the switch (Cisco Catalyst 4510), that link has never gone down.

- I've inserted both a mii-tool and an ethtool call in the ucarp startup script, just before launching ucarp. According to both of them the link is UP at that moment. But according to the e1000_watchdog messages on syslog, the link goes UP a couple of seconds after that!!! And in any case the first packets sent by ucarp never leave the server.

Besides, after all this testing I've tried upgrading the driver to the latest e1000e-0.5.11.2. Same problem, same log traces (bring up interface succeeded -> ucarp runs -> link UP), same behavior when studying the traffic with dumps.

On a side note: the VIP works with ucarp 1.5. The first gratuitous ARP still gets lost, but ucarp sends an additional one when the link comes up and it receives the heartbeats from the other server, "fixing" the router's ARP table at that moment.

So, is this a known issue with the e1000/e1000e drivers? Has anybody else experienced a similar situation? Just after a reboot, apparently having the network up but losing traffic for some seconds?
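Since the first packets never leave the server even though the link is reported as UP, one possible workaround is to delay ucarp until traffic demonstrably gets through, whatever the driver says. The following is only a sketch of that idea, not something tested on these servers: `wait_until` and `gateway_answers` are hypothetical helper names, 192.0.2.1 is a placeholder address standing in for the router, and the probe assumes the Linux iputils `ping` with its `-c` (count) and `-W` (reply timeout in seconds) options.

```python
import subprocess
import time

def wait_until(probe, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or `timeout` seconds have elapsed.

    Returns True as soon as the probe succeeds, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

def gateway_answers(address="192.0.2.1"):
    # Hypothetical probe: send one ICMP echo to the router and wait up to
    # one second for the reply. The address is a placeholder, not the real one.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Intended use from a startup wrapper (left as a comment so that nothing
# touches the network when this file is merely imported):
#
#   if wait_until(gateway_answers, timeout=60):
#       ...start ucarp...
```

The probe is kept separate from the retry loop so that a different check (an ARP request, a TCP connect) can be swapped in without touching the timing logic.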
Why do mii-tool and ethtool report that the link is UP, but it appears as going UP on syslog a couple of seconds after that? Is there any other way to check the link status?

Thanks in advance. Regards

--
Vicente Aguilar <bi...@bi...> | http://www.bisente.com