From: Покотиленко К. <ca...@me...> - 2010-01-28 23:30:42
On Thu, 28/01/2010 at 14:32 -0800, Alexander Duyck wrote:
> On Wed, 2010-01-27 at 04:14 -0800, Покотиленко Костик wrote:
> > Using serial console I've figured out:
> >
> > - system working fine except for the NIC
> > - ifconfig show only RX dropped increasing on eth1 (client side), other
> >   counters stalled.
> > - ethtool -t eth0:
> >
> >   The test result is FAIL
> >   The test extra info:
> >   Register test  (offline)   0
> >   Eeprom test    (offline)   0
> >   Interrupt test (offline)   0
> >   Loopback test  (offline)   13
> >   Link test   (on/offline)   0
> >
> > - ethtool -t eth1
> >
> >   The test result is FAIL
> >   The test extra info:
> >   Register test  (offline)   0
> >   Eeprom test    (offline)   0
> >   Interrupt test (offline)   0
> >   Loopback test  (offline)   13
> >   Link test   (on/offline)   0
> >
> > - After doing:
> >
> >   ifdown -a; rmmod igb; rmmod dca; modprobe igb; ifup -a
> >
> >   both ethtool commands (The test result is FAIL) and ifconfig show same
> >   result
> >
> > So it seems like NIC hardware hang.
>
> The next time this occurs could you go through and run the ethtool test
> on all of the network ports? I'm wondering if it is only eth0/1 that
> are blocked or if eth3/4 are stopped as well.

Sure.

> > I don't think this problem is related to something other than NIC / igb
> > driver. If there are HW problems like memory or power I would notice
> > other system problems not just NIC, wouldn't I?
>
> I'm wondering if this issue might somehow be a PCIe problem. The fact
> that the loopback test is failing tells me that the issue is likely
> somehow related to the NIC's ability to perform DMA transactions since
> that is essentially all the loopback test does.
>
> One of the reasons why I am thinking it is something in the system is
> because both eth0 and eth1 fail at the same time. From the software's
> perspective these ports appear as two separate devices, but there are
> certain physical items that are shared such as the PCIe physical link
> and it is possible that there may be some sort of issue there that is
> causing the hangs and resets. By doing an ethtool test on eth3/4 we
> will at least know if the issue extends to the bridge on the NIC or if
> it is only eth0/1.

I think this is one of the most probable sources of the problem,
considering that we have also had exactly the same problem with the
onboard e1000e cards, plus deep hangs or reboots.

The question is how to debug this. My guess is that the hardware is too
new and some kernel subsystem may not have had enough testing on it.

Maybe I should also join some kernel driver list to discuss this
problem, but I don't know which one.

> > If I can do more testing let me know. Moving NIC to other server isn't
> > option for me.
> >
> > The server is quite new, could it be IRQ related problem, i.e.
> > motherboard not fully supported by <=2.6.30?
> >
>
> I'm not suspecting an IRQ problem because the loopback test doesn't do
> anything with the interrupts. Also one of the tests that are performed
> in the ethtool testing is an interrupt test and the fact that it passed
> means that interrupts are behaving as expected.

Someone in a bug comment said that "pcie_aspm=off" solved a problem with
the 82574L for him. I tried this option hoping it would also help with
the 82576, with no luck. He also suggested switching off "PCIe
Powermanagement" in the BIOS, but I don't have such a BIOS option.

Maybe there are other kernel options to try?

Also, Intel recently released a BIOS update. Would you suggest applying
it? I have the previous version.

Planning to try 2.6.32 tomorrow.

-- 
Покотиленко Костик <ca...@me...>
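
For comparing the self-test results across all ports, here is a minimal sketch in Python that picks the non-zero entries out of `ethtool -t` output. It assumes the output looks like the text quoted above (a test name ending in a parenthesised mode, followed by a numeric result). The helper names `parse_selftest`, `failed_tests` and `run_selftest` are made up for illustration and are not part of ethtool or the igb driver.

```python
import re
import subprocess

# Matches result lines such as "Loopback test  (offline)   13".
# Assumption: every result line ends with "(<mode>)" followed by an integer.
_RESULT_LINE = re.compile(r"^\s*(?P<name>.+?\))\s+(?P<code>-?\d+)\s*$")


def parse_selftest(output):
    """Return {test name: result code} parsed from `ethtool -t` text output."""
    results = {}
    for line in output.splitlines():
        match = _RESULT_LINE.match(line)
        if match:
            # Collapse column padding so names compare cleanly across ports.
            name = " ".join(match.group("name").split())
            results[name] = int(match.group("code"))
    return results


def failed_tests(output):
    """Return only the tests whose result code is non-zero."""
    return {name: code for name, code in parse_selftest(output).items() if code != 0}


def run_selftest(iface):
    """Run `ethtool -t <iface>` and return the failing tests.

    Needs root, and the self-test can interrupt traffic on the port.
    The exit status is ignored on purpose; only the text is parsed.
    """
    proc = subprocess.run(["ethtool", "-t", iface], capture_output=True, text=True)
    return failed_tests(proc.stdout)


# Sample taken from the eth0 output quoted in this thread.
SAMPLE = """\
The test result is FAIL
The test extra info:
Register test  (offline)   0
Eeprom test    (offline)   0
Interrupt test (offline)   0
Loopback test  (offline)   13
Link test   (on/offline)   0
"""

if __name__ == "__main__":
    print(failed_tests(SAMPLE))
```

Calling `run_selftest()` for each port the next time the hang occurs would show at a glance whether the loopback failure is limited to eth0/1 or affects the other ports as well.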