From: Alexander D. <ale...@in...> - 2010-01-29 01:06:37
On Thu, 2010-01-28 at 15:29 -0800, Покотиленко Костик wrote:
> On Thu, 28/01/2010 at 14:32 -0800, Alexander Duyck wrote:
> > On Wed, 2010-01-27 at 04:14 -0800, Покотиленко Костик wrote:
> > > Using serial console I've figured out:
> > >
> > > - system working fine except for the NIC
> > > - ifconfig shows only RX dropped increasing on eth1 (client side), other
> > > counters stalled.
> > > - ethtool -t eth0:
> > >
> > > The test result is FAIL
> > > The test extra info:
> > > Register test (offline) 0
> > > Eeprom test (offline) 0
> > > Interrupt test (offline) 0
> > > Loopback test (offline) 13
> > > Link test (on/offline) 0
> > >
> > > - ethtool -t eth1
> > >
> > > The test result is FAIL
> > > The test extra info:
> > > Register test (offline) 0
> > > Eeprom test (offline) 0
> > > Interrupt test (offline) 0
> > > Loopback test (offline) 13
> > > Link test (on/offline) 0
> > >
> > > - After doing:
> > >
> > > ifdown -a; rmmod igb; rmmod dca; modprobe igb; ifup -a
> > >
> > > both ethtool commands (The test result is FAIL) and ifconfig show same
> > > result
> > >
> > > So it seems like a NIC hardware hang.
> >
> > The next time this occurs could you go through and run the ethtool test
> > on all of the network ports? I'm wondering if it is only eth0/1 that
> > are blocked or if eth3/4 are stopped as well.
>
> Sure.
>
> > > I don't think this problem is related to something other than NIC / igb
> > > driver. If there are HW problems like memory or power I would notice
> > > other system problems not just NIC, isn't it?
> >
> > I'm wondering if this issue might somehow be a PCIe problem. The fact
> > that the loopback test is failing tells me that the issue is likely
> > somehow related to the NIC's ability to perform DMA transactions since
> > that is essentially all the loopback test does.
> >
> > One of the reasons why I am thinking it is something in the system is
> > because both eth0 and eth1 fail at the same time.
> > From the software's
> > perspective these ports appear as two separate devices, but there are
> > certain physical items that are shared such as the PCIe physical link
> > and it is possible that there may be some sort of issue there that is
> > causing the hangs and resets. By doing an ethtool test on eth3/4 we
> > will at least know if the issue extends to the bridge on the NIC or if
> > it is only eth0/1.
>
> This is one of the most probable sources of the problem I think.
> Considering that we also have had exactly the same problem with e1000e
> onboard cards + deep hang or reboots.
>
> The question is how to debug this.
>
> My guess is that this is due to the HW being too new and maybe some kernel
> subsystem did not have enough testing.
>
> Maybe I should also join some kernel driver list to discuss this
> problem, but don't know which one.
>
> > > If I can do more testing let me know. Moving NIC to other server isn't
> > > option for me.
> > >
> > > The server is quite new, could it be IRQ related problem, i.e.
> > > motherboard not fully supported by <=2.6.30?
> > >
> >
> > I'm not suspecting an IRQ problem because the loopback test doesn't do
> > anything with the interrupts. Also one of the tests that are performed
> > in the ethtool testing is an interrupt test and the fact that it passed
> > means that interrupts are behaving as expected.
>
> One guy in a bug comment said that "pcie_aspm=off" solved the problem with
> 82574L for him. I tried this option hoping it could also help with 82576
> with no luck. He also suggested switching off "PCIe Powermanagement" in
> the BIOS, but I don't have such a BIOS option.

The fact that it had no effect doesn't surprise me much. Based on the
lspci dump you provided ASPM was disabled for all ports so I don't think
the change you made would have had any effect.

> Maybe there are other kernel options to try?
>
> Also, there is BIOS update available from Intel recently, would you
> suggest to update? I have previous one.

I didn't see any notes in the BIOS update that pointed to anything
related to this issue so I wouldn't recommend updating for now.

>
> Planning to try 2.6.32 tomorrow.
>

One other thing you can do which may provide more info would be to do an
lspci -vvv both before and after the hang and see if there are any
differences between the two. If there are differences it would likely
point to, or provide more information on, what is causing the issue.

Thanks,

Alex
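
P.S. The before/after lspci comparison could be scripted roughly as
follows. This is only a sketch: the function names snapshot_pci and
compare_pci and the file paths in the usage comments are made up for
illustration. It assumes lspci (from pciutils) and diff are installed,
and that the snapshots are taken as root, since lspci does not show the
full capability registers to unprivileged users.

```shell
#!/bin/sh
# Hypothetical helpers for comparing PCI state before and after a NIC hang.

# Save the verbose PCI configuration dump to the file given as $1.
# Run as root so that the capability registers are included.
snapshot_pci() {
    lspci -vvv > "$1"
}

# Print a unified diff of two snapshots ($1 = before, $2 = after).
# Prints nothing and returns 0 when the two snapshots are identical.
compare_pci() {
    diff -u "$1" "$2"
}

# Example usage (paths are placeholders):
#   snapshot_pci /root/lspci-before.txt    # while the NIC is still working
#   snapshot_pci /root/lspci-after.txt     # once the hang has occurred
#   compare_pci /root/lspci-before.txt /root/lspci-after.txt
```

Any lines that show up in the diff for the 82576 ports or the bridges
above them would be the first place to look.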