This ticket is being submitted in relation to this e-mail conversation on the mailing list: https://sourceforge.net/p/e1000/mailman/e1000-devel/thread/9B4A1B1917080E46B64F07F2989DADD65338D212%40ORSMSX114.amr.corp.intel.com/#msg32580366
I have two ASRock C2750D4I (Avoton C2750, 16GB ECC Memory, 2x Intel i210 Gigabit NIC, Link: http://www.asrockrack.com/general/productdetail.asp?Model=C2750D4I) boards that I've been trying to use as part of a couple of Hadoop nodes.
My problem arises when there is any significant network load (originally from Hadoop, but the problem is reproducible using just netcat). Everything works fine for a while, but then I lose the network connection to the node. "igb ... Detected Tx Unit Hang" is printed to the attached screen and to kern.log.
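For anyone trying to confirm the same symptom, here is a minimal sketch of how the hang messages could be pulled out of a kernel log. It only looks for the string quoted above. The function names are my own, and the log path is whatever your distribution uses (/var/log/kern.log on Ubuntu).

```python
from typing import Iterable, List

# The message text is taken from the report above; nothing else about the
# log line format is assumed.
HANG_MARKER = "Detected Tx Unit Hang"


def find_tx_hangs(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention igb and the Tx Unit Hang message."""
    return [ln.rstrip("\n") for ln in lines if "igb" in ln and HANG_MARKER in ln]


def scan_log(path: str) -> List[str]:
    """Scan a log file (hypothetical helper) and return the matching lines."""
    with open(path, errors="replace") as fh:
        return find_tx_hangs(fh)
```

Calling scan_log("/var/log/kern.log") after the connection drops should list each hang report, which makes it easy to line the timestamps up against the traffic and interrupt figures.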
The systems (which are identical):
ASRock C2750D4I Mother Board
2x8 GB Kingston ECC Memory
2x Intel i210 Gigabit NICs (part of the Mother Boards)
4x1 TB WD Red HDDs
I am currently using the 5.1.2 igb module, as the 5.2.5 version will not build on my system (see http://sourceforge.net/p/e1000/mailman/message/32526226/). This problem also occurs when using the 5.0.5-k version that ships with Ubuntu 14.04.
Once a Tx Unit Hang occurs, the only way I have been able to get the node back on the network is by rebooting. I have tried ifdown and ifup with no effect.
Usually (but not always) there is a large spike in sent data just before the Tx Unit Hang. This is usually (but not always) accompanied by a high number of interrupts on the MSI vectors.
I have tested this using netcat to send a 20 GB file full of random data to and from one of the nodes. The machine at the other end (acting as recipient or sender respectively) was another node that has not been having network issues and has completely different hardware.
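The tests themselves were run with netcat. For anyone who wants a scriptable load generator instead, the following is a rough sketch of an equivalent sender in Python. It is not what was used for the results below: send_random is a hypothetical helper, and it reuses a single random chunk rather than reading a 20 GB file from disk, so disk I/O is taken out of the picture. The receiving end can be any TCP listener that discards the data.

```python
import os
import socket


def send_random(host: str, port: int, total_bytes: int,
                chunk_size: int = 1 << 20) -> int:
    """Stream total_bytes of random data to host:port and return bytes sent."""
    # One random chunk is generated up front and sent repeatedly, so the CPU
    # cost of generating random data does not limit the transmit rate.
    chunk = os.urandom(chunk_size)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(chunk_size, total_bytes - sent)
            sock.sendall(chunk[:n])
            sent += n
    return sent
```

For example, send_random(host, port, 20 * 1024**3) pushes roughly the same volume as the netcat test, with host and port pointing at a listener on the other machine.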
The results from this suggest that CPU utilisation is not the issue. The tests were run while Hadoop was down, and the machine was restarted after each test.
When using the problem machine to send the 20 GB file, a Hang occurred within seconds (1~10 seconds) every time (6 times in 6 tests).
While the network was up, CPU utilisation was between 10% and 15%, with idle being reported as between 80% and 85%.
The MSI-X interrupt queues were seeing between 1000 and 9000 interrupts per second while data was being transmitted, and data transmission down the wire was around 50 MB/s Rx and 90 MB/s Tx.
When using the problem machine to receive the 20 GB file, a Hang only occurred twice during 6 tests.
When there was a Hang, CPU utilisation was between 3% and 5%, with idle being around 90%.
The MSI-X interrupt queues were seeing between 1000 interrupts and either 4000 or 16000 interrupts, depending on the queue.
The data sent down the wire while it was working was around 100-110 MB/s Rx and around 2 MB/s Tx.
When there was not a Hang, the network still became unresponsive in each of the 4 remaining tests, although no error was reported or displayed in the logs or on screen.
During these tests, it took anywhere from 10 seconds to 2 minutes for the network to become unresponsive.
Processor utilisation looked similar, with around 4-5% being used and around 90% reported idle.
Interrupts were again between 1000 and 4000/16000 as above. Rx and Tx were also as above.
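In case it helps anyone reproduce the interrupt figures, per-queue rates like the ones quoted above can be estimated by sampling /proc/interrupts twice and dividing the difference by the interval. The sketch below is one way to do that. The helper names are mine, and the interface name "eth0" is an assumption, so substitute whatever name the igb queues appear under on your system.

```python
import time
from typing import Dict


def parse_interrupts(text: str, match: str) -> Dict[str, int]:
    """Map each IRQ description containing `match` to its count summed over CPUs."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row has one column per CPU
    totals: Dict[str, int] = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        counts = []
        for tok in fields[1:1 + ncpu]:
            if not tok.isdigit():
                break  # some rows carry fewer counters than there are CPUs
            counts.append(int(tok))
        name = " ".join(fields[1 + len(counts):])
        if match in name:
            totals[name] = sum(counts)
    return totals


def rates(before: Dict[str, int], after: Dict[str, int],
          seconds: float) -> Dict[str, float]:
    """Interrupts per second for every vector present in both samples."""
    return {k: (after[k] - before[k]) / seconds for k in after if k in before}


def sample_rates(match: str = "eth0", seconds: float = 1.0,
                 path: str = "/proc/interrupts") -> Dict[str, float]:
    """Take two samples `seconds` apart and return per-vector rates."""
    with open(path) as fh:
        first = parse_interrupts(fh.read(), match)
    time.sleep(seconds)
    with open(path) as fh:
        second = parse_interrupts(fh.read(), match)
    return rates(first, second, seconds)
```

Running sample_rates() in a loop during a transfer gives one rate per MSI-X vector, which is enough to see whether a spike precedes the hang.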
Something interesting I noted is that when I set InterruptThrottleRate to 4000 or above, I get a Hang within seconds of starting the file transfer, whereas if I set it to 3000 or less, the network stays up for between 5 and 10 minutes after starting the file transfer.
The attached files are from before and after a Hang on one of the problem machines.