Intel Wired Ethernet

Tracker: Bugs

5 e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang - ID: 1463045
Last Update: Comment added ( sf-robot )

The full log:

Mar 31 13:27:18 dy-xeon-1 kernel: nfs: server 192.168.2.1 not responding, still trying
Mar 31 13:27:18 dy-xeon-1 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Mar 31 13:27:18 dy-xeon-1 kernel: Tx Queue <0>
Mar 31 13:27:18 dy-xeon-1 kernel: TDH <1c>
Mar 31 13:27:18 dy-xeon-1 kernel: TDT <9>
Mar 31 13:27:18 dy-xeon-1 kernel: next_to_use <9>
Mar 31 13:27:18 dy-xeon-1 kernel: next_to_clean <1b>
Mar 31 13:27:18 dy-xeon-1 kernel: buffer_info[next_to_clean]
Mar 31 13:27:18 dy-xeon-1 kernel: time_stamp <3abf4>
Mar 31 13:27:18 dy-xeon-1 kernel: next_to_watch <1f>
Mar 31 13:27:18 dy-xeon-1 kernel: jiffies <3acfc>
Mar 31 13:27:18 dy-xeon-1 kernel: next_to_watch.status <0>
Mar 31 13:27:18 dy-xeon-1 kernel: nfs: server 192.168.2.1 not responding, still trying

The board is an SE7520AF2, the kernel is 2.6.15.7, and the
driver is 6.3.9-k4.

The bug occurs at random on my system, roughly once a day.

About one minute after this message appears, the machine hangs.

Cheers,
Janos
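
Editorial note: in the dump above, `time_stamp` and `jiffies` are both kernel tick counters, so their difference shows how long the oldest transmit buffer had been waiting when the driver reported the hang. The following is a minimal Python sketch of that arithmetic. The function names are hypothetical, and the tick rate (HZ) is a kernel build option that the report does not state, so the value 250 used here is only an illustrative assumption.

```python
def pending_ticks(time_stamp_hex, jiffies_hex, counter_bits=32):
    """Ticks elapsed between time_stamp and jiffies.

    Modular arithmetic keeps the result non-negative if the counter
    wrapped between the two samples. counter_bits=32 is an assumption
    that also gives the right answer for small differences on 64-bit
    counters.
    """
    modulus = 1 << counter_bits
    return (int(jiffies_hex, 16) - int(time_stamp_hex, 16)) % modulus


def pending_seconds(time_stamp_hex, jiffies_hex, hz, counter_bits=32):
    """Convert the tick difference to seconds for a given HZ."""
    return pending_ticks(time_stamp_hex, jiffies_hex, counter_bits) / hz


if __name__ == "__main__":
    # Values copied from the dump in the original report.
    print(pending_ticks("3abf4", "3acfc"))  # 264
    # HZ=250 is assumed, not taken from the report.
    print(pending_seconds("3abf4", "3acfc", hz=250))
```

Under that assumed HZ the buffer had been pending for slightly more than one second, which would fit a driver that flags a hang once a descriptor has been outstanding for about a second. With a different HZ the figure changes accordingly.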


JaniD++ ( janid ) - 2006-04-02 14:11

Priority: 5
Status: Closed
Resolution: None
Assigned to: Jesse Brandeburg
Category: e1000
Group: standalone driver
Visibility: Public


Comments ( 42 )

Date: 2007-12-28 03:20
Sender: sf-robot (SourceForge.net Site Admin)


This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 60 days (the time period specified by
the administrator of this Tracker).


Date: 2007-10-29 00:51
Sender: go_jesse (Project Admin)


This issue was unceremoniously set to closed due to the age of the last
update. If you still have this issue please re-open it or file a new bug
with the standard troubleshooting info provided.




Date: 2007-01-22 06:43
Sender: sofar


rincebrain: please open a new issue and provide us with all the usual
debugging information, so we can identify whether your problems are caused
by something else. It's becoming impossible for us to tell apart the
different issues that people have.

micw: the eeprom fix for the 82573 here:
http://e1000.sourceforge.net/wiki/index.php/Issues#82573.28V.2FL.2FE.29_TX_Unit_Hang_messages
might fix your problem. Please apply that fix and open a new issue in case
it's not fixed yet.

These error messages can be caused by various things, and there are fixes
for some of the problems. We need to knock them off one by one.

So please, open up a new issue on your specific problem.

Thanks.


Date: 2007-01-22 05:59
Sender: rincebrain


My group, the JHU ACM, has several systems on which this can be reliably
reproduced.

If you'd like, we could give you exclusive SSH access to one of these
systems - if you'd prefer something more hands on, and feel like paying for
shipping, you could also borrow it from us.

We really want this fixed, as we upgraded to gigabit a few months ago, and
this has been plaguing us ever since. :)


Date: 2007-01-22 04:58
Sender: sofar


j_olexa: you might need an eeprom fix - please visit the following URL
describing the procedure and fix.


http://e1000.sourceforge.net/wiki/index.php/Issues#82573.28V.2FL.2FE.29_TX_Unit_Hang_messages

Cheers,

Auke


Date: 2007-01-22 04:40
Sender: j_olexa


Hi,

we've just upgraded our switch to gigabit and are suddenly facing the
problem described. What we found is the following:
1) module options can help (they delay the hang), but do not solve the tx_hang
2) switching the speed to 100mbit solves the tx_hang, but that's not the
point of a gbit card :-)
3) TDH never equals TDT in our dumps, not even once

our system lspci:
0d:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
Subsystem: Super Micro Computer Inc Unknown device 108c
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 66
Region 0: Memory at e0200000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 4000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee00000 Data: 4042
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <512ns, L1 <64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
Link: Latency L0s <128ns, L1 <64us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1

0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Super Micro Computer Inc Unknown device 109a
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 177
Region 0: Memory at e0300000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 5000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <512ns, L1 <64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM unknown, Port 0
Link: Latency L0s <128ns, L1 <64us
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x1

module settings:
alias eth0 e1000
alias eth1 e1000
options e1000 XsumRX=0,0 InterruptThrottleRate=0,0 FlowControl=3,3 RxDescriptors=4096,4096 TxDescriptors=4096,4096 RxIntDelay=0,0 TxIntDelay=0,0

ethtool -e eth0
Offset Values
------ ------
0x0000 00 30 48 89 fa bc 30 0d 46 f7 f4 00 ff ff ff ff
0x0010 ff ff ff ff 6b 02 8c 10 d9 15 8c 10 86 80 de 83
0x0020 08 00 00 20 14 7e 48 00 00 10 d8 00 00 00 00 27
0x0030 c9 6c 50 31 22 07 0b 04 84 09 00 00 00 c0 06 07
0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060 00 01 00 40 1c 12 ff ff ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 35 72
0x0080 0a 00 00 0a 00 07 e9 03 33 c2 02 6f 02 98 00 00
0x0090 00 00 00 00 00 00 00 00 00 23 20 00 05 00 00 00
0x00a0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00c0 80 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff
0x00d0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00f0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0100 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0110 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0120 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0130 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0140 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0150 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0160 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0170 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0180 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0190 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x01a0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x01b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x01c0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x01d0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x01e0 00 00 00 00 20 00 00 20 c9 00 00 00 00 00 00 fd
0x01f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0200 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0210 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0220 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0230 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0240 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0250 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0260 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0270 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0280 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0290 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x02a0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x02b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x02c0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x02d0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x02e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x02f0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0300 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0310 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0320 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0330 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0340 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0350 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0360 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0370 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0380 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0390 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x03a0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x03b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x03c0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x03d0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x03e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x03f0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

ethtool -i eth0
driver: e1000
version: 7.3.20-NAPI
firmware-version: 0.15-4
bus-info: 0000:0d:00.0

we have tested the cables and they work just fine with another server
having an older version of e1000.

the messages with original default config (no module options, v 7.1.9 -
etch testing):
Jan 21 06:50:05 mailkit-ms1 kernel: Tx Queue <0>
Jan 21 06:50:05 mailkit-ms1 kernel: TDH <5b>
Jan 21 06:50:05 mailkit-ms1 kernel: TDT <5d>
Jan 21 06:50:05 mailkit-ms1 kernel: next_to_use <5d>
Jan 21 06:50:05 mailkit-ms1 kernel: next_to_clean <5b>
Jan 21 06:50:05 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 21 06:50:05 mailkit-ms1 kernel: time_stamp <12478d469>
Jan 21 06:50:05 mailkit-ms1 kernel: next_to_watch <5b>
Jan 21 06:50:05 mailkit-ms1 kernel: jiffies <12478d711>
Jan 21 06:50:05 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 21 06:50:07 mailkit-ms1 kernel: Tx Queue <0>
Jan 21 06:50:07 mailkit-ms1 kernel: TDH <5b>
Jan 21 06:50:07 mailkit-ms1 kernel: TDT <5d>
Jan 21 06:50:07 mailkit-ms1 kernel: next_to_use <5d>
Jan 21 06:50:07 mailkit-ms1 kernel: next_to_clean <5b>
Jan 21 06:50:07 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 21 06:50:07 mailkit-ms1 kernel: time_stamp <12478d469>
Jan 21 06:50:07 mailkit-ms1 kernel: next_to_watch <5b>
Jan 21 06:50:07 mailkit-ms1 kernel: jiffies <12478d905>
Jan 21 06:50:07 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 21 06:50:08 mailkit-ms1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 21 06:50:11 mailkit-ms1 kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jan 21 07:15:07 mailkit-ms1 -- MARK --
Jan 21 07:27:04 mailkit-ms1 kernel: Tx Queue <0>
Jan 21 07:27:04 mailkit-ms1 kernel: TDH <cb>
Jan 21 07:27:04 mailkit-ms1 kernel: TDT <cc>
Jan 21 07:27:04 mailkit-ms1 kernel: next_to_use <cc>
Jan 21 07:27:04 mailkit-ms1 kernel: next_to_clean <cb>
Jan 21 07:27:04 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 21 07:27:04 mailkit-ms1 kernel: time_stamp <124814c1a>
Jan 21 07:27:04 mailkit-ms1 kernel: next_to_watch <cb>
Jan 21 07:27:04 mailkit-ms1 kernel: jiffies <124814e0e>
Jan 21 07:27:04 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 21 07:27:06 mailkit-ms1 kernel: Tx Queue <0>
Jan 21 07:27:06 mailkit-ms1 kernel: TDH <cb>
Jan 21 07:27:06 mailkit-ms1 kernel: TDT <cc>
Jan 21 07:27:06 mailkit-ms1 kernel: next_to_use <cc>
Jan 21 07:27:06 mailkit-ms1 kernel: next_to_clean <cb>
Jan 21 07:27:06 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 21 07:27:06 mailkit-ms1 kernel: time_stamp <124814c1a>
Jan 21 07:27:06 mailkit-ms1 kernel: next_to_watch <cb>
Jan 21 07:27:06 mailkit-ms1 kernel: jiffies <124815002>
Jan 21 07:27:06 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 21 07:27:08 mailkit-ms1 kernel: Tx Queue <0>
Jan 21 07:27:08 mailkit-ms1 kernel: TDH <cb>
Jan 21 07:27:08 mailkit-ms1 kernel: TDT <cc>
Jan 21 07:27:08 mailkit-ms1 kernel: next_to_use <cc>
Jan 21 07:27:08 mailkit-ms1 kernel: next_to_clean <cb>

output with module options running 7.3.20:
Jan 22 04:34:47 mailkit-ms1 kernel: Tx Queue <0>
Jan 22 04:34:47 mailkit-ms1 kernel: TDH <bd5>
Jan 22 04:34:47 mailkit-ms1 kernel: TDT <bd8>
Jan 22 04:34:47 mailkit-ms1 kernel: next_to_use <bd8>
Jan 22 04:34:47 mailkit-ms1 kernel: next_to_clean <bd5>
Jan 22 04:34:47 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 22 04:34:47 mailkit-ms1 kernel: time_stamp <ffffcf81>
Jan 22 04:34:47 mailkit-ms1 kernel: next_to_watch <bd5>
Jan 22 04:34:47 mailkit-ms1 kernel: jiffies <ffffd088>
Jan 22 04:34:47 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 22 04:34:49 mailkit-ms1 kernel: Tx Queue <0>
Jan 22 04:34:49 mailkit-ms1 kernel: TDH <bd5>
Jan 22 04:34:49 mailkit-ms1 kernel: TDT <bd8>
Jan 22 04:34:49 mailkit-ms1 kernel: next_to_use <bd8>
Jan 22 04:34:49 mailkit-ms1 kernel: next_to_clean <bd5>
Jan 22 04:34:49 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 22 04:34:49 mailkit-ms1 kernel: time_stamp <ffffcf81>
Jan 22 04:34:49 mailkit-ms1 kernel: next_to_watch <bd5>
Jan 22 04:34:49 mailkit-ms1 kernel: jiffies <ffffd27b>
Jan 22 04:34:49 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 22 04:34:51 mailkit-ms1 kernel: Tx Queue <0>
Jan 22 04:34:51 mailkit-ms1 kernel: TDH <bd5>
Jan 22 04:34:51 mailkit-ms1 kernel: TDT <bd8>
Jan 22 04:34:51 mailkit-ms1 kernel: next_to_use <bd8>
Jan 22 04:34:51 mailkit-ms1 kernel: next_to_clean <bd5>
Jan 22 04:34:51 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 22 04:34:51 mailkit-ms1 kernel: time_stamp <ffffcf81>
Jan 22 04:34:51 mailkit-ms1 kernel: next_to_watch <bd5>
Jan 22 04:34:51 mailkit-ms1 kernel: jiffies <ffffd46f>
Jan 22 04:34:51 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 22 04:34:53 mailkit-ms1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 22 04:34:56 mailkit-ms1 kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 22 04:35:22 mailkit-ms1 kernel: Tx Queue <0>
Jan 22 04:35:22 mailkit-ms1 kernel: TDH <a9f>
Jan 22 04:35:22 mailkit-ms1 kernel: TDT <aa1>
Jan 22 04:35:22 mailkit-ms1 kernel: next_to_use <aa1>
Jan 22 04:35:22 mailkit-ms1 kernel: next_to_clean <a9f>
Jan 22 04:35:22 mailkit-ms1 kernel: buffer_info[next_to_clean]
Jan 22 04:35:22 mailkit-ms1 kernel: time_stamp <ffffeff5>
Jan 22 04:35:22 mailkit-ms1 kernel: next_to_watch <a9f>
Jan 22 04:35:22 mailkit-ms1 kernel: jiffies <fffff2db>
Jan 22 04:35:22 mailkit-ms1 kernel: next_to_watch.status <0>
Jan 22 04:35:24 mailkit-ms1 kernel: Tx Queue <0>
Jan 22 04:35:24 mailkit-ms1 kernel: TDH <a9f>
Jan 22 04:35:24 mailkit-ms1 kernel: TDT <aa1>
Jan 22 04:35:24 mailkit-ms1 kernel: next_to_use <aa1>
Jan 22 04:35:24 mailkit-ms1 kernel: next_to_clean <a9f>
Jan 22 04:35:24 mailkit-ms1 kernel: buffer_info[next_to_clean]

It's a supermicro motherboard, with 2x onboard e1000, current kernel is
2.6.18.3 - x86_64, cpu is core2duo e6400.


Date: 2007-01-02 18:18
Sender: go_jesse (Project Admin)


the TDHclean driver may well have some problems, as it has not been tested
as thoroughly as our production drivers. It is more of a proof of concept.
Unfortunately I haven't had time yet to figure out a way to integrate it
into our production code. Your info is very useful however, as it does
point to some problem in the TDH based clean up code.

We still don't have any systems here to reproduce this error (i.e. it is
fairly rare, and system dependent)



Date: 2006-12-31 14:02
Sender: lubosdolezel


Tdh driver has fixed this problem for me but has introduced another one.
Sometimes (depending on the network load) the interface just stops
transmitting/receiving any packets but there are no errors in the log. It
is enough to wait for about 30 seconds or to do "ip link set eth0 down; ip
link set eth0 up" and it starts working again.

When pinging the Intel card, it looks like this:

64 bytes from ares (10.10.10.2): icmp_seq=315 ttl=64 time=0.110 ms
64 bytes from ares (10.10.10.2): icmp_seq=316 ttl=64 time=0.152 ms
64 bytes from ares (10.10.10.2): icmp_seq=317 ttl=64 time=0.149 ms
64 bytes from ares (10.10.10.2): icmp_seq=318 ttl=64 time=0.094 ms // now it stops working
64 bytes from ares (10.10.10.2): icmp_seq=341 ttl=64 time=0.153 ms
64 bytes from ares (10.10.10.2): icmp_seq=342 ttl=64 time=0.098 ms
64 bytes from ares (10.10.10.2): icmp_seq=343 ttl=64 time=0.138 ms
64 bytes from ares (10.10.10.2): icmp_seq=344 ttl=64 time=0.099 ms
64 bytes from ares (10.10.10.2): icmp_seq=345 ttl=64 time=0.156 ms
64 bytes from ares (10.10.10.2): icmp_seq=346 ttl=64 time=0.140 ms

Note that this is definitely not caused by a faulty UTP cable or anything
like that.
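
Editorial note: the silent outage in the ping output above is visible only as a jump in `icmp_seq`. A small Python sketch such as the following can flag those jumps in saved ping output. The function name is hypothetical, and the conversion from missing replies to seconds assumes ping's usual one-second interval, which the comment does not state.

```python
import re

# Extracts the sequence number from lines like "... icmp_seq=318 ttl=64 ...".
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_gaps(ping_output):
    """Return (last_seen, next_seen, missing_count) for each jump in icmp_seq."""
    seqs = [int(m) for m in SEQ_RE.findall(ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur, cur - prev - 1))
    return gaps


if __name__ == "__main__":
    sample = """\
64 bytes from ares (10.10.10.2): icmp_seq=317 ttl=64 time=0.149 ms
64 bytes from ares (10.10.10.2): icmp_seq=318 ttl=64 time=0.094 ms
64 bytes from ares (10.10.10.2): icmp_seq=341 ttl=64 time=0.153 ms
64 bytes from ares (10.10.10.2): icmp_seq=342 ttl=64 time=0.098 ms
"""
    print(find_gaps(sample))  # [(318, 341, 22)]
```

With a one-second interval, 22 missing replies would correspond to an outage of a bit over 20 seconds, in the same range as the "about 30 seconds" recovery time mentioned in the comment.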


Date: 2006-12-16 04:43
Sender: rincebrain


dr_kludge,

I used the driver there, but I presumed that the 7.3.15tdh driver differed
from the -TAPI driver noted in log.

Using the tdh driver, the 1m long hangs with NFS stress have gone
away...but the machine now hardlocks once an hour or so instead, and
occasionally stops responding to anything on that interface without
printing or noting anything is wrong - a reload of the module fixes this.

The hardlock can't be fixed short of a SysRq-B or power-cycle. The driver
prints nothing.


Date: 2006-12-16 03:07
Sender: dr_kludge (SourceForge.net Subscriber)


rincebrain,

As Jesse noted in a prior post, scroll down to the bottom of this bug
report and hit the download link. I posted my steps to install the driver
in the Red Hat bug report here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656#c10

> The long string of modules seems like a kludgy ...
Tee hee, get it now? Dr. Kludge. >:-> However, if you want to tune how
the e1000 driver works in clustering environments or other high-performance
situations, then the modprobe option list is the way to pass the tuning
parameters to the driver. I recall reading that the cluster column at
http://www.linux-mag.com/ reported measurable performance gains from using
these parameters. Search for e1000 or Intel to find the article. Moreover, as
I reported here
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656#c11 , I have
been running the stock driver since 10/18/2006 without the modprobe
options. The logs have been clean all this time.

Finally, as you noted, all of the symptoms you reported will be fixed by the
driver because your TDH=TDT. Please give the driver a try and report back
whether that solves your problem too.

Regards,
Greg


Date: 2006-12-10 11:06
Sender: rincebrain


Jumping on this bug, as we've been reproducing this bug on three different
82540EM (8086:100e) cards, using driver versions 7.0.33-k2 and 7.3.15.
Using the long chain of module options posted below (XsumRX=0 Speed=1000
Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096
TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0), most of the errors are
relieved, and for those that remain, TDH=TDT, so we suspect that the
newly-installed 7.3.15-tdhdump driver will remedy them.

Three things I wanted to note in this driver report:
1) Simply disabling TSO (via ethtool) did nothing to help this problem.
2) I can reliably cause these messages by burning a DVD image being shared
over NFS between gigabit NICs.
3) The long string of modules seems like a kludgy way to fix this problem
- is there a less user-interactive fix in the works for this problem, or do
you still require more data to reliably reproduce it before being able to
fix it?


Date: 2006-10-18 23:57
Sender: go_jesse (Project Admin)

Logged In: YES
user_id=631160

I just added an attachment to this bug that is the driver
dr_kludge mentioned.

It is not our final version of the fix, and probably will
only help people that have the signature in their traces of
TDH <cb>
TDT <cb>
where TDH equals TDT.

if your TDH does not equal TDT then it is likely you are
having a hardware problem for some reason or another.
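
Editorial note: the TDH/TDT rule of thumb in this comment can be applied mechanically to a pasted syslog excerpt. The following is a minimal Python sketch, not part of the e1000 driver. The function names are hypothetical, and it assumes the dump lines look like the ones quoted in this thread (for example `TDH <cb>`).

```python
import re

# Matches the two ring-pointer lines of a hang dump, e.g. "kernel: TDH <cb>".
FIELD_RE = re.compile(r"\b(TDH|TDT)\s*<([0-9a-fA-F]+)>")


def parse_hang_dumps(log_text):
    """Return a list of (tdh, tdt) integer pairs, one per TDH/TDT pair found."""
    dumps = []
    tdh = None
    for name, value in FIELD_RE.findall(log_text):
        if name == "TDH":
            tdh = int(value, 16)
        elif tdh is not None:
            # A TDT line that follows a TDH line completes one dump.
            dumps.append((tdh, int(value, 16)))
            tdh = None
    return dumps


def classify(tdh, tdt):
    """Apply the rule of thumb from the comment above to one dump."""
    if tdh == tdt:
        return "TDH == TDT: the signature the attached test driver targets"
    return "TDH != TDT: per the comment above, likely a hardware-side problem"


if __name__ == "__main__":
    sample = """\
Jul 19 14:42:39 pc14 kernel: TDH <cb>
Jul 19 14:42:39 pc14 kernel: TDT <d0>
"""
    for tdh, tdt in parse_hang_dumps(sample):
        print(hex(tdh), hex(tdt), classify(tdh, tdt))
```

The classification only restates the heuristic given in the comment. It is a triage aid for sorting reports, not a diagnosis.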


Date: 2006-10-18 07:29
Sender: dr_kludge (SourceForge.net Subscriber)

Logged In: YES
user_id=244471

Update: Stock modprobe options are now being used as noted
here
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656#c11
. The solution driver continues to be a success.

Greg


Date: 2006-10-16 08:21
Sender: jheissler

Logged In: YES
user_id=1621878

I've got similar problems and I'd like to test the new
driver. Where can I get it?


Date: 2006-10-15 22:01
Sender: dr_kludge (SourceForge.net Subscriber)

Logged In: YES
user_id=244471

Update: I am testing a new driver as noted here
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656#c10
.

Greg


Date: 2006-09-07 00:56
Sender: tonychung00

Logged In: YES
user_id=1499129

It was an accident that the NIC was not fully inserted in the
PCI slot. It took me a while to find out because I could
actually boot the system and ethtool -p was working! I
agree this is not a valid test case.

However, I think it proves that "Tx unit hang" can be
caused by hardware failure. It may be a good case, since it lets you
reproduce the problem much more easily than trying with the heavy
traffic that so many people have complained about. I am thinking
this is one way to inject an error so the e1000 driver can be
improved to handle it properly.

Maybe people should simply replace their NICs or
switch/cable first to check whether "Tx Unit Hang" is caused by
hardware failure.


Date: 2006-09-06 23:32
Sender: go_jesse (Project Admin)

Logged In: YES
user_id=631160

tonychung00: that is a completely invalid test and you
cannot realistically expect *any* adapter to work right if
it is not plugged in. You could be shorting out pins to
ground or +5V. Honestly, I'm surprised you didn't fry
your slot.


Date: 2006-09-06 22:33
Sender: tonychung00

Logged In: YES
user_id=1499129

I can reproduce "Tx unit Hang" with an ethernet card that is not
fully inserted into the PCI slot.
The system boots up and recognizes the ethernet port; then
a ping causes a lockup so bad that the sysrq keys do
not work. I have to power cycle the system.

Why does it stop NMI from working?
I believe the e1000 driver does a full reset on the NIC, so why
does it still cause the system to hang?


Date: 2006-09-05 17:17
Sender: shawvrana

Logged In: YES
user_id=485554

I'm getting the same error on a couple boxes. Tried
upgrading to the latest driver on SF, but that box crashed
over the weekend too.


Date: 2006-09-02 15:03
Sender: avilespa

Logged In: YES
user_id=112643

I am hoping this is the place to debug this. I am
getting the e1000 TX unit hang on a Tyan GS12 server
(82541GI/PI and 82547GI) when the gigabit card is put under some
load. It does not take much load, and it hangs the system
completely, requiring a power cycle to restart.

I am including the output of lspci and ethtool -e here..

http://sourceforge.net/tracker/download.php?group_id=42302&atid=447449&file_id=191714&aid=1551045

http://sourceforge.net/tracker/download.php?group_id=42302&atid=447449&file_id=191713&aid=1551045



Date: 2006-07-31 07:13
Sender: dr_kludge (SourceForge.net Subscriber)

Logged In: YES
user_id=244471

These /etc/modprobe.conf settings bring some relief as noted
here
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656

The options line has to be one contiguous line.

...
alias eth0 e1000
#
# Attempt to fix e1000_clean_tx_irq: Detected Tx Unit Hang
# http://www.2cpu.com/forums/showthread.php?t=75798
# http://www.gatago.com/linux/kernel/14660762.html
# http://lkml.org/lkml/2005/12/19/144
# http://support.intel.com/support/network/sb/CS-009209.htm
# http://support.intel.com/support/network/sb/cs-009918.htm
# ftp://download.intel.com/design/network/applnots/ap450.pdf
# http://agenda.clustermonkey.net/index.php/Tuning_Intel_e1000_NICs
# http://downloadmirror.intel.com/df-support/9180/ENG/README.txt
#
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

Regards,
Greg


Date: 2006-07-30 03:02
Sender: dr_kludge (SourceForge.net Subscriber)

Logged In: YES
user_id=244471

I added a new Fedora bug against the Xen0 kernel and pointed
it to this tracker. Please see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656
. Some of the information should be moved "upstream" to
this tracker. However, RH/Fedora folks need to protect
themselves from additional bug reports in FC6 if "tso" is
not turned off on all unsupported gigabit chips. Intel
needs to provide a list of these chips on this site and in the
Fedora bug report, especially if there is no intention to fix
the problem because of end-of-life hardware policy issues.

Greg



Date: 2006-07-19 19:39
Sender: go_jesse (Project Admin)

Logged In: YES
user_id=631160

d_sergienko, you have a different problem from micw's, although it also
manifests as a tx hang.

micw, you need an updated eeprom for your 82573.




Date: 2006-07-19 13:36
Sender: d_sergienko

Logged In: YES
user_id=143657

Sure, running tests for 30 minutes has shown hangs :(



Date: 2006-07-19 12:45
Sender: micw

Logged In: YES
user_id=402991

Hi,

I cannot confirm that "InterruptThrottleRate=0" solves this
issue. It seems to run a bit better (the first error occurred
after about a minute instead of after seconds). But it still
occurred a few times during my "rsync test".

Here's the syslog output with "InterruptThrottleRate=0":


Jul 19 14:42:39 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul 19 14:42:39 pc14 kernel: Tx Queue <0>
Jul 19 14:42:39 pc14 kernel: TDH <cb>
Jul 19 14:42:39 pc14 kernel: TDT <d0>
Jul 19 14:42:39 pc14 kernel: next_to_use <d0>
Jul 19 14:42:39 pc14 kernel: next_to_clean <cb>
Jul 19 14:42:39 pc14 kernel: buffer_info[next_to_clean]
Jul 19 14:42:39 pc14 kernel: time_stamp <595e89>
Jul 19 14:42:39 pc14 kernel: next_to_watch <cb>
Jul 19 14:42:39 pc14 kernel: jiffies <596073>
Jul 19 14:42:39 pc14 kernel: next_to_watch.status <0>
Jul 19 14:42:41 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul 19 14:42:41 pc14 kernel: Tx Queue <0>
Jul 19 14:42:41 pc14 kernel: TDH <cb>
Jul 19 14:42:41 pc14 kernel: TDT <d0>
Jul 19 14:42:41 pc14 kernel: next_to_use <d0>
Jul 19 14:42:41 pc14 kernel: next_to_clean <cb>
Jul 19 14:42:41 pc14 kernel: buffer_info[next_to_clean]
Jul 19 14:42:41 pc14 kernel: time_stamp <595e89>
Jul 19 14:42:41 pc14 kernel: next_to_watch <cb>
Jul 19 14:42:41 pc14 kernel: jiffies <596267>
Jul 19 14:42:41 pc14 kernel: next_to_watch.status <0>
Jul 19 14:42:43 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul 19 14:42:43 pc14 kernel: Tx Queue <0>
Jul 19 14:42:43 pc14 kernel: TDH <cb>
Jul 19 14:42:43 pc14 kernel: TDT <d0>
Jul 19 14:42:43 pc14 kernel: next_to_use <d0>
Jul 19 14:42:43 pc14 kernel: next_to_clean <cb>
Jul 19 14:42:43 pc14 kernel: buffer_info[next_to_clean]
Jul 19 14:42:43 pc14 kernel: time_stamp <595e89>
Jul 19 14:42:43 pc14 kernel: next_to_watch <cb>
Jul 19 14:42:43 pc14 kernel: jiffies <59645b>
Jul 19 14:42:43 pc14 kernel: next_to_watch.status <0>
Jul 19 14:42:45 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul 19 14:42:45 pc14 kernel: Tx Queue <0>
Jul 19 14:42:45 pc14 kernel: TDH <cb>
Jul 19 14:42:45 pc14 kernel: TDT <d0>
Jul 19 14:42:45 pc14 kernel: next_to_use <d0>
Jul 19 14:42:45 pc14 kernel: next_to_clean <cb>
Jul 19 14:42:45 pc14 kernel: buffer_info[next_to_clean]
Jul 19 14:42:45 pc14 kernel: time_stamp <595e89>
Jul 19 14:42:45 pc14 kernel: next_to_watch <cb>
Jul 19 14:42:45 pc14 kernel: jiffies <59664f>
Jul 19 14:42:45 pc14 kernel: next_to_watch.status <0>
Jul 19 14:42:46 pc14 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 19 14:42:48 pc14 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
Jul 19 14:42:48 pc14 kernel: e1000: eth0: e1000_watchdog_task: 10/100 speed: disabling TSO



Date: 2006-07-19 11:26
Sender: d_sergienko

Logged In: YES
user_id=143657

The bug is 100% reproducible on 82541GI, kernel 2.6.16.20,
driver versions 6.1.16.2.DB, 6.3.9-k4, 7.1.9.

To reproduce the error, just flood the machine with packets. I do
this from a FreeBSD 6.1 host:

# ping -i 0 -q <hostname>

It generates about 30-35 kpps of traffic.

Disabling Interrupt Throttling solves the problem.

options e1000 InterruptThrottleRate=0

TSO has been left intact and is on by default.



Date: 2006-07-12 07:14
Sender: micw

Logged In: YES
user_id=402991

Hi,

I uploaded the output of the debug driver to:
http://wyraz.de/files/niclog
Hope that helps.

output of ethtool -e eth0:
Offset Values
------ ------
0x0000 00 16 76 21 bd dd 30 0b 46 f7 01 10 ff ff ff ff
0x0010 ff ff ff ff 6b 02 90 30 86 80 8b 10 86 80 de 80
0x0020 00 00 00 20 14 7e 00 00 00 00 d8 00 00 00 00 27
0x0030 c9 6c 50 31 22 07 0b 04 84 09 00 00 00 c0 06 07
0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060 00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 0e


Yesterday i upgraded to Kernel 2.6.17-1.2145_FC5smp (Fedora)
- the problem is still there.

Michael


Date: 2006-07-11 18:35
Sender: go_jesse (Project Admin)

Logged In: YES
user_id=631160

micw: please include the output of ethtool -e ethX for your
82573V


Date: 2006-06-14 07:47
Sender: micw

Logged In: YES
user_id=402991

Hi,

Yesterday I upgraded my kernel to 2.6.16-1.2133_FC5smp -
the problem still exists.
I could reproduce it today by simply rsyncing my /opt to a
different pc (on the same 100mbit switch).
I removed "/sbin/ethtool -K eth0 tso off". Today the
connection recovered quickly from the failure again.

Now I compiled the driver with the patch. It's significantly
larger than the original.

The problem is also reproducible with the debug driver (of
course :) ). I have a 70kb logfile but I cannot attach files
here.



Date: 2006-06-13 14:38
Sender: sofar

Logged In: YES
user_id=126698

`emerge --sync uses rsync`

Unfortunately we're unable to work with rsync over the
internet here due to firewall restrictions. Can you give me some
idea what kind of rsync connection is established?

rsync only uses a single connection over port 873. It's
amazing that this causes the tx_hang error: most people
report tx_hang with ten or more concurrent connections on
heavily loaded servers. I would really like to know the
payload of the physical rsync call!

Another thing to try (for debugging output) is driver 7.0.33
with this patch:

http://sourceforge.net/tracker/download.php?group_id=42302&atid=447451&file_id=172710&aid=1460945

Since you are able to reproduce the problem so quickly, I
hope you can give this patch a try to get us some debugging
output.


Date: 2006-06-13 08:13
Sender: micw

Logged In: YES
user_id=402991

In addition to the last message, something has changed after
upgrading to the sf.net driver (from original kernel 2.6.16
driver) and setting "tso off":

Before this change, the system hung a few seconds and
resumed fine. Today, my nfs shares and my network stopped
working totally. After restarting the network, nfs came up
after about 1 minute.


Date: 2006-06-13 08:10
Sender: micw

Logged In: YES
user_id=402991

Hi,
sf.net rejected my answer because of a missing postmaster
address, so I am reposting it here.

emerge --sync uses rsync to synchronize more than 100,000
small files.

You can reproduce this from any Linux system with a Gentoo
chroot:

- download a Gentoo tarball
    http://distro.ibiblio.org/pub/linux/distributions/gentoo/releases/x86/current/stages/stage3-x86-2006.0.tar.bz2
- mkdir and extract
    mkdir gentoo; cd gentoo
    tar xfjp ../stage3-x86-2006.0.tar.bz2
- prepare and chroot
    cp /etc/resolv.conf etc/
    mount /dev dev -o bind
    mount none proc -t proc
    chroot . /bin/bash
- run the sync
    emerge --sync

To run the full sync again, you have to remove usr/portage first.

The process takes about 3% CPU (dual-core P4 @ 3 GHz).
The computer is connected to a 100 Mbit switch.
The card detects this correctly.

The error occurs after a few seconds or minutes (the whole
sync takes about 15-30 minutes, which is enough time to
reproduce it).

The system I'm running is Fedora Core 5 with the latest stable
SMP kernel, 2.6.16-1.2122_FC5smp.
I recompiled the e1000 module from SourceForge (latest stable).
On startup I run /sbin/ethtool -K eth0 tso off, which seems
to have no effect.

Here are the kernel messages:

Jun 13 09:20:35 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 13 09:20:35 pc14 kernel: Tx Queue <0>
Jun 13 09:20:35 pc14 kernel: TDH <28>
Jun 13 09:20:35 pc14 kernel: TDT <2a>
Jun 13 09:20:35 pc14 kernel: next_to_use <2a>
Jun 13 09:20:35 pc14 kernel: next_to_clean <28>
Jun 13 09:20:35 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:35 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:35 pc14 kernel: next_to_watch <28>
Jun 13 09:20:35 pc14 kernel: jiffies <f9ba>
Jun 13 09:20:35 pc14 kernel: next_to_watch.status <0>
Jun 13 09:20:37 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 13 09:20:37 pc14 kernel: Tx Queue <0>
Jun 13 09:20:37 pc14 kernel: TDH <28>
Jun 13 09:20:37 pc14 kernel: TDT <2a>
Jun 13 09:20:37 pc14 kernel: next_to_use <2a>
Jun 13 09:20:37 pc14 kernel: next_to_clean <28>
Jun 13 09:20:37 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:37 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:37 pc14 kernel: next_to_watch <28>
Jun 13 09:20:37 pc14 kernel: jiffies <fbae>
Jun 13 09:20:37 pc14 kernel: next_to_watch.status <0>
Jun 13 09:20:39 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 13 09:20:39 pc14 kernel: Tx Queue <0>
Jun 13 09:20:39 pc14 kernel: TDH <28>
Jun 13 09:20:39 pc14 kernel: TDT <2a>
Jun 13 09:20:39 pc14 kernel: next_to_use <2a>
Jun 13 09:20:39 pc14 kernel: next_to_clean <28>
Jun 13 09:20:39 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:39 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:39 pc14 kernel: next_to_watch <28>
Jun 13 09:20:39 pc14 kernel: jiffies <fda2>
Jun 13 09:20:39 pc14 kernel: next_to_watch.status <0>
Jun 13 09:20:41 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 13 09:20:41 pc14 kernel: Tx Queue <0>
Jun 13 09:20:41 pc14 kernel: TDH <28>
Jun 13 09:20:41 pc14 kernel: TDT <2a>
Jun 13 09:20:41 pc14 kernel: next_to_use <2a>
Jun 13 09:20:41 pc14 kernel: next_to_clean <28>
Jun 13 09:20:41 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:41 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:41 pc14 kernel: next_to_watch <28>
Jun 13 09:20:41 pc14 kernel: jiffies <ff97>
Jun 13 09:20:41 pc14 kernel: next_to_watch.status <0>
Jun 13 09:20:43 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 13 09:20:43 pc14 kernel: Tx Queue <0>
Jun 13 09:20:43 pc14 kernel: TDH <28>
Jun 13 09:20:43 pc14 kernel: TDT <2a>
Jun 13 09:20:43 pc14 kernel: next_to_use <2a>
Jun 13 09:20:43 pc14 kernel: next_to_clean <28>
Jun 13 09:20:43 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:43 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:43 pc14 kernel: next_to_watch <28>
Jun 13 09:20:43 pc14 kernel: jiffies <1018c>
Jun 13 09:20:43 pc14 kernel: next_to_watch.status <0>
Jun 13 09:20:44 pc14 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 13 09:20:44 pc14 kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
Jun 13 09:20:44 pc14 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 13 09:20:44 pc14 kernel: Tx Queue <0>
Jun 13 09:20:44 pc14 kernel: TDH <0>
Jun 13 09:20:44 pc14 kernel: TDT <31>
Jun 13 09:20:44 pc14 kernel: next_to_use <31>
Jun 13 09:20:44 pc14 kernel: next_to_clean <28>
Jun 13 09:20:44 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:44 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:44 pc14 kernel: next_to_watch <28>
Jun 13 09:20:44 pc14 kernel: jiffies <10282>
Jun 13 09:20:44 pc14 kernel: next_to_watch.status <0>
Jun 13 09:20:44 pc14 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down
Jun 13 09:20:46 pc14 kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex


There is another workstation with an identical NIC but a
different mainboard, which is connected to a different switch
(also 100 Mbit). The problem also occurs there.

If you need information about the boards or NICs, please
tell me.
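
For readers comparing dumps, a small script can pull the hex fields out of one "Detected Tx Unit Hang" block and show how far apart the counters are. This is only a sketch in Python: the names parse_hang_dump and stall_jiffies are made up for illustration, and reading time_stamp as "jiffies when the buffer was queued" is an assumption about the driver, not something confirmed in this thread.

```python
import re

# Hex fields printed by the e1000 "Detected Tx Unit Hang" dump.
FIELDS = ("TDH", "TDT", "next_to_use", "next_to_clean",
          "time_stamp", "next_to_watch", "jiffies")


def parse_hang_dump(text):
    """Return the hex fields of the first Tx hang dump in text as integers."""
    values = {}
    for name in FIELDS:
        # Matches e.g. "TDH <28>". Requiring "<" right after the name
        # skips the "next_to_watch.status <0>" line.
        match = re.search(r"\b" + re.escape(name) + r"\s*<([0-9a-fA-F]+)>", text)
        if match:
            values[name] = int(match.group(1), 16)
    return values


def stall_jiffies(dump):
    """Jiffies elapsed between time_stamp and the dump (no wrap handling)."""
    return dump["jiffies"] - dump["time_stamp"]


# First dump from the log above, register lines only.
sample = """\
Jun 13 09:20:35 pc14 kernel: Tx Queue <0>
Jun 13 09:20:35 pc14 kernel: TDH <28>
Jun 13 09:20:35 pc14 kernel: TDT <2a>
Jun 13 09:20:35 pc14 kernel: next_to_use <2a>
Jun 13 09:20:35 pc14 kernel: next_to_clean <28>
Jun 13 09:20:35 pc14 kernel: buffer_info[next_to_clean]
Jun 13 09:20:35 pc14 kernel: time_stamp <f8a4>
Jun 13 09:20:35 pc14 kernel: next_to_watch <28>
Jun 13 09:20:35 pc14 kernel: jiffies <f9ba>
Jun 13 09:20:35 pc14 kernel: next_to_watch.status <0>
"""

dump = parse_hang_dump(sample)
print(dump["TDT"] - dump["TDH"], "descriptor(s) between TDH and TDT")
print(stall_jiffies(dump), "jiffies between time_stamp and the dump")
```

For the first dump this gives 2 descriptors between TDH and TDT and 278 jiffies between time_stamp and jiffies. In the later dumps time_stamp stays at f8a4 while jiffies keeps growing, which is consistent with the same buffer never completing.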


Date: 2006-06-12 18:01
Sender: sofar

Logged In: YES
user_id=126698

Requested that the user describe the system load during
`emerge --sync`, to figure out whether that load can be
reproduced on non-Gentoo Linux systems.


Date: 2006-06-12 17:30
Sender: micw

Logged In: YES
user_id=402991

Same problem here with the latest driver (e1000-7.0.41) and also
with the current 2.6.16 driver, on both Gentoo and Fedora 5.

My hardware is an onboard 82573V (Intel motherboard).

I can reproduce the problem if I extract a new Gentoo
stage3, chroot and run "emerge --sync". This causes the
error to occur several times.

Now I tried with "ethtool -K eth0 tso off":
e1000: eth0: e1000_set_tso: TSO is Disabled
Then I ran emerge --sync:
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

So the workaround is not working :-(

Michael.




Date: 2006-06-02 17:29
Sender: sofar

Logged In: YES
user_id=126698

Update: we have spent a lot of time trying to resolve this
issue. Although we have some idea what kind of hardware this
happens on, we are still unable to reproduce it and no known
fix exists.

As a workaround we advise people to turn off TSO for the
device using ethtool. That should remove the tx hangs.
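
The workaround can be scripted so that it is applied the same way on every boot. The following is a minimal Python sketch, not part of the driver or of ethtool: the helper names tso_command and show_offload_command are hypothetical, and eth0 is just the interface name used elsewhere in this thread. It only builds the ethtool command lines ("-K" changes an offload setting, "-k" shows the current ones).

```python
def tso_command(interface, enable):
    """Build the ethtool argv that switches TSO on or off for one interface."""
    state = "on" if enable else "off"
    return ["ethtool", "-K", interface, "tso", state]


def show_offload_command(interface):
    """Build the ethtool argv that lists the current offload settings."""
    return ["ethtool", "-k", interface]


# Print the commands instead of running them, so the sketch is safe to try.
print(" ".join(tso_command("eth0", enable=False)))
print(" ".join(show_offload_command("eth0")))
```

To actually apply the setting, the list returned by tso_command can be passed to subprocess.run(..., check=True) as root; the second command is then a way to confirm that TSO is reported as off.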


Date: 2006-06-02 17:27
Sender: sofar

Logged In: YES
user_id=126698

Marked #978449 as a duplicate of this one.


Date: 2006-05-10 20:45
Sender: stevewin64

Logged In: YES
user_id=1475325

I am seeing this error using a Pro/1000 PT PCIE adapter
with Linux kernel versions 2.6.17-rc* (e1000 driver
version 7.0.33-k2)

I tried the 'acpi=off noacpi' options mentioned in this
thread and it did not make any difference. Enabling NAPI
also did not make a difference.

Here is some output with debug on (note there is another
interface eth1 (non-e1000) also being brought up - any
successful messages relate to that interface):

e1000_check_for_link
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_check_downshift
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_config_dsp_after_link_change
e1000_config_collision_dist
e1000_config_fc_after_link_up
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
Flow Control = FULL.

e1000_get_speed_and_duplex
100 Mbs,
Full Duplex

e1000_force_mac_fc
e1000_get_speed_and_duplex
100 Mbs,
Full Duplex

e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps
Full Duplex
e1000: eth0: e1000_watchdog_task: 10/100 speed: disabling
TSO
e1000_update_adaptive
Sending DHCP requests .<6>eth1: Link is Up. Speed is 100
Mbps Full Duplex
e1000_phy_get_info
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_phy_igp_get_info
e1000_check_polarity
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_check_for_link
e1000_update_adaptive
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <0>
TDT <1>
next_to_use <1>
next_to_clean <0>
buffer_info[next_to_clean]
time_stamp <fffee3a5>
next_to_watch <0>
jiffies <fffee4d0>
next_to_watch.status <0>
.<2>e1000_check_for_link
e1000_update_adaptive
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <0>
TDT <1>
next_to_use <1>
next_to_clean <0>
buffer_info[next_to_clean]
time_stamp <fffee3a5>
next_to_watch <0>
jiffies <fffee6c5>
next_to_watch.status <0>
, OK
IP-Config: Got DHCP answer from 9.42.234.13, my address is
9.42.235.101
U3-MPIC: disable_irq: 3 (src 3)
e1000_reset_hw
e1000_disable_pciex_master
e1000_set_pci_express_master_disable
Master requests are pending.

PCI-E Master disable polling has failed.

Masking off all interrupts

Issuing a global reset to MAC

e1000_get_auto_rd_done
Masking off all interrupts

e1000_init_hw
e1000_id_led_init
e1000_read_eeprom
e1000_is_onboard_nvm_eeprom
e1000_acquire_eeprom
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_spi_eeprom_ready
e1000_release_eeprom
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_set_media_type
Initializing the IEEE VLAN

e1000_init_rx_addrs
Programming MAC Address into RAR[0]

Clearing RAR[1-15]

Zeroing the MTA

e1000_setup_link
After fix-ups FlowControl is now = 3

e1000_setup_copper_link
e1000_copper_link_preconfig
e1000_detect_gig_phy
Phy ID = 2a80380

e1000_set_phy_mode
e1000_copper_link_igp_setup
e1000_phy_reset
e1000_phy_hw_reset
Resetting Phy...

e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_get_phy_cfg_done
e1000_release_software_semaphore
e1000_phy_init_script
e1000_set_d3_lplu_state
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_set_d0_lplu_state
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_copper_link_autoneg
Reconfiguring auto-neg advertisement params

e1000_phy_setup_autoneg
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
autoneg_advertised 2f

Advertise 10mb Half duplex

Advertise 10mb Full duplex

Advertise 100mb Half duplex

Advertise 100mb Full duplex

Advertise 1000mb Full duplex

e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
Auto-Neg Advertising de1

e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
Restarting Auto-Neg

e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_write_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
Unable to establish link!!!

Initializing the Flow Control address, type and timer regs

e1000_reset_adaptive
e1000_phy_get_info
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
...

Not sure if it is related, but when I turn ip=off and try
to configure the interface after Linux has booted, I do
not get the Tx hang error, but the interface still does
not come up - just stuck in a loop checking for link:

[root@stevewin4 root]# ifconfig eth0 9.42.235.101 netmask
255.255.252.0 up
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_config_collision_dist
c000000000547000: U3-MPIC: enable_irq: 3 (src 3)
e1000_check_for_link
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_check_downshift
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_config_dsp_after_link_change
e1000_config_collision_dist
[root@stevewin4 e1000_config_fc_after_link_up
root]# e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
Flow Control = FULL.

e1000_get_speed_and_duplex
100 Mbs,
Full Duplex

e1000_force_mac_fc
e1000_get_speed_and_duplex
100 Mbs,
Full Duplex

e1000_update_adaptive
e1000_phy_get_info
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_phy_igp_get_info
e1000_check_polarity
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_read_phy_reg
e1000_swfw_sync_acquire
e1000_get_hw_eeprom_semaphore
e1000_write_phy_reg_ex
e1000_read_phy_reg_ex
e1000_swfw_sync_release
e1000_put_hw_eeprom_semaphore
e1000_check_for_link
e1000_update_adaptive
e1000_check_for_link
e1000_update_adaptive
e1000_check_for_link
e1000_update_adaptive
e1000_check_for_link
e1000_update_adaptive
e1000_check_for_link
e1000_update_adaptive
e1000_check_for_link
e1000_update_adaptive
e1000_check_for_link
e1000_update_adaptive

Note that when a PCIE-PCI bridge is plugged into the PCIE
slot, and a PCI PRO/1000 card is plugged into the PCI slot
on the bridge everything works fine.


Date: 2006-05-04 08:49
Sender: denismb

Logged In: YES
user_id=1514651

I have exactly the same problem on my TravelMate 8204. I
"solved" the problem by booting with the options:
"acpi=off noacpi"

Still, losing all the ACPI functionality is not a perfect
solution on a laptop.



Date: 2006-04-27 11:04
Sender: degerhar

Logged In: YES
user_id=231130

Hi. I'm running kernel 2.6.16.2, Acer Travelmate 8204

lspci gives: Unknown device 108c (rev 03)

I keep getting e1000: eth0: e1000_clean_tx_irq: Detected Tx
Unit Hang.

It seems that this only happens when the card is
experiencing some load. If I use another device for
accessing the internet, it's fine and I can continue to work
and ssh to other boxes.

Any ideas on how to fix this? I'm using e1000 v. 7.0.33

Thanx


Date: 2006-04-21 10:08
Sender: nimbob

Logged In: YES
user_id=901594

I also have this problem under both 6.1.16-k2 and 7.0.33

Is this problem likely to be fixed? What's the situation?


Date: 2006-04-05 20:27
Sender: janid

Logged In: YES
user_id=1313719

Hi!

I have new messages, with the latest (7.0.33) driver too!

e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <62>
TDT <4e>
next_to_use <4e>
next_to_clean <61>
buffer_info[next_to_clean]
time_stamp <46b25f>
next_to_watch <65>
jiffies <46b2ec>
next_to_watch.status <0>
nfs: server 192.168.2.1 not responding, still trying
nfs: server 192.168.2.1 not responding, still trying
nfs: server 192.168.2.1 not responding, still trying
nfs: server 192.168.2.1 not responding, still trying
e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <62>
TDT <4e>
next_to_use <4e>
next_to_clean <61>
buffer_info[next_to_clean]
time_stamp <46b25f>
next_to_watch <65>
jiffies <46b3b4>
next_to_watch.status <0>
nfs: server 192.168.2.1 not responding, still trying
e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <62>
TDT <4e>
next_to_use <4e>
next_to_clean <61>
buffer_info[next_to_clean]
time_stamp <46b25f>
next_to_watch <65>
jiffies <46b47c>
next_to_watch.status <0>
nfs: server 192.168.2.1 not responding, still trying
NETDEV WATCHDOG: eth2: transmit timed out
e1000: eth2: e1000_watchdog: NIC Link is Up 1000 Mbps Full
Duplex
nfs: server 192.168.2.1 OK

The only difference is that with the new driver, the system
survives the issue.

And if I disable TSO, the driver becomes stable!
(I am not sure yet; this needs more testing...)

Cheers,
Janos
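
In the dumps above TDT (4e) is numerically smaller than TDH (62), which would be expected if the tail has wrapped around the end of the descriptor ring. A rough Python sketch of the arithmetic follows. The ring size of 256 is an assumption (the actual value can be checked with "ethtool -g ethX"), and the helper names descriptors_pending and stall_ages are hypothetical.

```python
RING_SIZE = 256  # assumption: Tx ring size; verify with `ethtool -g ethX`


def descriptors_pending(tdh, tdt, ring_size=RING_SIZE):
    """Descriptors from head (TDH) up to tail (TDT), allowing for ring wrap."""
    return (tdt - tdh) % ring_size


def stall_ages(time_stamp, jiffies_samples):
    """Jiffies elapsed since time_stamp at each dump (no wrap handling)."""
    return [j - time_stamp for j in jiffies_samples]


# Values from the three dumps in this comment.
print(descriptors_pending(0x62, 0x4E))
print(stall_ages(0x46B25F, [0x46B2EC, 0x46B3B4, 0x46B47C]))
```

Under the 256-entry assumption this reports 236 descriptors still queued between head and tail, and stall ages of 141, 341 and 541 jiffies: time_stamp does not move between the three dumps while jiffies advances by 200 each time.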


Attached File ( 1 )

Filename: e1000-7.3.15tdh.tar.gz
Description: this driver may help if TDH==TDT in your tx hang messages

Changes ( 9 )

Field              Old Value                       Date              By
close_date         2007-10-29 00:51                2007-12-28 03:20  sf-robot
status_id          Pending                         2007-12-28 03:20  sf-robot
status_id          Open                            2007-10-29 00:51  go_jesse
close_date         -                               2007-10-29 00:51  go_jesse
artifact_group_id  None                            2007-01-02 18:18  go_jesse
File Added         198849: e1000-7.3.15tdh.tar.gz  2006-10-18 23:54  go_jesse
artifact_group_id  v1.0 (example)                  2006-06-02 17:24  sofar
category_id        Interface (example)             2006-06-02 15:36  sofar
assigned_to        nobody                          2006-05-26 17:52  sofar