#407 e1000e Detected Hardware Unit Hang 3.0.4.1-NAPI

open
None
standalone_driver
8
2015-12-12
2014-04-15
Max
No

We have several new Supermicro boxes, all of which are getting this error. The connected switch reports a link loss for three seconds during each occurrence. Each box gets this error several times per day. We also tried several other driver versions in the 2.x.x series, including those shipped with the standard Linux 3.8 and 3.11 kernels, before trying 3.0.4.1-NAPI.

I'm attaching
- dmesg output
- eeprom dump
- lspci -vvv

Please let me know if I can add any other info or if you have questions.

Thx

Max

p.s. all output files attached as a single file (hopefully?) as allfiles.txt

1 Attachments

Discussion

  • Todd Fujinaka
    2014-04-16

    • assigned_to: dertman
     
  • dertman
    2014-04-16

    Could you please try disabling EEE and EEE advertising to see if that has an effect on the issue?

    ethtool --set-eee eth0 eee off
    ethtool --set-eee eth0 advertise 0

     
  • Max
    2014-04-16

    OK cool, thanks for the help. I downloaded and built ethtool-3.13 and ran those commands. We haven't found a way to make the problem happen on demand but yesterday there were 34 occurrences, so the debug cycle for this bug currently has a built-in latency : ) I'll post an update tomorrow (or beforehand if there is a recurrence).

    Max
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    ./ethtool --show-eee eth0

    EEE Settings for eth0:
    EEE status: disabled
    Tx LPI: 17 (us)
    Supported EEE link modes: 100baseT/Full
    1000baseT/Full
    Advertised EEE link modes: Not reported
    Link partner advertised EEE link modes: 100baseT/Full
    1000baseT/Full

     
  • dertman
    2014-04-16

    Also, could you supply the output from "ethtool -i eth0"?

     
  • Max
    2014-04-16

    Unfortunately, disabling EEE does not seem to have fixed it. I've pasted the latest hang report and the output of "ethtool -i eth0" below.

    Thanks

    Max

    Apr 16 16:42:34 ubuntu kernel: [372570.239954] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] TDH <48>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] TDT <8f>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] next_to_use <8f>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] next_to_clean <48>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] buffer_info[next_to_clean]:
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] time_stamp <1058af3aa>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] next_to_watch <48>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] jiffies <1058afa9d>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] next_to_watch.status <0>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] MAC Status <40080083>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] PHY Status <796d>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] PHY 1000BASE-T Status <3800>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] PHY Extended Status <3000>
    Apr 16 16:42:34 ubuntu kernel: [372570.239954] PCI Status <10>
    Apr 16 16:42:35 ubuntu kernel: [372571.244068] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
    Apr 16 16:42:39 ubuntu kernel: [372575.076263] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx


    ./ethtool -i eth0

    driver: e1000e
    version: 3.0.4.1-NAPI
    firmware-version: 0.13-4
    bus-info: 0000:00:19.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no

    ./ethtool --show-eee eth0

    EEE Settings for eth0:
    EEE status: disabled
    Tx LPI: 17 (us)
    Supported EEE link modes: 100baseT/Full
    1000baseT/Full
    Advertised EEE link modes: Not reported
    Link partner advertised EEE link modes: 100baseT/Full
    1000baseT/Full

     
  • Max
    2014-04-27

    Any update on this? Is there any additional info that we could supply?

    Thx

    Max

     
  • asmlover
    2014-04-28

    I have the same problems on a Supermicro X9SCM-F. Did you try kernel 3.10.38? The git log has quite a bunch of fixes from the previous weeks.

     
  • Olaf Marzocchi
    2014-05-03

    Hello, I experience the same issue on 2.5.4 and 3.0.x, but not on 2.4.14:

    May  2 23:32:46 Xeon-di-Olaf kernel[0]: AppleIntelE1000e(Err): Detected Hardware Unit Hang:
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: TDH                  <81>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: TDT                  <5b>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: next_to_use          <5b>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: next_to_clean        <7d>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: buffer_info[next_to_clean]:
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: time_stamp           <59faf3>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: next_to_watch        <81>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: next_to_watch.status <0>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: MAC Status             <80083>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: PHY Status             <796d>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: PHY 1000BASE-T Status  <3c00>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: PHY Extended Status    <3000>
    May  2 23:32:46 Xeon-di-Olaf kernel[0]: PCI Status             <10>
    

    I have a GA-Z87MX-D3H, but I don't know which network chip it uses; it is 10/100/1000.
    I can provide more info if needed.

     
  • Oliver Wagner
    2014-05-05

    Try disabling TSO and GSO:

    ethtool -K eth0 tso off
    ethtool -K eth0 gso off

    See also tickets #372 and #378 which IMHO describe the same bug
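
After running the two commands above, it is worth confirming that the settings actually took effect, since `ethtool -k` (lowercase) reports the current state. The helper below is a rough sketch, not part of ethtool: the function name is hypothetical, and it only assumes the `name: state` line format that `ethtool -k` prints.

```shell
# Print the state ("on" or "off") of one feature, reading `ethtool -k`
# output on stdin. Any trailing annotation such as "[fixed]" is dropped.
# Usage: ethtool -k eth0 | offload_state tcp-segmentation-offload
offload_state() {
    awk -v key="$1" -F':' '
        {
            name = $1
            sub(/^[ \t]+/, "", name)      # strip indentation before the name
        }
        name == key {
            state = $2
            sub(/^[ \t]+/, "", state)     # strip the space after the colon
            sub(/[ \t].*$/, "", state)    # keep only the first word
            print state
            exit
        }
    '
}
```

For example, `ethtool -k eth0 | offload_state generic-segmentation-offload` should print `off` once the workaround is in place.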

     
  • Max
    2014-05-09

    OK I ran the ethtool commands suggested (thanks, Oliver) - and we haven't seen any further occurrences of the Hang problem. It's been a few days, so it seems like it ought to have occurred since then if it were still susceptible.

    We've got a few other boxes where this problem was also occurring, and I'll be running these commands on them as well, but so far, so good!

    This is one of those cases where it's difficult to prove a negative, and for all I know it may be premature to say it's completely fixed. Nonetheless this is encouraging.

    Max

     
  • Oliver Wagner
    2014-05-10

    Note that this is not really a fix, just a workaround, as it disables a performance feature (segmentation offload).

    Best Regards,
    Olli

     
  • Max
    2014-05-14

    We made the same change on the remainder of the boxes and got this email from "the boss":

     "Excellent!  We have not seen any issues since."
    

    So it seems that disabling segmentation offload has made the boss happy. It may well be true that this is just a workaround: I believe it means the CPU on the mobo is working harder than it did before, and the root cause of the problem is not actually addressed here. Nonetheless, this is good progress. I'm not sure whether this means the ticket can or should be closed, and it might be difficult to get the boss to agree to any other changes now. In any case, thanks again, Oliver, for your help!

    Best

    Max

     
  • Oliver Wagner
    2014-05-26

    For the sake of completeness (see #372 and #378):

    This issue still happens with

    e1000e: Intel(R) PRO/1000 Network Driver - 3.0.4.1-NAPI

    on

    Linux gateway1 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

     
  • root@r5:~# ethtool -i eth1
    driver: e1000e
    version: 3.1.0.2-NAPI
    firmware-version: 0.10-2
    bus-info: 0000:00:19.0

    [1061400.823272] e1000e 0000:00:19.0 eth1: Detected Hardware Unit Hang:
    [1061400.823272] TDH <0>
    [1061400.823272] TDT <1>
    [1061400.823272] next_to_use <1>
    [1061400.823272] next_to_clean <0>
    [1061400.823272] buffer_info[next_to_clean]:
    [1061400.823272] time_stamp <106537a01>
    [1061400.823272] next_to_watch <0>
    [1061400.823272] jiffies <106537b7e>
    [1061400.823272] next_to_watch.status <0>
    [1061400.823272] MAC Status <40080083>
    [1061400.823272] PHY Status <796d>
    [1061400.823272] PHY 1000BASE-T Status <3800>
    [1061400.823272] PHY Extended Status <2000>
    [1061400.823272] PCI Status <10>
    [1061402.551187] e1000e: eth1 NIC Link is Down

    root@r5:~# ethtool -k eth1
    Offload parameters for eth1:
    rx-checksumming: on
    tx-checksumming: on
    scatter-gather: on
    tcp-segmentation-offload: off
    udp-fragmentation-offload: off
    generic-segmentation-offload: off
    generic-receive-offload: on
    large-receive-offload: off
    ntuple-filters: off
    receive-hashing: on

    00:19.0 Ethernet controller: Intel Corporation 82578DM Gigabit Network Connection (rev 05)
    Subsystem: Intel Corporation Device 34ec
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 62
    Region 0: Memory at b3200000 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at b3225000 (32-bit, non-prefetchable) [size=4K]
    Region 2: I/O ports at 4040 [size=32]
    Capabilities: [c8] Power Management version 2
    Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Address: 00000000fee0f00c Data: 412a
    Capabilities: [e0] PCI Advanced Features
    AFCap: TP+ FLR+
    AFCtrl: FLR-
    AFStatus: TP+
    Kernel driver in use: e1000e

     
  • Todd Fujinaka
    2015-05-12

    • assigned_to: dertman --> Yanir Lubetkin
     
  • Graham Crowe
    2015-05-14

    I also have this problem.

    A machine running Fedora 19, with four Xen guests also running Fedora 19 was working fine. I upgraded the host to Fedora 21, but left the guests running Fedora 19 and the problem started.

    Note that the host has one interface, em1, with 6 VLANs configured. Each VLAN interface is a member of a bridge which the guests connect to.

    I tried
    ethtool --set-eee em1 eee off
    ethtool --set-eee em1 advertise 0
    but the problem didn't go away.

    I then tried
    ethtool -K em1 tso off
    which stopped the problem. The problem came back when I turned TSO back on, and then cleared again when I turned it off.

    After the upgrade of the host from F19 to F21, the system worked fine until I printed something from a Windows box via CUPS on one of the guests. The host interface would then yo-yo up and down repeatedly. Rebooting the guest or the host would not fix the issue, but disabling CUPS on the guest would stop the problems (re-enabling CUPS would trigger the problem again). Once I had disabled TSO, I was able to re-enable CUPS and it finished printing.
    Note that this interface carries continually high incoming traffic (average > 20Mb/s according to MRTG) due to Ethernet video streaming. One of the guests runs a MythTV backend, which has run well since the upgrade, and other large transfers (in both directions), such as Samba, haven't shown any issues, so I don't really understand how CUPS would be different here.

    Outputs from ethtool as follows...

    [root@host1 ~]# ethtool -i em1
    driver: e1000e
    version: 2.3.2-k
    firmware-version: 0.13-4
    bus-info: 0000:00:19.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no

    Output from dmesg as follows...

    [168815.466415] e1000e 0000:00:19.0 em1: Detected Hardware Unit Hang:
    TDH <63>
    TDT <d3>
    next_to_use <d3>
    next_to_clean <62>
    buffer_info[next_to_clean]:
    time_stamp <10a0b3a2d>
    next_to_watch <66>
    jiffies <10a0b4100>
    next_to_watch.status <0>
    MAC Status <80083>
    PHY Status <796d>
    PHY 1000BASE-T Status <7800>
    PHY Extended Status <3000>
    PCI Status <10>
    [168817.466427] e1000e 0000:00:19.0 em1: Detected Hardware Unit Hang:
    TDH <63>
    TDT <d3>
    next_to_use <d3>
    next_to_clean <62>
    buffer_info[next_to_clean]:
    time_stamp <10a0b3a2d>
    next_to_watch <66>
    jiffies <10a0b48d0>
    next_to_watch.status <0>
    MAC Status <80083>
    PHY Status <796d>
    PHY 1000BASE-T Status <7800>
    PHY Extended Status <3000>
    PCI Status <10>
    [168819.466448] e1000e 0000:00:19.0 em1: Detected Hardware Unit Hang:
    TDH <63>
    TDT <d3>
    next_to_use <d3>
    next_to_clean <62>
    buffer_info[next_to_clean]:
    time_stamp <10a0b3a2d>
    next_to_watch <66>
    jiffies <10a0b50a0>
    next_to_watch.status <0>
    MAC Status <80083>
    PHY Status <796d>
    PHY 1000BASE-T Status <7800>
    PHY Extended Status <3000>
    PCI Status <10>
    [168821.466558] e1000e 0000:00:19.0 em1: Detected Hardware Unit Hang:
    TDH <63>
    TDT <d3>
    next_to_use <d3>
    next_to_clean <62>
    buffer_info[next_to_clean]:
    time_stamp <10a0b3a2d>
    next_to_watch <66>
    jiffies <10a0b5870>
    next_to_watch.status <0>
    MAC Status <80083>
    PHY Status <796d>
    PHY 1000BASE-T Status <7800>
    PHY Extended Status <3000>
    PCI Status <10>
    [168823.466557] e1000e 0000:00:19.0 em1: Detected Hardware Unit Hang:
    TDH <63>
    TDT <d3>
    next_to_use <d3>
    next_to_clean <62>
    buffer_info[next_to_clean]:
    time_stamp <10a0b3a2d>
    next_to_watch <66>
    jiffies <10a0b6040>
    next_to_watch.status <0>
    MAC Status <80083>
    PHY Status <796d>
    PHY 1000BASE-T Status <7800>
    PHY Extended Status <3000>
    PCI Status <10>
    [168824.474285] e1000e 0000:00:19.0 em1: Reset adapter unexpectedly
    [168824.495051] br10: port 1(em1.10) entered disabled state
    [168824.495320] br20: port 1(em1.20) entered disabled state
    [168824.495561] br21: port 1(em1.21) entered disabled state
    [168824.495792] br100: port 1(em1.100) entered disabled state
    [168824.496031] br101: port 1(em1.101) entered disabled state
    [168824.496273] br102: port 1(em1.102) entered disabled state
    [168827.629514] e1000e: em1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [168827.629817] br10: port 1(em1.10) entered forwarding state
    [168827.630039] br10: port 1(em1.10) entered forwarding state
    [168827.630267] br20: port 1(em1.20) entered forwarding state
    [168827.630485] br20: port 1(em1.20) entered forwarding state
    [168827.630738] br21: port 1(em1.21) entered forwarding state
    [168827.630941] br21: port 1(em1.21) entered forwarding state
    [168827.631156] br100: port 1(em1.100) entered forwarding state
    [168827.631367] br100: port 1(em1.100) entered forwarding state
    [168827.631585] br101: port 1(em1.101) entered forwarding state
    [168827.631786] br101: port 1(em1.101) entered forwarding state
    [168827.631991] br102: port 1(em1.102) entered forwarding state
    [168827.632176] br102: port 1(em1.102) entered forwarding state
    [168841.478764] e1000e 0000:00:19.0 em1: Detected Hardware Unit Hang:
    TDH <f2>
    TDT <73>
    next_to_use <73>
    next_to_clean <f1>
    buffer_info[next_to_clean]:
    time_stamp <10a0b9f9a>
    next_to_watch <f4>
    jiffies <10a0ba69c>
    next_to_watch.status <0>
    MAC Status <80083>
    PHY Status <796d>
    PHY 1000BASE-T Status <3800>
    PHY Extended Status <3000>
    PCI Status <10>
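
The repeated dumps above contain enough information to work out how long the transmitter was stuck. The first five reports share the same time_stamp, so no buffer was cleaned between them, and two consecutive reports give both a kernel timestamp and a jiffies value, from which the tick rate can be estimated. The sketch below does that arithmetic; the function names are hypothetical and the result is only as good as the two samples fed in.

```shell
# Estimate the kernel tick rate (HZ) from two log samples.
# Usage: estimate_hz <t1_seconds> <jiffies1_hex> <t2_seconds> <jiffies2_hex>
estimate_hz() {
    jiffies_diff=$(( 0x$4 - 0x$2 ))
    awk -v dj="$jiffies_diff" -v t1="$1" -v t2="$3" \
        'BEGIN { printf "%.0f\n", dj / (t2 - t1) }'
}

# Milliseconds the oldest transmit buffer has been waiting.
# Usage: stall_ms <time_stamp_hex> <jiffies_hex> <hz>
stall_ms() {
    echo $(( (0x$2 - 0x$1) * 1000 / $3 ))
}
```

Using the first two reports, `estimate_hz 168815.466415 10a0b4100 168817.466427 10a0b48d0` comes out at 1000. With that rate, `stall_ms 10a0b3a2d 10a0b4100 1000` gives 1747 ms at the first report and `stall_ms 10a0b3a2d 10a0b6040 1000` gives 9747 ms at the fifth, shortly before the adapter reset.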

     
  • Todd Fujinaka
    2015-08-20

    • assigned_to: Yanir Lubetkin --> Raanan Avargil
     
  • Silvan Raijer
    2015-12-11

    I have also been experiencing this for quite some time now and have been applying the hotfix after every reboot (ethtool -K eth0 gso off gro off tso off).
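
Since the offload settings are lost on reboot, one option is to wrap the command in a small function and call it from whatever boot-time hook the system provides. This is only a sketch: the function name and the `ETHTOOL` override variable are invented here for illustration, and where to call it from (for example a `post-up` line with ifupdown on Debian) is an assumption, because the thread does not say how the network is configured.

```shell
# Disable the three offloads from the hotfix on one interface.
# ETHTOOL may be overridden (for example ETHTOOL=echo) to do a dry run
# that prints the arguments instead of changing the NIC.
# Usage: disable_offloads <interface>
disable_offloads() {
    "${ETHTOOL:-ethtool}" -K "$1" gso off gro off tso off
}
```

Running it with `ETHTOOL=echo` set first shows exactly what would be passed to ethtool without touching the interface, which is handy when testing a boot script.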

    The system is running Debian 8.2 on kernel 4.2.6, using the built-in e1000e driver.

    Found a topic on serverfault.com suggesting turning off 'PCIe Active State Power Management' with kernel option 'pcie_aspm=off'
    Can confirm for now that this works.
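
When testing a kernel option like this, it helps to verify that the running kernel was really booted with it, because a bootloader change that did not take effect looks the same as an option that does not help. A minimal sketch follows, assuming the usual space-separated format of /proc/cmdline; the function name is hypothetical.

```shell
# Succeed if the kernel command line read from stdin contains the given
# option as a whole, space-separated word.
# Usage: cmdline_has pcie_aspm=off < /proc/cmdline
cmdline_has() {
    IFS= read -r cmdline_line || [ -n "$cmdline_line" ] || return 1
    case " $cmdline_line " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, `cmdline_has pcie_aspm=off < /proc/cmdline && echo "booted with ASPM off"` prints the message only if the option was present at boot.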

    Some more information:

    eth0 is running a bridge with 2 vlans

    root@riddler:~# lspci -v
    00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
    Subsystem: Intel Corporation Device 2035
    Flags: bus master, fast devsel, latency 0, IRQ 28
    Memory at f7e00000 (32-bit, non-prefetchable) [size=128K]
    Memory at f7e35000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at f080 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] PCI Advanced Features
    Kernel driver in use: e1000e

    root@riddler:~# ethtool -i eth0
    driver: e1000e
    version: 3.2.5-k
    firmware-version: 0.13-4
    bus-info: 0000:00:19.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no

    root@riddler:~# ethtool -k eth0
    Features for eth0:
    rx-checksumming: on
    tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
    scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
    tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp6-segmentation: on
    udp-fragmentation-offload: off [fixed]
    generic-segmentation-offload: on
    generic-receive-offload: on
    large-receive-offload: off [fixed]
    rx-vlan-offload: on
    tx-vlan-offload: on
    ntuple-filters: off [fixed]
    receive-hashing: on
    highdma: on [fixed]
    rx-vlan-filter: off [fixed]
    vlan-challenged: off [fixed]
    tx-lockless: off [fixed]
    netns-local: off [fixed]
    tx-gso-robust: off [fixed]
    tx-fcoe-segmentation: off [fixed]
    tx-gre-segmentation: off [fixed]
    tx-ipip-segmentation: off [fixed]
    tx-sit-segmentation: off [fixed]
    tx-udp_tnl-segmentation: off [fixed]
    fcoe-mtu: off [fixed]
    tx-nocache-copy: off
    loopback: off [fixed]
    rx-fcs: off
    rx-all: off
    tx-vlan-stag-hw-insert: off [fixed]
    rx-vlan-stag-hw-parse: off [fixed]
    rx-vlan-stag-filter: off [fixed]
    l2-fwd-offload: off [fixed]
    busy-poll: off [fixed]

     
  • Silvan Raijer
    2015-12-12

    After more extensive testing, the unit hang unfortunately came back, so using the kernel option 'pcie_aspm=off' is not a workaround.