#423 i217v: Detected Hardware Unit Hang causes data corruption

open
None
in-kernel_driver
1
2015-08-20
2014-07-08
No

Sorry to add to the growing number of bugs related to the "Detected Hardware Unit Hang" message, but I've read them all and did not find any mention of whether the data being transmitted is correct or not.

While rsyncing my large backups over ssh to another location, I was constantly interrupted by the following complaint on the remote machine (priest):

Jul  7 20:53:13 priest sshd[19230]: Corrupted MAC on input.
Jul  7 20:53:13 priest sshd[19230]: Disconnecting: Packet corrupt

At about the same time (within half a minute), my NAT box (pinky) logged:

Jul  7 20:52:52 pinky kernel: e1000e 0000:00:19.0 ext0eth: Detected Hardware Unit Hang:
Jul  7 20:52:52 pinky kernel: TDH                  <a3>
Jul  7 20:52:52 pinky kernel: TDT                  <fb>
Jul  7 20:52:52 pinky kernel: next_to_use          <fb>
Jul  7 20:52:52 pinky kernel: next_to_clean        <a3>
Jul  7 20:52:52 pinky kernel: buffer_info[next_to_clean]:
Jul  7 20:52:52 pinky kernel: time_stamp           <101b08061>
Jul  7 20:52:52 pinky kernel: next_to_watch        <a4>
Jul  7 20:52:52 pinky kernel: jiffies              <101b08180>
Jul  7 20:52:52 pinky kernel: next_to_watch.status <0>
Jul  7 20:52:52 pinky kernel: MAC Status             <80083>
Jul  7 20:52:52 pinky kernel: PHY Status             <796d>
Jul  7 20:52:52 pinky kernel: PHY 1000BASE-T Status  <3800>
Jul  7 20:52:52 pinky kernel: PHY Extended Status    <3000>
Jul  7 20:52:52 pinky kernel: PCI Status             <10>

And again, earlier the same evening:

Jul  7 20:10:29 pinky kernel: e1000e 0000:00:19.0 ext0eth: Detected Hardware Unit Hang:
Jul  7 20:10:29 pinky kernel: TDH                  <f9>
Jul  7 20:10:29 pinky kernel: TDT                  <b0>
Jul  7 20:10:29 pinky kernel: next_to_use          <b0>
Jul  7 20:10:29 pinky kernel: next_to_clean        <f7>
Jul  7 20:10:29 pinky kernel: buffer_info[next_to_clean]:
Jul  7 20:10:29 pinky kernel: time_stamp           <101ac9ed5>
Jul  7 20:10:29 pinky kernel: next_to_watch        <f9>
Jul  7 20:10:29 pinky kernel: jiffies              <101aca024>
Jul  7 20:10:29 pinky kernel: next_to_watch.status <0>
Jul  7 20:10:29 pinky kernel: MAC Status             <80083>
Jul  7 20:10:29 pinky kernel: PHY Status             <796d>
Jul  7 20:10:29 pinky kernel: PHY 1000BASE-T Status  <7800>
Jul  7 20:10:29 pinky kernel: PHY Extended Status    <3000>
Jul  7 20:10:29 pinky kernel: PCI Status             <10>

Jul  7 20:10:50 priest sshd[19205]: Corrupted MAC on input.
Jul  7 20:10:50 priest sshd[19205]: Disconnecting: Packet corrupt

And this goes on through the logs of both machines.
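For anyone reading the two register dumps above, here is a rough sketch of the arithmetic on the logged values: how many TX descriptors were still outstanding (TDT minus TDH, modulo the ring size) and how old the oldest unfinished buffer was (jiffies minus time_stamp). The ring size of 256 is an assumption (it is the e1000e default and fits the one-byte indices in the dumps); the helper names are made up for illustration. The kernel's HZ value is not in this report, so the stall is left in jiffies rather than converted to seconds.

```shell
#!/bin/sh
# Assumption: TX ring has 256 descriptors (e1000e default); override via RING_SIZE.
RING_SIZE="${RING_SIZE:-256}"

# Descriptors queued to the NIC but not yet completed: (TDT - TDH) mod ring size.
# Args: $1 = TDH, $2 = TDT (hex with 0x prefix, as taken from the dump).
pending_desc() {
    echo $(( ($2 - $1 + RING_SIZE) % RING_SIZE ))
}

# Age of the oldest unfinished buffer, in jiffies.
# Args: $1 = time_stamp, $2 = jiffies. Needs a shell with 64-bit arithmetic.
stall_jiffies() {
    echo $(( $2 - $1 ))
}

pending_desc 0xa3 0xfb                  # 20:52:52 dump
pending_desc 0xf9 0xb0                  # 20:10:29 dump, tail has wrapped around
stall_jiffies 0x101b08061 0x101b08180   # 20:52:52 dump
stall_jiffies 0x101ac9ed5 0x101aca024   # 20:10:29 dump
```

In both dumps next_to_watch.status is 0, meaning the descriptor the driver is waiting on had not been marked done by the hardware.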

Anyway, I am troubled by the fact that corrupted packets reach the application layer with corrupted data; in this case the corruption was detected by the ssh application-level protocol. I am concerned that users of this driver/hardware combination, including myself, will end up corrupting their data as a result of this glitch when using less paranoid protocols (FTP, etc.).

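The behavior stopped after disabling rx and tx segmentation offload with ethtool, leading to the state below (note that tcp-segmentation-offload and generic-segmentation-offload are off, while generic-receive-offload is still on):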

pinky ~ # ethtool -k ext0eth
Features for ext0eth:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
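For reference, a minimal sketch of how the state above (TSO and GSO off) could be reached. The exact command line I ran is not recorded in this report, so treat this as an assumption based on the output shown; the helper name and the APPLY switch are made up for illustration. By default the script only prints the command; with APPLY=1 it runs it (root required).

```shell
#!/bin/sh
# Interface name as it appears in this report; override via IFACE.
IFACE="${IFACE:-ext0eth}"

# Build the ethtool command that turns off TCP and generic segmentation offload.
# "tso" and "gso" are ethtool's short names for these two features.
offload_off_cmd() {
    echo "ethtool -K $1 tso off gso off"
}

cmd=$(offload_off_cmd "$IFACE")
if [ "${APPLY:-0}" = "1" ]; then
    # Apply the change, then show the resulting segmentation offload settings.
    $cmd && ethtool -k "$IFACE" | grep -E 'segmentation-offload'
else
    # Dry run: only show what would be executed.
    echo "$cmd"
fi
```

Settings changed with ethtool -K do not survive a reboot or a driver reload, so the workaround has to be reapplied from a startup script.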

uname -a (both machines run Gentoo):

Linux priest 3.6.11+ #484 PREEMPT Mon Jun 24 15:45:35 BST 2013 armv6l ARMv6-compatible processor rev 7 (v6l) BCM2708 GNU/Linux
Linux pinky 3.14.5-hardened-r2 #1 SMP Fri Jun 13 10:54:23 EEST 2014 x86_64 Intel(R) Xeon(R) CPU E3-1230L v3 @ 1.80GHz GenuineIntel GNU/Linux

lspci -v (pinky):

00:19.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I217-V [8086:153b] (rev 05)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:e000]
        Flags: bus master, fast devsel, latency 0, IRQ 44
        Memory at f0300000 (32-bit, non-prefetchable) [size=128K]
        Memory at f0335000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at f040 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] PCI Advanced Features
        Kernel driver in use: e1000e

Discussion

  • Todd Fujinaka - 2015-05-12
    • assigned_to: Yanir Lubetkin

  • Todd Fujinaka - 2015-08-20
    • assigned_to: Yanir Lubetkin --> Raanan Avargil
