#365 82574L - e1000e - Detected Hardware Unit Hang

wont-fix
nobody
None
in-kernel_driver
1
2014-10-28
2012-12-02
No

I set my IntMode to 1 for both adapters due to bug 360. After 24 hours both adapters die under 50% load.
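For context, IntMode is an e1000e module parameter that takes one comma-separated value per adapter (0=legacy, 1=MSI, 2=MSI-X, the default). A minimal sketch of how the bug-360 workaround would be made persistent (the file name is illustrative):

```shell
# /etc/modprobe.d/e1000e-intmode.conf  (illustrative path/name)
# Force MSI (IntMode=1) instead of the default MSI-X on both onboard
# 82574L ports; e1000e accepts one value per adapter, comma-separated.
options e1000e IntMode=1,1
```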

First it was ASPM on the 82574L that was found to be the subject of a hardware erratum.

Then it is found that the same chip can't handle MSI-X.

Now I'm getting random "Hardware Unit Hang" messages.

This chip is a failure from the fab, and Intel is still selling it today. A case could be made to blacklist this chip from the kernel since it has so many hardware bugs. None of my other Intel NICs, or competitors' chips, have even one outstanding issue.

I'm attaching a screenshot of the hang since the error caused a complete lock up of the server and caused my RAID 5 to start a rebuild when I hard reset it.

I'm wondering which network card to replace the two defective 82574Ls with, as I have had enough of the constant problems with this one chip model. My desktops and laptops can stay up for weeks, but my /server/ cannot last more than 24 hours.

Distro: Fedora 17 x86_64
Motherboard: Supermicro X8SIL-F BIOS 1.2a
Kernel: 3.6.7
I'm using the in-kernel driver.

1 Attachments

Discussion

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -11,3 +11,8 @@
     I'm attaching a screenshot of the hang since the error caused a complete lock up of the server and caused my RAID 5 to start a rebuild when I hard reset it.
    
     I'm wondering what network card I want to replace the two defective 82574Ls as I have had enough with the constant problems of this one chip model. My desktops or laptops can stay up for weeks but my /server/ cannot last more than 24 hours.
    +
    +Distro: Fedora 17 x86_64
    +Motherboard: Supermicro X8SIL-F BIOS 1.2a
    +Kernel: 3.6.7
    +I'm using the in-kernel driver.
    
     
  • Just got another lock up with the same error messages. I have not changed any settings since the start of this bug report. I'm currently investigating a purchase of an add-on card so that I can disable these onboard chips.

     
  • I'm sorry you're having problems with this part. Usually all the things you've tried keep the lockups from happening.

    Can you provide some more info? Please send the full dmesg output once the problem has occurred. Can you send lspci -vvv output from before and after the problem occurs? Can you also download ethregs from our SourceForge project site and provide an ethregs dump from before and after the problem occurs?
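    The requested "before" captures can be scripted along these lines (a sketch; assumes pciutils is installed, and ethregs, the tool from the e1000 SourceForge project, needs root, so it is left commented out):

```shell
# Capture baseline diagnostics before the hang occurs.
dmesg > dmesg-before.txt 2>/dev/null || true       # full kernel log
lspci -vvv > lspci-before.txt 2>/dev/null || true  # PCI config-space detail
# ethregs > ethregs-before.txt                     # NIC register dump (root)
# Repeat the same captures as *-after.txt once the hang has occurred.
```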

    How often does the lockup happen or how soon after boot? Is there a ball-park range?

    This part has been on the market for quite a few years and seems sensitive to some board designs. Our follow-on part is the i210, which has been released now. It's not based on the 82574 part and uses the igb driver instead.

     
  • I can attach the lspci output now. I will most likely not be able to get lspci output when the system locks up. The keyboard is unresponsive over IPMI, but I do get scrolling kernel text. The adapters attempt to reset themselves 2 or 3 times, and then everything halts.

    I cannot reboot the system now to set "iomem" for ethregs, but I will reply with that data later.
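    (For anyone following along: ethregs reads the NIC registers through /dev/mem, which kernels built with CONFIG_STRICT_DEVMEM refuse unless booted with iomem=relaxed. A hedged sketch of the boot-parameter change on a GRUB-based distro:)

```shell
# /etc/default/grub -- append iomem=relaxed to your existing options,
# then regenerate grub.cfg (e.g. with grub2-mkconfig on Fedora) and reboot.
GRUB_CMDLINE_LINUX="iomem=relaxed"
```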

    I see the announcement for the i210, but I do not see it for sale. I do see the i350 card for sale. Do you feel (off the record) that this card should work reliably in a Linux environment with the motherboard in my OP?

     
    Attachments
  • Thanks for the lspci, I'll review it and let you know if I find anything problematic in it. The i350 card is also a newer GbE part; the i210 part is based on it.

     
  • I am attaching the output of ethregs.

    Thanks for the advice, Carolyn.

     
    Attachments
  • The server has locked up yet again. It seems the IntMode setting makes the system lock up faster (1 to 2 days). I have reverted the IntMode setting and I am back on MSI-X, which lasts 5-7 days.

    I have ordered an i350-T2 card to replace these terrible 82574L chips so if you have anything you'd like me to troubleshoot you only have a few more days before I disable the onboard chips.
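    One way to keep the onboard ports down once the add-on card is installed: the i350 is driven by igb (like the i210 mentioned above), so blacklisting e1000e affects only the 82574L ports. A sketch, assuming a BIOS disable is not available:

```shell
# /etc/modprobe.d/blacklist-e1000e.conf  (illustrative name)
# The i350-T2 uses the igb driver, so blacklisting e1000e keeps the
# onboard 82574L ports down while leaving the add-on card working.
blacklist e1000e
```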

     
  • Todd Fujinaka
    2013-07-08

    • status: open --> wont-fix
     
  • Todd Fujinaka
    2013-07-08

    Closing.

     
  • prz
    2013-08-21

    This chip is a piece of junk. I'm wasting a second day trying to get around this issue. Intel forced me to change boards with their release strategy, and this chip is an abomination.

     
  • Todd Fujinaka
    2013-08-21

    Sorry to hear you're having issues. Please open a separate issue with details if you want help debugging this issue. If you need to return a defective Intel NIC, please contact support@intel.com.

     
  • prz
    2013-08-21

    Todd, this is exactly the Detected Hardware Unit Hang (looks like TX) here. I cannot return the chip (it's on a 2550MUD2), and I wish I could junk this Intel board, but I can't. I used the 525MWs, which were OK (the Ethernet chip had issues, but I could work around them in firmware), and was forced onto these MUD2 boards by Intel discontinuing the old ones. The MUD2 is garbage for embedded work. I have now wasted 2-3 days working around the power-save issue (ASPM) and the MSI issues, and manually ported the newest e1000e and e1000 drivers into a 2.6.39 kernel, and I'm stuck now on this adapter hang. This MUD2 does not even work with vanilla Fedora 18. I never expected Intel to release something like this as a replacement for the 525MW, which was pretty good and which I have already deployed. I'm stuck, I'm surely not happy about it, and I am not sure what to tell the customers waiting for orders here.

     
  • Todd Fujinaka
    2013-08-21

    Please open a new issue. I'm not sure what a MUD2 is, but it certainly isn't a Supermicro X8SIL-F, is it?

     
  • Hauke
    2014-10-28

    I had the same problem with this NIC.

    I've built a system using a (slow dual-core) J1800 (Bay Trail-D) processor on an ASRock D1800M, which has an onboard Realtek 8168 NIC.

    I've tried the Realtek 8168 and the Intel PRO/1000, but I get problems with both cards crashing/restarting/hanging under heavy load.

    So might it be a problem with cheap, slow hardware that cannot handle high traffic and many interrupts?

    By the way, this is how I can reproduce these problems quite quickly:
    On the computer under test:
    # iperf -s
    On a different client:
    # while iperf -c <IP_TO_TEST> -P 20; do echo "-"; done

    The Realtek NIC crashes/resets immediately under this network load, while the PRO/1000 looks stable for a few minutes, but extra traffic (e.g. ssh) on the NIC causes the network to hang immediately.

    By the way: the same PRO/1000 card has been running in my FX-8350 system without any problems.

    dmesg shows:
    [ 1842.034733] WARNING: CPU: 0 PID: 0 at /build/linux-i5neKT/linux-3.16.5/net/sched/sch_generic.c:264 dev_watchdog+0x236/0x240()
    [ 1842.034738] NETDEV WATCHDOG: eth1 (e1000e): transmit queue 0 timed out
    [ 1842.034741] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc xfs libcrc32c nls_utf8 nls_cp437 vfat fat intel_powerclamp intel_rapl coretemp ppdev iTCO_wdt kvm_intel kvm i915 iTCO_vendor_support drm_kms_helper drm evdev efi_pstore crc32_pclmul ghash_clmulni_intel serio_raw cryptd efivars pcspkr i2c_algo_bit parport_pc parport battery video shpchp i2c_designware_platform i2c_i801 i2c_designware_core button iosf_mbi lpc_ich mfd_core processor loop autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel e1000e ptp pps_core i2c_hid hid i2c_core sdhci_acpi ahci sdhci libahci xhci_hcd mmc_core usbcore fan usb_common thermal thermal_sys libata scsi_mod
    [ 1842.034848] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16-3-amd64 #1 Debian 3.16.5-1
    [ 1842.034852] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./D1800M, BIOS P1.40 09/01/2014
    [ 1842.034856] 0000000000000009 ffffffff815066c3 ffff880079003e28 ffffffff81065717
    [ 1842.034863] 0000000000000000 ffff880079003e78 0000000000000001 0000000000000000
    [ 1842.034869] ffff880036a9c000 ffffffff8106577c ffffffff81775de8 ffff880000000030
    [ 1842.034876] Call Trace:
    [ 1842.034880] <IRQ> [<ffffffff815066c3>] ? dump_stack+0x41/0x51
    [ 1842.034896] [<ffffffff81065717>] ? warn_slowpath_common+0x77/0x90