#100 e1000e intermittent freeze-until-reboot in 2.6.36+

closed
Bruce Allan
e1000e (107)
in-kernel_driver
6
2013-07-09
2011-02-02
Nix
No

This is possibly ASPM-related; diagnostics to confirm that are under way.

Described in full in http://sourceforge.net/mailarchive/forum.php?thread_name=87k4kfq1at.fsf%40spindle.srvr.nix&forum_name=e1000-devel. In brief, after the hang, a register dump looks like this:

Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 06 88 00 00 06 88 00 00 00 00 00 00 00 00 00 00
070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
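The tell-tale sign in a dump like this is the leading run of ff bytes, since all-ones reads are typically what the host sees when a device has stopped responding. A minimal sketch of checking for that automatically follows; the function names are hypothetical, and the sample is just the first four rows of the dump above.

```python
def parse_dump(text):
    """Parse 'offset: byte byte ...' lines into a flat list of ints."""
    data = []
    for line in text.splitlines():
        offset, sep, values = line.partition(":")
        if not sep:
            continue  # header and separator lines have no colon
        try:
            int(offset.strip(), 16)
        except ValueError:
            continue  # not a hex offset, so not a data row
        data.extend(int(byte, 16) for byte in values.split())
    return data


def leading_ff_run(data):
    """Count how many bytes at the start of the dump read back as 0xff."""
    count = 0
    for byte in data:
        if byte != 0xFF:
            break
        count += 1
    return count


# First four rows of the dump quoted in this ticket.
SAMPLE = """\
Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""

dump = parse_dump(SAMPLE)
print(len(dump), leading_ff_run(dump))  # 64 48
```

A long leading run is only a hint that the adapter has hung; it says nothing about why.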

Keeping the adapter either totally idle or persistently active (via ping flooding, and apparently even ping -s 1) keeps the hang from happening.

Discussion

1 2 3 4 > >> (Page 1 of 4)
  • Nix
    2011-02-02

    .config from kernel with freezing NIC

     
  • Nix
    2011-02-02

    dmesg output

     
  • Nix
    2011-02-02

    lspci from working kernel

     
  • Nix
    2011-02-02

    lspci from kernel with freezing NIC

     
  • ASPM is enabled on your adapter. It will not work with ASPM on, which explains your issues.

    If you boot the kernel with pcie_aspm=off appended to the kernel command line, the issue should go away.

    02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    Subsystem: Intel Corporation Device 0000
    ...
    ==> LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    ...
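To check this field without reading lspci output by eye, the LnkCtl line can be parsed. This is a rough sketch that assumes the line has the same shape as the one quoted above ("ASPM ... Enabled;" or "ASPM Disabled;"); the function name is hypothetical.

```python
import re


def aspm_states(lspci_text):
    """Return the ASPM states a LnkCtl line reports as enabled.

    Returns None when no LnkCtl line is present, and an empty set
    when the line reports ASPM as disabled.
    """
    match = re.search(r"LnkCtl:\s*ASPM\s+([^;]*)", lspci_text)
    if match is None:
        return None
    field = match.group(1)  # text between "ASPM" and the first ";"
    if "Disabled" in field:
        return set()
    return set(re.findall(r"L0s|L1", field))


line = ("LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- "
        "Retrain- CommClk+")
print(sorted(aspm_states(line)))  # ['L0s', 'L1']
```

Feeding it the output of lspci -vv for the adapter before and after a boot-time change shows whether the change actually reached the device.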

     
  • Nix
    2011-02-19

    I did. It didn't. I reported this to the list about two weeks ago :) Currently the only thing that fixes it is a setpci call to force ASPM off directly. This is not ideal!

    (I will double-check with 2.6.37.1 when I reboot into it tomorrow -- perhaps I misspelled aspm last time or something -- and then it's time to stick printks into pcie_aspm_init_link_state(), I think. The only way I can see that this can be going wrong is if it's being initialized late, after the driver has turned ASPM off, but that should be completely impossible.)
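For reference, forcing ASPM off by hand comes down to clearing the ASPM Control field, bits 1:0 of the PCIe Link Control register. The exact setpci invocation is not given in this thread, so the sketch below shows only the bit arithmetic. The value 0x0043 is illustrative, chosen to match the flags in the lspci line quoted earlier (ASPM L0s L1, CommClk+); it was not read from the affected hardware, and the function names are hypothetical.

```python
ASPM_MASK = 0x0003  # Link Control bits 1:0, the ASPM Control field

ASPM_NAMES = {0: "Disabled", 1: "L0s", 2: "L1", 3: "L0s L1"}


def aspm_field(lnkctl):
    """Decode the ASPM Control field of a 16-bit Link Control value."""
    return ASPM_NAMES[lnkctl & ASPM_MASK]


def without_aspm(lnkctl):
    """Return the Link Control value with both ASPM bits cleared."""
    return lnkctl & ~ASPM_MASK & 0xFFFF


before = 0x0043  # illustrative: CommClk (bit 6) set, ASPM L0s and L1 on
after = without_aspm(before)
print(aspm_field(before), "->", aspm_field(after), hex(after))
# L0s L1 -> Disabled 0x40
```

All other bits are left as they were, which is the point of masking rather than writing a fixed value.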

     
  • Nix
    2011-02-19

    Confirmed.

    append="pcie_aspm=off"

    in /etc/lilo.conf, yet the PCI configuration space shows ASPM is still on, and the card goes unresponsive, with registers filled with FFs, in less than a day of uptime.
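To rule out a misspelled or dropped boot parameter, the running kernel's command line can be checked directly. This is a small sketch with hypothetical function names, and the sample command line is made up for illustration. It confirms only that the parameter reached the kernel, not what state the hardware ended up in.

```python
def cmdline_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None; a repeated name keeps its last value.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def aspm_off_requested(cmdline):
    """True if pcie_aspm=off appears on the given command line."""
    return cmdline_params(cmdline).get("pcie_aspm") == "off"


# Hypothetical command line; on a live system read /proc/cmdline instead:
#     with open("/proc/cmdline") as f:
#         print(aspm_off_requested(f.read()))
print(aspm_off_requested("BOOT_IMAGE=linux ro root=/dev/sda1 pcie_aspm=off"))
# True
```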

     
  • I'm hoping Bruce (bwa) will be able to either settle this with a -stable patch to the kernel or tell you what commit you need to have in order for the in-kernel e1000e driver to work.

     
  • Nix
    2011-02-22

    Assuming an upstream commit does fix it. I'll try git head, probably tomorrow, and see if the ASPM register has the right value after boot there.

     
  • Nix
    2011-02-26

    This still goes wrong with current git head.

     