#100 e1000e intermittent freeze-until-reboot in 2.6.36+

closed
Bruce Allan
e1000e (106)
in-kernel_driver
6
2013-07-09
2011-02-02
Nix
No

This is possibly ASPM-related: diagnostics to determine it are going on now.

Described in full in http://sourceforge.net/mailarchive/forum.php?thread_name=87k4kfq1at.fsf%40spindle.srvr.nix&forum_name=e1000-devel. In brief, after the hang, a register dump looks like this:

Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 06 88 00 00 06 88 00 00 00 00 00 00 00 00 00 00
070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
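
To make the pattern in the dump above easier to see, here is a minimal Python sketch that parses rows in this `offset: bytes` layout and counts how many leading bytes read back as 0xff. The function names (`parse_dump`, `leading_ff_run`) are hypothetical helpers, not part of any tool mentioned in this ticket. All-ones reads are commonly what a PCI device returns once it stops responding; whether that is what happened here is an assumption, not something the report establishes.

```python
# First rows of the dump from this ticket, copied verbatim.
DUMP = """\
Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def parse_dump(text):
    """Parse 'OFFSET: b0 b1 ...' rows into a dict mapping offset -> bytes."""
    rows = {}
    for line in text.splitlines():
        head, sep, tail = line.partition(":")
        if not sep:
            continue  # header and separator lines carry no colon
        try:
            offset = int(head.strip(), 16)
            data = bytes(int(tok, 16) for tok in tail.split())
        except ValueError:
            continue  # not a hex row
        rows[offset] = data
    return rows


def leading_ff_run(rows):
    """Count consecutive 0xff bytes, starting from the lowest offset."""
    count = 0
    for offset in sorted(rows):
        for value in rows[offset]:
            if value != 0xFF:
                return count
            count += 1
    return count


if __name__ == "__main__":
    print(leading_ff_run(parse_dump(DUMP)))
```

On the rows shown, the run covers offsets 0x000 through 0x02f.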

Keeping the adapter totally idle or persistently active (via pingflooding and apparently even ping -s 1) keeps the hang from happening.

Discussion

<< < 1 2 3 4 > >> (Page 3 of 4)
  • Bruce Allan
    2011-03-17

    Nix, any update on your investigation on why the kernel incorrectly assumes ASPM is disabled?

     
  • Nix
    2011-03-17

    No, sorry for the delay: I've been flooded with job-switch-related stuff and have had no time at all.

    I'll see if I can squeeze in some diagnostics this weekend.

     
  • Nix
    2011-03-19

    Done some debugging (with 2.6.38: you will be unsurprised to learn that the problem persists there).

    get_port_device_capability() is apparently not failing for these devices. It returns 8 for both cards (devices 0000:00:1c.[45]) and also for the peculiar unlabelled 0000:00:1c.0, which lspci -vvv lists as a PCI bridge but which, despite the similar numbering, lspci -t says these two devices are not behind. For every other device on the system, it returns zero (no capabilities).

    This is distinctly peculiar, given that earlier debugging suggests that get_port_device_capability() must be returning zero to pcie_port_device_register() in this case. It occurs to me that said earlier debugging did not in fact indicate which device it was warning about: I bet it was one of the non-ASPM-capable devices, i.e., not the e1000 at all. So that earlier debugging patch probably gave no useful info.

    I'll pile more printk()s into pcie_port_device_register() and report back.
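
The value 8 reported above is a bitmask of port services, so it may help to decode it. The following Python sketch assumes the service bit values as recalled from `include/linux/pcieport_if.h` in 2.6.38-era kernels (PME = 1, AER = 2, hotplug = 4, virtual channel = 8); those values are an assumption to check against the tree actually in use, and `decode_services` is a hypothetical helper.

```python
# Assumed bit values, recalled from 2.6.38-era include/linux/pcieport_if.h.
# Verify against the kernel source being debugged before relying on them.
PORT_SERVICES = {
    0x1: "PME",
    0x2: "AER",
    0x4: "HP",
    0x8: "VC",
}


def decode_services(mask):
    """Return the names of the port services whose bits are set in mask."""
    return [name for bit, name in sorted(PORT_SERVICES.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_services(8))
    print(decode_services(0))
```

Under that assumption, 8 decodes to a single service (virtual channel), and none of these bits describes ASPM support as such.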

     
  • Nix
    2011-03-19

    Proved, I'm afraid: the pcie_no_aspm() call which fired your warning related to the four other PCIe devices on the system which do not support ASPM.

    0000:00:1c.0, 4, and 5 all leave pcie_port_device_register() with nr_service set to 1 (thus, they return successfully) and do not call pcie_no_aspm().

    Thus, your debugging patch earlier produced no output at all related to the e1000e: the problem must lie elsewhere.

     
  • Bruce Allan
    2011-03-22

    The issue with ASPM remaining enabled when it shouldn't be appears to be fixed (on my system at least) using the for-linus branch in Jesse Barnes' pci-2.6 tree[1]. I have not dug into exactly which patch(es) fixes the issue (possibly the patchset that was applied just today). Would it be possible for you to try that tree to see if you still have problems with your 82574 device?

    [1] git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6.git

     
  • Nix
    2011-03-22

    That does look plausible, doesn't it? Unfortunately the failure mode is the same with that tree applied, with or without the patch: ASPM remains enabled. (The set of ASPM-related messages gains an extra line:

    Unable to assume _OSC PCIe control. Disabling ASPM

    which appears somewhat inaccurate, alas.)

    I'll do some more debugging this weekend. (My current job finishes the Thursday after that, so from then on I can spend a lot more time on this.)

     
  • Nix
    2011-03-22

    Still no change, I'm afraid.

     
  • Bruce Allan
    2011-03-23

    ASPM still enabled, eh? Hmm, this has gone beyond my expertise, I'm afraid, and when using the pci-2.6 tree with the most recent patches I am not able to reproduce the problem on any of my systems. You should consider taking the issue of ASPM L0s not getting disabled on the adapter in your system to the PCI experts on the linux-pci@vger.kernel.org mailing list. I'll continue to monitor the situation, but the PCI maintainers are far more knowledgeable about the PCI code than I am, and you have a much better chance of getting this resolved through them.
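
For anyone checking whether ASPM L0s really is still enabled on a link, the state lives in the ASPM Control field, bits 1:0 of the PCIe Link Control register. The Python sketch below decodes that field from a raw 16-bit register value; `aspm_state` is a hypothetical helper, and the sample values are illustrative rather than taken from the system in this ticket.

```python
# ASPM Control field encoding (bits 1:0 of the PCIe Link Control register).
ASPM_CONTROL_MASK = 0x3
ASPM_STATES = {
    0b00: "disabled",
    0b01: "L0s",
    0b10: "L1",
    0b11: "L0s and L1",
}


def aspm_state(link_control):
    """Return which ASPM states are enabled in a Link Control register value."""
    return ASPM_STATES[link_control & ASPM_CONTROL_MASK]


if __name__ == "__main__":
    # Illustrative register values only.
    for value in (0x0040, 0x0041, 0x0043):
        print(f"{value:#06x}: ASPM {aspm_state(value)}")
```

The register value can be read with something like `setpci -s <device> CAP_EXP+10.w`, and lspci -vvv shows the same field on its LnkCtl line.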

     