#100 e1000e intermittent freeze-until-reboot in 2.6.36+

closed
Bruce Allan
e1000e (107)
in-kernel_driver
6
2013-07-09
2011-02-02
Nix
No

This is possibly ASPM-related: diagnostics to determine it are going on now.

Described in full in http://sourceforge.net/mailarchive/forum.php?thread_name=87k4kfq1at.fsf%40spindle.srvr.nix&forum_name=e1000-devel, in brief, after the hang, a register dump looks like this:

Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 06 88 00 00 06 88 00 00 00 00 00 00 00 00 00 00
070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Keeping the adapter totally idle or persistently active (via pingflooding and apparently even ping -s 1) keeps the hang from happening.

Discussion

  • Nix
    Nix
    2011-02-02

    .config from kernel with freezing NIC

     
    Attachments
  • Nix
    Nix
    2011-02-02

    dmesg output

     
    Attachments
  • Nix
    Nix
    2011-02-02

    lspci from working kernel

     
    Attachments
  • Nix
    Nix
    2011-02-02

    lspci from kernel with freezing NIC

     
    Attachments
  • ASPM is enabled on your adapter, and it will not work, and explains your issues.

    if you boot the kernel with pcie_aspm=off appended to kernel line, the issue should go away.

    02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    Subsystem: Intel Corporation Device 0000
    ...
    ==> LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    ...

     
  • Nix
    Nix
    2011-02-19

    I did. It didn't. I reported this to the list about two weeks ago :) currently the only thing that fixes it is a setpci call to force ASPM off directly. This is not ideal!

    (I will double-check with 2.6.37.1 when I reboot into it tomorrow -- perhaps I misspelled aspm last time or something -- and then it's time to stick printks into pcie_aspm_init_link_state(), I think. The only way I can see that this can be going wrong is if it's being initialized late, after the driver has turned ASPM off, but that should be completely impossible.)

     
  • Nix
    Nix
    2011-02-19

    Confirmed.

    append="pcie_aspm=off"

    in /etc/lilo.conf, and the PCI configuration space shows ASPM is on, and we get an unresponsive card with registers filled with FFs in less than a day of uptime.

     
  • I'm hoping bruce (bwa) will be able to either settle this with a -stable patch to the kernel or tell you what commit you need to have in order for the in-kernel e1000e driver to work.

     
  • Nix
    Nix
    2011-02-22

    Assuming an upstream commit does fix it. I'll try git head, probably tomorrow, and see if the ASPM register has the right value after boot there.

     
  • Nix
    Nix
    2011-02-26

    This still goes wrong with current git head.

     
  • Bruce Allan
    Bruce Allan
    2011-03-01

    I am unable to reproduce the issue with ASPM not getting disabled on the hardware that I currently have.

    To confirm (please forgive me if the information has already been provided):
    What type of system is this?
    When not specifying the pcie_aspm kernel parameter, do you see in the kernel log the message "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it"?

    Would you also please provide the output of 'lspci -t'?

     
  • Nix
    Nix
    2011-03-01

    This is a DNUK Netserver, a PCIe-based Nehalem L5520.

    The string 'ACPI FADT' appears nowhere in any boot logs I possess (going back to October last year), which are mostly logs of boots done without the pcie_aspm parameter.

    lspci -t output:

    -[0000:00]-+-00.0
    +-01.0-[0a]--
    +-03.0-[09]--
    +-05.0-[08]--
    +-07.0-[05-07]--+-00.0-[07]----0e.0
    | -00.2-[06]--
    +-13.0
    +-14.0
    +-14.1
    +-14.2
    +-14.3
    +-16.0
    +-16.1
    +-16.2
    +-16.3
    +-16.4
    +-16.5
    +-16.6
    +-16.7
    +-1a.0
    +-1a.1
    +-1a.2
    +-1a.7
    +-1b.0
    +-1c.0-[04]--
    +-1c.4-[03]----00.0
    +-1c.5-[02]----00.0
    +-1d.0
    +-1d.1
    +-1d.2
    +-1d.7
    +-1e.0-[01]--+-00.0
    | -05.0
    +-1f.0
    +-1f.2
    -1f.3

    (aside: I have had not a single networking freeze since forcing ASPM off. We can, I think, confirm that ASPM being incorrectly enabled was the cause of the freeze. It seems not to cause problems on 100Mb/s links.)

     
  • Nix
    Nix
    2011-03-04

    Was I ever wrong. This affects 100Mb/s interfaces too, with the same symptoms in the register dump. It is definitely correlated with packet volume: I saw it while doing a 500Mb/s download (all of which was flowing over that link at about 6Mb/s).

    After the failure, lspci was... interesting:

    03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: e1000e

    I don't think I've ever seen that error before.

     
  • Bruce Allan
    Bruce Allan
    2011-03-05

    I've finally found a system on which I can reproduce this. One difference is my system dumps "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it" to the system log, and you said yours does not, but the end-result is the same - ASPM L0s is enabled for the 82574L device when it should not be. It looks like the kernel is lead to believe that ASPM is disabled on the system when it apparently is not(*); and when the driver calls the kernel API to disable it for the 82574L device, the API call returns without doing anything since it thinks ASPM is disabled.

    Would you please try the attached patch (foo.diff) created against 2.6.37 to see if your setup is doing the same (or similar) thing as mine. After booting with the patched kernel, grab the outputs of 'lspci -t', 'lspci -vvv' and dmesg and attach them to this bug tracker.

    (*) there are a number of ways this can happen, one of them results in the "ACPI FADT..." message.

    Thanks,
    Bruce.

     
  • Bruce Allan
    Bruce Allan
    2011-03-05

    debug patch against 2.6.37

     
    Attachments
  • Nix
    Nix
    2011-03-06

    Your debugging fired. Looks like get_port_device_capability() is returning 0, either because of an error or because there are no capabilities on any of these ports at all (seems a tad unlikely).

    I could add more debugging to figure out which possibliity it is, I suppose.

     
  • Nix
    Nix
    2011-03-06

    lspci with debugging hack

     
    Attachments
  • Nix
    Nix
    2011-03-06

    I can't add the lspci -vvv or dmesg output as an attachment: massive floods of internal errors from SourceForge's dire bugtracker (how totally unsurprising).

    Here's the essential bit of the dmesg, repeated numerous times:

    [ 1.910645] WARNING: at drivers/pci/pcie/aspm.c:905 pcie_no_aspm+0x2c/0x38()
    [ 1.910892] Hardware name: empty
    [ 1.911125] Disabling PCIe ASPM
    [ 1.911358] Modules linked in:
    [ 1.911626] Pid: 1, comm: swapper Not tainted 2.6.37.2-dirty #2
    [ 1.911869] Call Trace:
    [ 1.912111] [<ffffffff81065fea>] warn_slowpath_common+0x85/0x9d
    [ 1.912356] [<ffffffff810660a5>] warn_slowpath_fmt+0x46/0x48
    [ 1.912608] [<ffffffff8122ad60>] ? pcie_port_acpi_setup+0x80/0xd8
    [ 1.912852] [<ffffffff8122a244>] pcie_no_aspm+0x2c/0x38
    [ 1.913085] [<ffffffff8122a4c8>] pcie_port_device_register+0x34/0x4b5
    [ 1.913332] [<ffffffff8115463d>] ? sysfs_add_one+0x63/0x91
    [ 1.913575]
    [<ffffffff8115434d>] ? sysfs_addrm_finish+0x34/0xc3
    [ 1.913819]
    [<ffffffff8115468c>] ? sysfs_add_one+0x21/0xfd
    [ 1.914062]
    [<ffffffff8145c3ad>] pcie_portdrv_probe+0x59/0x70
    [ 1.914302]
    [<ffffffff8122439a>] local_pci_probe+0x4d/0x96
    [ 1.914545]
    [<ffffffff81224713>] pci_device_probe+0x6a/0x9b
    [ 1.914788]
    [<ffffffff81154f84>] ? sysfs_create_link+0x13/0x15
    [ 1.915032]
    [<ffffffff8128e6bc>] driver_probe_device+0xb7/0x151
    [ 1.915276]
    [<ffffffff8128e7b3>]
    driver_attach+0x5d/0x81
    [ 1.915518] [<ffffffff8128e756>] ? driver_attach+0x0/0x81
    [ 1.915768]
    [<ffffffff8128d844>] bus_for_each_dev+0x56/0x82
    [ 1.916011]
    [<ffffffff8128e38a>] driver_attach+0x1e/0x20
    [ 1.916254]
    [<ffffffff8128dfcc>] bus_add_driver+0x101/0x255
    [ 1.916498]
    [<ffffffff818a26ed>] ? pcie_portdrv_init+0x0/0x6d
    [ 1.916739]
    [<ffffffff8128ea12>] driver_register+0x9b/0x108
    [ 1.916981]
    [<ffffffff818a26ed>] ? pcie_portdrv_init+0x0/0x6d
    [ 1.917225]
    [<ffffffff81224973>]
    pci_register_driver+0x56/0xc2
    [ 1.917469] [<ffffffff818a26ed>] ? pcie_portdrv_init+0x0/0x6d
    [ 1.917712] [<ffffffff818a2749>] pcie_portdrv_init+0x5c/0x6d
    [ 1.917956] [<ffffffff810002ff>] do_one_initcall+0x7f/0x130
    [ 1.918204] [<ffffffff81882c9f>] kernel_init+0x143/0x1cc
    [ 1.918444] [<ffffffff81030e54>] kernel_thread_helper+0x4/0x10
    [ 1.918685] [<ffffffff81882b5c>] ? kernel_init+0x0/0x1cc
    [ 1.918925] [<ffffffff81030e50>] ? kernel_thread_helper+0x0/0x10
    [ 1.919171] ---[ end trace 6d450e935ee1897c ]---

     
  • Bruce Allan
    Bruce Allan
    2011-03-08

    So, this looks like a bug in the PCI subsystem; should probably pull in linux-pci@vger.kernel.org to the discussion. IIRC, you said this used to work but started breaking in 2.6.36 through now, correct? Would it be possible for you to do a git bisect to find the commit that is causing the problem?

     
  • Nix
    Nix
    2011-03-08

    I really don't think bisection will work here: I tried a while back, but it's pointless. The problem is so very intermittent that I'd need to run for a couple of weeks to conclude that a revision was good, and the whole time I'd need to be prepared for this headless server's NIC going dead :/

    However, It feels to me like we're closing in on a smoking gun. I'll blow some time piling more trace in tomorrow evening and tracking down the point where the kernel comes to the conclusion that we are ASPMless. get_port_device_capability() is definitely the next place to check out: why does it think we have none? lspci shows that this must be wrong. Thus, this is probably an error return which is being discarded. Next thing to do: printk() out that error.

     
  • Bruce Allan
    Bruce Allan
    2011-03-17

    Nix, any update on your investigation on why the kernel incorrectly assumes ASPM is disabled?

     
  • Nix
    Nix
    2011-03-17

    No, sorry for the delay: I've been flooded with job-switch-related stuff and have had no time at all.

    I'll see if I can squeeze in some diagnostics this weekend.

     
  • Nix
    Nix
    2011-03-19

    Done some debugging (with 2.6.38: you will be unsurprised to learn that the problem persists there).

    get_port_device_capability() is apparently not failing for these devices. It returns 8 for both cards (for devices 0000:00:1c.[45], and also for the peculiar unlabelled 0000:00:1c.0, which lspci -vvv lists as a PCI bridge but which (despite the similar numbering) lspci -t says these two devices are not behind. For every other device on the system, it returns zero (no capabilities).

    This is distinctly peculiar, given that earlier debugging suggests that get_port_device_capability() must be returning zero to pcie_port_device_register() in this case. It occurs to me that said earlier debugging did not in fact indicate which device it was warning about: I bet it was one of the non-ASPM-capable devices, i.e., not the e1000 at all. So that earlier debugging patch probably gave no useful info.

    I'll pile in more printk()s into pcie_port_device_register() and report back.

     
  • Nix
    Nix
    2011-03-19

    Proved, I'm afraid: the pcie_no_aspm() call which fired your warning related to the four other PCIe devices on the system which do not support ASPM.

    0000:00:1c.0, 4, and 5 all leave pcie_port_device_register() with nr_service set to 1 (thus, they return successfully) and do not call pcie_no_aspm().

    Thus, your debugging patch earlier produced no output at all related to the e1000e: the problem must lie elsewhere.

     
  • Bruce Allan
    Bruce Allan
    2011-03-22

    The issue with ASPM remaining enabled when it shouldn't be appears to be fixed (on my system at least) using the for-linus branch in Jesse Barnes' pci-2.6 tree[1]. I have not dug into exactly which patch(es) fixes the issue (possibly the patchset that was applied just today). Would it be possible for you to try that tree to see if you still have problems with your 82574 device?

    [1] git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6.git

     
  • Nix
    Nix
    2011-03-22

    That does look plausible, doesn't it. Unfortunately the failure mode is the same with that tree applied, with or without the patch: ASPM remains enabled: no change. (The set of ASPM-related messages gains an extra line:

    Unable to assume _OSC PCIe control. Disabling ASPM

    which appears somewhat inaccurate, alas.)

    I'll do some more debugging this weekend... (starting the Thursday after that my current job finishes and I can spend a lot more time on this.)

     
  • Nix
    Nix
    2011-03-22

    Still no change, I'm afraid.

     
  • Bruce Allan
    Bruce Allan
    2011-03-23

    ASPM still enabled, eh? Hmm, this has gotten out of my expertise I'm afraid, and when using the pci-2.6 tree with the most recent patches I am not able to reproduce the problem on any of my systems. You should consider taking the issue of ASPM L0s not getting disabled on the adapter in your system to the PCI experts on the linux-pci@vger.kernel.org mailing list. I'll continue to monitor the situation, but the PCI maintainers are far more knowledgable of the PCI code than I am and you have a much better chance getting this resolved through them.

     
  • Todd Fujinaka
    Todd Fujinaka
    2013-07-09

    Closing due to inactivity.

     
  • Todd Fujinaka
    Todd Fujinaka
    2013-07-09

    • status: open --> closed