#100 e1000e intermittent freeze-until-reboot in 2.6.36+

closed
e1000e (108)
in-kernel_driver
6
2013-07-09
2011-02-02
Nix
No

This is possibly ASPM-related: diagnostics to determine it are going on now.

Described in full in http://sourceforge.net/mailarchive/forum.php?thread_name=87k4kfq1at.fsf%40spindle.srvr.nix&forum_name=e1000-devel, in brief, after the hang, a register dump looks like this:

Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 06 88 00 00 06 88 00 00 00 00 00 00 00 00 00 00
070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Keeping the adapter totally idle or persistently active (via pingflooding and apparently even ping -s 1) keeps the hang from happening.

Discussion

1 2 > >> (Page 1 of 2)
  • Nix

    Nix - 2011-02-02

    .config from kernel with freezing NIC

     
    Attachments
  • Nix

    Nix - 2011-02-02

    dmesg output

     
    Attachments
  • Nix

    Nix - 2011-02-02

    lspci from working kernel

     
    Attachments
  • Nix

    Nix - 2011-02-02

    lspci from kernel with freezing NIC

     
    Attachments
  • Jesse Brandeburg

    ASPM is enabled on your adapter, and it will not work, and explains your issues.

    if you boot the kernel with pcie_aspm=off appended to kernel line, the issue should go away.

    02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    Subsystem: Intel Corporation Device 0000
    ...
    ==> LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    ...

     
  • Nix

    Nix - 2011-02-19

    I did. It didn't. I reported this to the list about two weeks ago :) currently the only thing that fixes it is a setpci call to force ASPM off directly. This is not ideal!

    (I will double-check with 2.6.37.1 when I reboot into it tomorrow -- perhaps I misspelled aspm last time or something -- and then it's time to stick printks into pcie_aspm_init_link_state(), I think. The only way I can see that this can be going wrong is if it's being initialized late, after the driver has turned ASPM off, but that should be completely impossible.)

     
  • Nix

    Nix - 2011-02-19

    Confirmed.

    append="pcie_aspm=off"

    in /etc/lilo.conf, and the PCI configuration space shows ASPM is on, and we get an unresponsive card with registers filled with FFs in less than a day of uptime.

     
  • Jesse Brandeburg

    I'm hoping bruce (bwa) will be able to either settle this with a -stable patch to the kernel or tell you what commit you need to have in order for the in-kernel e1000e driver to work.

     
  • Nix

    Nix - 2011-02-22

    Assuming an upstream commit does fix it. I'll try git head, probably tomorrow, and see if the ASPM register has the right value after boot there.

     
  • Nix

    Nix - 2011-02-26

    This still goes wrong with current git head.

     
  • Bruce Allan

    Bruce Allan - 2011-03-01

    I am unable to reproduce the issue with ASPM not getting disabled on the hardware that I currently have.

    To confirm (please forgive me if the information has already been provided):
    What type of system is this?
    When not specifying the pcie_aspm kernel parameter, do you see in the kernel log the message "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it"?

    Would you also please provide the output of 'lspci -t'?

     
  • Nix

    Nix - 2011-03-01

    This is a DNUK Netserver, a PCIe-based Nehalem L5520.

    The string 'ACPI FADT' appears nowhere in any boot logs I possess (going back to October last year), which are mostly logs of boots done without the pcie_aspm parameter.

    lspci -t output:

    -[0000:00]-+-00.0
    +-01.0-[0a]--
    +-03.0-[09]--
    +-05.0-[08]--
    +-07.0-[05-07]--+-00.0-[07]----0e.0
    | -00.2-[06]--
    +-13.0
    +-14.0
    +-14.1
    +-14.2
    +-14.3
    +-16.0
    +-16.1
    +-16.2
    +-16.3
    +-16.4
    +-16.5
    +-16.6
    +-16.7
    +-1a.0
    +-1a.1
    +-1a.2
    +-1a.7
    +-1b.0
    +-1c.0-[04]--
    +-1c.4-[03]----00.0
    +-1c.5-[02]----00.0
    +-1d.0
    +-1d.1
    +-1d.2
    +-1d.7
    +-1e.0-[01]--+-00.0
    | -05.0
    +-1f.0
    +-1f.2
    -1f.3

    (aside: I have had not a single networking freeze since forcing ASPM off. We can, I think, confirm that ASPM being incorrectly enabled was the cause of the freeze. It seems not to cause problems on 100Mb/s links.)

     
  • Nix

    Nix - 2011-03-04

    Was I ever wrong. This affects 100Mb/s interfaces too, with the same symptoms in the register dump. It is definitely correlated with packet volume: I saw it while doing a 500Mb/s download (all of which was flowing over that link at about 6Mb/s).

    After the failure, lspci was... interesting:

    03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: e1000e

    I don't think I've ever seen that error before.

     
  • Bruce Allan

    Bruce Allan - 2011-03-05

    I've finally found a system on which I can reproduce this. One difference is my system dumps "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it" to the system log, and you said yours does not, but the end-result is the same - ASPM L0s is enabled for the 82574L device when it should not be. It looks like the kernel is lead to believe that ASPM is disabled on the system when it apparently is not(*); and when the driver calls the kernel API to disable it for the 82574L device, the API call returns without doing anything since it thinks ASPM is disabled.

    Would you please try the attached patch (foo.diff) created against 2.6.37 to see if your setup is doing the same (or similar) thing as mine. After booting with the patched kernel, grab the outputs of 'lspci -t', 'lspci -vvv' and dmesg and attach them to this bug tracker.

    (*) there are a number of ways this can happen, one of them results in the "ACPI FADT..." message.

    Thanks,
    Bruce.

     
  • Bruce Allan

    Bruce Allan - 2011-03-05

    debug patch against 2.6.37

     
    Attachments
  • Nix

    Nix - 2011-03-06

    Your debugging fired. Looks like get_port_device_capability() is returning 0, either because of an error or because there are no capabilities on any of these ports at all (seems a tad unlikely).

    I could add more debugging to figure out which possibliity it is, I suppose.

     
  • Nix

    Nix - 2011-03-06

    lspci with debugging hack

     
    Attachments
  • Nix

    Nix - 2011-03-06

    I can't add the lspci -vvv or dmesg output as an attachment: massive floods of internal errors from SourceForge's dire bugtracker (how totally unsurprising).

    Here's the essential bit of the dmesg, repeated numerous times:

    [ 1.910645] WARNING: at drivers/pci/pcie/aspm.c:905 pcie_no_aspm+0x2c/0x38()
    [ 1.910892] Hardware name: empty
    [ 1.911125] Disabling PCIe ASPM
    [ 1.911358] Modules linked in:
    [ 1.911626] Pid: 1, comm: swapper Not tainted 2.6.37.2-dirty #2
    [ 1.911869] Call Trace:
    [ 1.912111] [<ffffffff81065fea>] warn_slowpath_common+0x85/0x9d
    [ 1.912356] [<ffffffff810660a5>] warn_slowpath_fmt+0x46/0x48
    [ 1.912608] [<ffffffff8122ad60>] ? pcie_port_acpi_setup+0x80/0xd8
    [ 1.912852] [<ffffffff8122a244>] pcie_no_aspm+0x2c/0x38
    [ 1.913085] [<ffffffff8122a4c8>] pcie_port_device_register+0x34/0x4b5
    [ 1.913332] [<ffffffff8115463d>] ? sysfs_add_one+0x63/0x91
    [ 1.913575]
    [<ffffffff8115434d>] ? sysfs_addrm_finish+0x34/0xc3
    [ 1.913819]
    [<ffffffff8115468c>] ? sysfs_add_one+0x21/0xfd
    [ 1.914062]
    [<ffffffff8145c3ad>] pcie_portdrv_probe+0x59/0x70
    [ 1.914302]
    [<ffffffff8122439a>] local_pci_probe+0x4d/0x96
    [ 1.914545]
    [<ffffffff81224713>] pci_device_probe+0x6a/0x9b
    [ 1.914788]
    [<ffffffff81154f84>] ? sysfs_create_link+0x13/0x15
    [ 1.915032]
    [<ffffffff8128e6bc>] driver_probe_device+0xb7/0x151
    [ 1.915276]
    [<ffffffff8128e7b3>]
    driver_attach+0x5d/0x81
    [ 1.915518] [<ffffffff8128e756>] ? driver_attach+0x0/0x81
    [ 1.915768]
    [<ffffffff8128d844>] bus_for_each_dev+0x56/0x82
    [ 1.916011]
    [<ffffffff8128e38a>] driver_attach+0x1e/0x20
    [ 1.916254]
    [<ffffffff8128dfcc>] bus_add_driver+0x101/0x255
    [ 1.916498]
    [<ffffffff818a26ed>] ? pcie_portdrv_init+0x0/0x6d
    [ 1.916739]
    [<ffffffff8128ea12>] driver_register+0x9b/0x108
    [ 1.916981]
    [<ffffffff818a26ed>] ? pcie_portdrv_init+0x0/0x6d
    [ 1.917225]
    [<ffffffff81224973>]
    pci_register_driver+0x56/0xc2
    [ 1.917469] [<ffffffff818a26ed>] ? pcie_portdrv_init+0x0/0x6d
    [ 1.917712] [<ffffffff818a2749>] pcie_portdrv_init+0x5c/0x6d
    [ 1.917956] [<ffffffff810002ff>] do_one_initcall+0x7f/0x130
    [ 1.918204] [<ffffffff81882c9f>] kernel_init+0x143/0x1cc
    [ 1.918444] [<ffffffff81030e54>] kernel_thread_helper+0x4/0x10
    [ 1.918685] [<ffffffff81882b5c>] ? kernel_init+0x0/0x1cc
    [ 1.918925] [<ffffffff81030e50>] ? kernel_thread_helper+0x0/0x10
    [ 1.919171] ---[ end trace 6d450e935ee1897c ]---

     
  • Bruce Allan

    Bruce Allan - 2011-03-08

    So, this looks like a bug in the PCI subsystem; should probably pull in linux-pci@vger.kernel.org to the discussion. IIRC, you said this used to work but started breaking in 2.6.36 through now, correct? Would it be possible for you to do a git bisect to find the commit that is causing the problem?

     
  • Nix

    Nix - 2011-03-08

    I really don't think bisection will work here: I tried a while back, but it's pointless. The problem is so very intermittent that I'd need to run for a couple of weeks to conclude that a revision was good, and the whole time I'd need to be prepared for this headless server's NIC going dead :/

    However, It feels to me like we're closing in on a smoking gun. I'll blow some time piling more trace in tomorrow evening and tracking down the point where the kernel comes to the conclusion that we are ASPMless. get_port_device_capability() is definitely the next place to check out: why does it think we have none? lspci shows that this must be wrong. Thus, this is probably an error return which is being discarded. Next thing to do: printk() out that error.

     
  • Bruce Allan

    Bruce Allan - 2011-03-17

    Nix, any update on your investigation on why the kernel incorrectly assumes ASPM is disabled?

     
  • Nix

    Nix - 2011-03-17

    No, sorry for the delay: I've been flooded with job-switch-related stuff and have had no time at all.

    I'll see if I can squeeze in some diagnostics this weekend.

     
  • Nix

    Nix - 2011-03-19

    Done some debugging (with 2.6.38: you will be unsurprised to learn that the problem persists there).

    get_port_device_capability() is apparently not failing for these devices. It returns 8 for both cards (for devices 0000:00:1c.[45], and also for the peculiar unlabelled 0000:00:1c.0, which lspci -vvv lists as a PCI bridge but which (despite the similar numbering) lspci -t says these two devices are not behind. For every other device on the system, it returns zero (no capabilities).

    This is distinctly peculiar, given that earlier debugging suggests that get_port_device_capability() must be returning zero to pcie_port_device_register() in this case. It occurs to me that said earlier debugging did not in fact indicate which device it was warning about: I bet it was one of the non-ASPM-capable devices, i.e., not the e1000 at all. So that earlier debugging patch probably gave no useful info.

    I'll pile in more printk()s into pcie_port_device_register() and report back.

     
  • Nix

    Nix - 2011-03-19

    Proved, I'm afraid: the pcie_no_aspm() call which fired your warning related to the four other PCIe devices on the system which do not support ASPM.

    0000:00:1c.0, 4, and 5 all leave pcie_port_device_register() with nr_service set to 1 (thus, they return successfully) and do not call pcie_no_aspm().

    Thus, your debugging patch earlier produced no output at all related to the e1000e: the problem must lie elsewhere.

     
  • Bruce Allan

    Bruce Allan - 2011-03-22

    The issue with ASPM remaining enabled when it shouldn't be appears to be fixed (on my system at least) using the for-linus branch in Jesse Barnes' pci-2.6 tree[1]. I have not dug into exactly which patch(es) fixes the issue (possibly the patchset that was applied just today). Would it be possible for you to try that tree to see if you still have problems with your 82574 device?

    [1] git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6.git

     
1 2 > >> (Page 1 of 2)

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks