#100 e1000e intermittent freeze-until-reboot in 2.6.36+

closed
Bruce Allan
e1000e (106)
in-kernel_driver
6
2013-07-09
2011-02-02
Nix
No

This is possibly ASPM-related: diagnostics to confirm this are under way.

Described in full in http://sourceforge.net/mailarchive/forum.php?thread_name=87k4kfq1at.fsf%40spindle.srvr.nix&forum_name=e1000-devel, in brief, after the hang, a register dump looks like this:

Offset Values
-------- -----
000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
010: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
020: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
030: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
060: 06 88 00 00 06 88 00 00 00 00 00 00 00 00 00 00
070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Keeping the adapter totally idle or persistently active (via pingflooding and apparently even ping -s 1) keeps the hang from happening.
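
The run of ff bytes in the first three rows is the pattern that reads from an unresponsive PCI device typically return. As a rough illustration only, here is a minimal Python sketch that parses a text dump in the "offset: bytes" layout shown above and counts how many leading bytes read as 0xff. The function names are hypothetical and are not part of ethtool or the driver.

```python
def parse_register_dump(text):
    """Parse lines like '000: ff ff ...' into bytes (hypothetical helper)."""
    data = bytearray()
    for line in text.splitlines():
        offset, sep, values = line.partition(":")
        if not sep:
            continue  # header or separator line without a colon
        try:
            int(offset.strip(), 16)
        except ValueError:
            continue  # prefix is not a hex offset, so not a data row
        data.extend(int(tok, 16) for tok in values.split())
    return bytes(data)


def leading_all_ones(data):
    """Return the number of consecutive 0xff bytes at the start of the dump."""
    count = 0
    for byte in data:
        if byte != 0xFF:
            break
        count += 1
    return count
```

A result equal to the length of the dump, or a large leading run as in this report, suggests the device has stopped answering reads rather than holding real register values.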

Discussion

<< < 1 2 3 4 > >> (Page 2 of 4)
  • Bruce Allan
    2011-03-01

    I am unable to reproduce the issue with ASPM not getting disabled on the hardware that I currently have.

    To confirm (please forgive me if the information has already been provided):
    What type of system is this?
    When not specifying the pcie_aspm kernel parameter, do you see in the kernel log the message "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it"?

    Would you also please provide the output of 'lspci -t'?

     
  • Nix
    2011-03-01

    This is a DNUK Netserver, a PCIe-based Nehalem L5520.

    The string 'ACPI FADT' appears nowhere in any boot logs I possess (going back to October last year), which are mostly logs of boots done without the pcie_aspm parameter.

    lspci -t output:

    -[0000:00]-+-00.0
               +-01.0-[0a]--
               +-03.0-[09]--
               +-05.0-[08]--
               +-07.0-[05-07]--+-00.0-[07]----0e.0
               |               \-00.2-[06]--
               +-13.0
               +-14.0
               +-14.1
               +-14.2
               +-14.3
               +-16.0
               +-16.1
               +-16.2
               +-16.3
               +-16.4
               +-16.5
               +-16.6
               +-16.7
               +-1a.0
               +-1a.1
               +-1a.2
               +-1a.7
               +-1b.0
               +-1c.0-[04]--
               +-1c.4-[03]----00.0
               +-1c.5-[02]----00.0
               +-1d.0
               +-1d.1
               +-1d.2
               +-1d.7
               +-1e.0-[01]--+-00.0
               |            \-05.0
               +-1f.0
               +-1f.2
               \-1f.3

    (aside: I have had not a single networking freeze since forcing ASPM off. We can, I think, confirm that ASPM being incorrectly enabled was the cause of the freeze. It seems not to cause problems on 100Mb/s links.)

     
  • Nix
    2011-03-04

    Was I ever wrong. This affects 100Mb/s interfaces too, with the same symptoms in the register dump. It is definitely correlated with packet volume: I saw it during a 500Mb download (all of which was flowing over that link at about 6Mb/s).

    After the failure, lspci was... interesting:

    03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: e1000e

    I don't think I've ever seen that error before.
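
The "rev ff", "prog-if ff" and "Unknown header type 7f" values are consistent with every configuration-space read returning all ones: lspci masks off the multi-function bit (0x80) of the header type byte, and 0xff & 0x7f is 0x7f. A minimal Python sketch of that arithmetic, assuming the raw configuration space is available as bytes (for example read from the device's sysfs config file). The helper name is hypothetical; the offsets are the standard PCI header offsets.

```python
REVISION_ID_OFFSET = 0x08   # standard PCI configuration header offsets
PROG_IF_OFFSET = 0x09
HEADER_TYPE_OFFSET = 0x0E


def describe_header(config):
    """Summarise the header fields lspci printed, from raw config-space bytes."""
    header_type = config[HEADER_TYPE_OFFSET] & 0x7F  # drop the multi-function bit
    return {
        "revision": config[REVISION_ID_OFFSET],
        "prog_if": config[PROG_IF_OFFSET],
        "header_type": header_type,
        # Types 0, 1 and 2 are the defined layouts; anything else is unknown.
        "header_type_known": header_type in (0x00, 0x01, 0x02),
        "vendor_all_ones": config[0:2] == b"\xff\xff",
    }


# Simulate a device whose config space reads back as all ones.
dead = describe_header(b"\xff" * 64)
print(f"rev {dead['revision']:02x}, prog-if {dead['prog_if']:02x}, "
      f"header type {dead['header_type']:02x}")
# prints "rev ff, prog-if ff, header type 7f"
```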

     
  • Bruce Allan
    2011-03-05

    I've finally found a system on which I can reproduce this. One difference is my system dumps "ACPI FADT declares the system doesn't support PCIe ASPM, so disable it" to the system log, and you said yours does not, but the end-result is the same - ASPM L0s is enabled for the 82574L device when it should not be. It looks like the kernel is led to believe that ASPM is disabled on the system when it apparently is not(*); and when the driver calls the kernel API to disable it for the 82574L device, the API call returns without doing anything since it thinks ASPM is disabled.

    Would you please try the attached patch (foo.diff) created against 2.6.37 to see if your setup is doing the same (or similar) thing as mine. After booting with the patched kernel, grab the outputs of 'lspci -t', 'lspci -vvv' and dmesg and attach them to this bug tracker.

    (*) there are a number of ways this can happen, one of them results in the "ACPI FADT..." message.

    Thanks,
    Bruce.
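
Whether ASPM L0s is actually enabled on a device can be checked independently of what the kernel believes, by reading the ASPM Control bits (bits 1:0) of the Link Control register in the device's PCI Express capability. The following is a minimal Python sketch under stated assumptions: the configuration space has been read as bytes (for example from the device's sysfs config file), and the function names are hypothetical. The register offsets and capability ID are the standard PCI and PCI Express ones.

```python
import struct

PCI_STATUS = 0x06            # status register; bit 4 = capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10        # Link Control offset within the PCIe capability

ASPM_STATES = {0: "disabled", 1: "L0s", 2: "L1", 3: "L0s and L1"}


def find_pcie_capability(config):
    """Walk the capability list and return the PCIe capability offset, or None."""
    if len(config) < 0x40:
        return None
    status = struct.unpack_from("<H", config, PCI_STATUS)[0]
    if status == 0xFFFF or not status & 0x10:
        return None  # device not responding, or no capability list
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guard against a looping capability chain
    while pos and pos not in seen and pos + 1 < len(config):
        seen.add(pos)
        if config[pos] == PCI_CAP_ID_EXP:
            return pos
        pos = config[pos + 1] & 0xFC
    return None


def aspm_state(config):
    """Return the ASPM setting from Link Control, or None if it cannot be read."""
    cap = find_pcie_capability(config)
    if cap is None or cap + PCI_EXP_LNKCTL + 2 > len(config):
        return None
    lnkctl = struct.unpack_from("<H", config, cap + PCI_EXP_LNKCTL)[0]
    return ASPM_STATES[lnkctl & 0x3]
```

For the adapter in this report, the input would be the contents of /sys/bus/pci/devices/0000:03:00.0/config opened in binary mode. Reading past the first 64 bytes of that file usually requires root, and the capability may lie beyond that point.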

     
  • Bruce Allan
    2011-03-05

    debug patch against 2.6.37

     
    Attachments
  • Nix
    2011-03-06

    Your debugging fired. Looks like get_port_device_capability() is returning 0, either because of an error or because there are no capabilities on any of these ports at all (seems a tad unlikely).

    I could add more debugging to figure out which possibility it is, I suppose.

     
  • Nix
    2011-03-06

    lspci with debugging hack

     
    Attachments
  • Nix
    2011-03-06

    I can't add the lspci -vvv or dmesg output as an attachment: massive floods of internal errors from SourceForge's dire bugtracker (how totally unsurprising).

    Here's the essential bit of the dmesg, repeated numerous times:

    [ 1.910645] WARNING: at drivers/pci/pcie/aspm.c:905 pcie_no_aspm+0x2c/0x38()
    [ 1.910892] Hardware name: empty
    [ 1.911125] Disabling PCIe ASPM
    [ 1.911358] Modules linked in:
    [ 1.911626] Pid: 1, comm: swapper Not tainted 2.6.37.2-dirty #2
    [ 1.911869] Call Trace:
    [ 1.912111] [] warn_slowpath_common+0x85/0x9d
    [ 1.912356] [] warn_slowpath_fmt+0x46/0x48
    [ 1.912608] [] ? pcie_port_acpi_setup+0x80/0xd8
    [ 1.912852] [] pcie_no_aspm+0x2c/0x38
    [ 1.913085] [] pcie_port_device_register+0x34/0x4b5
    [ 1.913332] [] ? sysfs_add_one+0x63/0x91
    [ 1.913575] [] ? sysfs_addrm_finish+0x34/0xc3
    [ 1.913819] [] ? sysfs_add_one+0x21/0xfd
    [ 1.914062] [] pcie_portdrv_probe+0x59/0x70
    [ 1.914302] [] local_pci_probe+0x4d/0x96
    [ 1.914545] [] pci_device_probe+0x6a/0x9b
    [ 1.914788] [] ? sysfs_create_link+0x13/0x15
    [ 1.915032] [] driver_probe_device+0xb7/0x151
    [ 1.915276] [] driver_attach+0x5d/0x81
    [ 1.915518] [] ? driver_attach+0x0/0x81
    [ 1.915768] [] bus_for_each_dev+0x56/0x82
    [ 1.916011] [] driver_attach+0x1e/0x20
    [ 1.916254] [] bus_add_driver+0x101/0x255
    [ 1.916498] [] ? pcie_portdrv_init+0x0/0x6d
    [ 1.916739] [] driver_register+0x9b/0x108
    [ 1.916981] [] ? pcie_portdrv_init+0x0/0x6d
    [ 1.917225] [] pci_register_driver+0x56/0xc2
    [ 1.917469] [] ? pcie_portdrv_init+0x0/0x6d
    [ 1.917712] [] pcie_portdrv_init+0x5c/0x6d
    [ 1.917956] [] do_one_initcall+0x7f/0x130
    [ 1.918204] [] kernel_init+0x143/0x1cc
    [ 1.918444] [] kernel_thread_helper+0x4/0x10
    [ 1.918685] [] ? kernel_init+0x0/0x1cc
    [ 1.918925] [] ? kernel_thread_helper+0x0/0x10
    [ 1.919171] ---[ end trace 6d450e935ee1897c ]---

     
  • Bruce Allan
    2011-03-08

    So, this looks like a bug in the PCI subsystem; should probably pull in linux-pci@vger.kernel.org to the discussion. IIRC, you said this used to work but started breaking in 2.6.36 through now, correct? Would it be possible for you to do a git bisect to find the commit that is causing the problem?

     
  • Nix
    2011-03-08

    I really don't think bisection will work here: I tried a while back, but it's pointless. The problem is so very intermittent that I'd need to run for a couple of weeks to conclude that a revision was good, and the whole time I'd need to be prepared for this headless server's NIC going dead :/

    However, it feels to me like we're closing in on a smoking gun. I'll blow some time piling more trace in tomorrow evening and tracking down the point where the kernel comes to the conclusion that we are ASPMless. get_port_device_capability() is definitely the next place to check out: why does it think we have none? lspci shows that this must be wrong. Thus, this is probably an error return which is being discarded. Next thing to do: printk() out that error.

     