#385 e1000e - eth0: Reset adapter unexpectedly - leading to kpanic

closed
Todd Fujinaka
None
in-kernel_driver
1
42 minutes ago
2013-12-03
Nick Pegg
No

I had recently posted this to LKML/e1000-devel and was asked to open a ticket here.

I've been seeing some servers hit a condition where they receive a
large number of packets (over 500,000 per second, for example) which ultimately
causes a kernel panic due to a null pointer dereference. I've included
the tracebacks below.

I also got the output of gdb showing the line number where the crash occurs, which appears to be because the skb pointer is null. The above code is from kernel version 3.9.2 and e1000e driver version 2.2.14-k.

I have not been able to reproduce this condition in my lab, but out in the field
I've seen this happen with kernel versions 3.7.6 through 3.9.2 (e1000e
driver versions 2.1.4-k through 2.2.14-k), running with Intel 82574L
NICs. The kernel is running as a Xen Dom0.

I've seen previous posts to this mailing list suggesting that this is
a hardware issue (the mitigation being turning TSO/GSO off), however
those tracebacks didn't show the interface getting unexpectedly reset,
causing the null pointer dereference. In my case, is this possibly a problem with
the e1000e driver not gracefully handling the reset?

Please let me know if you need any other information beyond what I've provided.


Nov 16 07:03:19 rx ------------[ cut here ]------------
Nov 16 07:03:19 rx WARNING: at net/sched/sch_generic.c:255
dev_watchdog+0x25b/0x270()
Nov 16 07:03:19 rx Hardware name: X8DT6
Nov 16 07:03:19 rx NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
ebtable_filter 8021q mrp e1000e ptp pps_core
Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Not tainted 3.9.2-1 #1
Nov 16 07:03:19 rx Call Trace:
Nov 16 07:03:19 rx  <IRQ>  [<ffffffff8105070a>] warn_slowpath_common+0x7a/0xc0
Nov 16 07:03:19 rx  [<ffffffff810507f1>] warn_slowpath_fmt+0x41/0x50
Nov 16 07:03:19 rx  [<ffffffff8168e2ab>] dev_watchdog+0x25b/0x270
Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
Nov 16 07:03:19 rx  [<ffffffff81060464>] call_timer_fn+0x44/0x120
Nov 16 07:03:19 rx  [<ffffffff81060a71>] run_timer_softirq+0x241/0x2b0
Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
Nov 16 07:03:19 rx  <EOI>  [<ffffffff813b3722>] ? delay_tsc+0x32/0x80
Nov 16 07:03:19 rx  [<ffffffff813b373a>] ? delay_tsc+0x4a/0x80
Nov 16 07:03:19 rx  [<ffffffff813b36e8>] ? __const_udelay+0x28/0x30
Nov 16 07:03:19 rx  [<ffffffffa0025b0e>] ?
e1000e_read_phy_reg_mdic+0xce/0x120 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0018bd0>] ?
e1000_get_hw_semaphore_82574+0x20/0x40 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0027a75>] ?
e1000e_read_phy_reg_bm2+0x55/0xb0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa00330d6>] ?
e1000e_flush_descriptors+0x96/0x270 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa00181b7>] ?
e1000_check_phy_82574+0x27/0x60 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0034178>] ?
e1000_watchdog_task+0x648/0x830 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182d ]---
Nov 16 07:03:19 rx e1000e 0000:03:00.0 eth0: Reset adapter unexpectedly
Nov 16 07:03:19 rx BUG: unable to handle kernel NULL pointer
dereference at 00000000000000d0
Nov 16 07:03:19 rx IP: [<ffffffffa0031d51>]
e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx PGD 4a6c3067 PUD 4f440067 PMD 0
Nov 16 07:03:19 rx Oops: 0000 [#1] SMP
Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
ebtable_filter 8021q mrp e1000e ptp pps_core
Nov 16 07:03:19 rx CPU 0
Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Tainted: G        W
 3.9.2-1 #1 Supermicro X8DT6/X8DT6
Nov 16 07:03:19 rx RIP: e030:[<ffffffffa0031d51>]
[<ffffffffa0031d51>] e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx RSP: e02b:ffff88008ea03d98  EFLAGS: 00010202
Nov 16 07:03:19 rx RAX: 000000000000001a RBX: ffffc900115f9000 RCX:
ffff88008ea03e64
Nov 16 07:03:19 rx RDX: ffff88008ea03e64 RSI: ffff880002dc6e00 RDI:
ffff8800844026c0
Nov 16 07:03:19 rx RBP: ffff88008ea03e38 R08: 00000000000169e0 R09:
ffffea00000b7180
Nov 16 07:03:19 rx R10: 0000000000020063 R11: 0000000000000000 R12:
0000000000000000
Nov 16 07:03:19 rx R13: ffff880081fb6000 R14: ffff88008016c700 R15:
ffff880081fb6000
Nov 16 07:03:19 rx FS:  00007fde5acb8700(0000)
GS:ffff88008ea00000(0000) knlGS:0000000000000000
Nov 16 07:03:19 rx CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 16 07:03:19 rx CR2: 00000000000000d0 CR3: 0000000071e80000 CR4:
0000000000002660
Nov 16 07:03:19 rx DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
Nov 16 07:03:19 rx DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
Nov 16 07:03:19 rx Process kworker/0:0 (pid: 14268, threadinfo
ffff880045f8a000, task ffff88005c9c0000)
Nov 16 07:03:19 rx Stack:
Nov 16 07:03:19 rx  ffff880080220000 ffff880002dc6a00 000001d70000003c
0000001100020063
Nov 16 07:03:19 rx  000000408ea03e18 ffff880083d7d000 000005dc00000019
ffff88008016c000
Nov 16 07:03:19 rx  ffff880083d7d098 0000000000000000 ffffc900115f9000
ffff88008ea03e64
Nov 16 07:03:19 rx Call Trace:
Nov 16 07:03:19 rx  <IRQ>
Nov 16 07:03:19 rx  [<ffffffffa003354d>] e1000e_poll+0xbd/0x2d0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff81673d7f>] net_rx_action+0xaf/0x220
Nov 16 07:03:19 rx  [<ffffffff810c4e72>] ? handle_edge_irq+0x92/0x140
Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
Nov 16 07:03:19 rx  <EOI>
Nov 16 07:03:19 rx  [<ffffffff81664200>] ? skb_release_data+0x80/0xf0
Nov 16 07:03:19 rx  [<ffffffff8113291b>] ? __slab_free+0x14b/0x380
Nov 16 07:03:19 rx  [<ffffffff810f9dcb>] ? __free_pages_ok+0x8b/0xa0
Nov 16 07:03:19 rx  [<ffffffff810f9df6>] ? free_compound_page+0x16/0x20
Nov 16 07:03:19 rx  [<ffffffff810fdf27>] ? __put_compound_page+0x17/0x20
Nov 16 07:03:19 rx  [<ffffffff81132dcc>] ? kmem_cache_free+0x14c/0x170
Nov 16 07:03:19 rx  [<ffffffff816642f2>] ? __kfree_skb+0x42/0xa0
Nov 16 07:03:19 rx  [<ffffffff816643bc>] ? consume_skb+0x2c/0x70
Nov 16 07:03:19 rx  [<ffffffffa00324d7>] ?
e1000_clean_rx_ring+0xb7/0x2c0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0034bec>] ? e1000e_down+0x10c/0x260 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0037ff4>] ?
e1000e_reinit_locked+0x44/0x60 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0038102>] ? e1000_reset_task+0x42/0x7b0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff8106145a>] ? mod_timer+0x11a/0x1f0
Nov 16 07:03:19 rx  [<ffffffffa00340a6>] ?
e1000_watchdog_task+0x576/0x830 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx Code: 3b 45 84 0f 8d f1 01 00 00 48 8b 5d b0 4d 89
fd 48 03 5d a8 48 8b 4d b8 ff c0 89 01 0f ae e8 4c 8b 63 08 48 c7 43
08 00 00 00 00 <49> 8b 84 24 d0 00 00 00 48 8b 75 c0 ff 45 cc 0f 18 08
8b 45 cc
Nov 16 07:03:19 rx RIP  [<ffffffffa0031d51>]
e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx  RSP <ffff88008ea03d98>
Nov 16 07:03:19 rx CR2: 00000000000000d0
Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182e ]---
Nov 16 07:03:19 rx Kernel panic - not syncing: Fatal exception in interrupt


(gdb) l *e1000_clean_rx_irq+0x101
0x19d81 is in e1000_clean_rx_irq
(drivers/net/ethernet/intel/e1000e/netdev.c:933).
928                     rmb();  /* read descriptor and rx_buffer_info after status DD */
929
930                     skb = buffer_info->skb;
931                     buffer_info->skb = NULL;
932
933                     prefetch(skb->data - NET_IP_ALIGN);
934
935                     i++;
936                     if (i == rx_ring->count)
937                             i = 0;

Related

Bugs: #1

Discussion

<< < 1 2 3 > >> (Page 2 of 3)
  • Nick Pegg
    Nick Pegg
    2014-01-29

    Yeah, sorry, I should have mentioned that it's happened multiple times in the field but I've never been able to make it happen in my lab.

    I've attached the output of lspci -vvv as lspci.txt.

    What logs are you interested in specifically? Syslog? Serial console logs? The log file in my last update including the traceback are from the serial console and are the only things in the logs from near the crash up to the server rebooting.

     
    Attachments
  • Todd Fujinaka
    Todd Fujinaka
    2014-01-30

    So my theory was that ASPM wasn't turned off for all the machines in the field, but if you're sure that the lspci is from the failing system then we should look into other possible points of failure. The serial console log, dmesg, lspci -vvv as root, and eeprom dump would be the first things I'd want to look at.

     
  • Nick Pegg
    Nick Pegg
    2014-02-13

    Attached to this update is a tarball of some console logs, lspci -vvv, and eeprom dumps from a multitude of hosts which have recently come across this issue. I apologize that some of the console logs were cut off due to the DoS but some should contain full stack traces.

    Additionally, these were triggered by an NTP reflection attack instead of the usual SYN flood. The rate of packets was still incredibly high, one packet every 2 or 3 microseconds according to the tcpdump.

     
    Attachments
  • Nick Pegg
    Nick Pegg
    2014-02-27

    Todd- Have you gotten any closer to tracking down the source of the problem? Any luck with reproducing this in your lab?

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-02-27

    No luck at all. I think the lab wanted to close the internal bug.

    I completely missed the 2/13 reply, so let me check out the logs right now.

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-02-27

    Have you been able to try the out-of-tree driver from e1000.sourceforge.net? And what driver are you running in your VMs?

     
  • Nick Pegg
    Nick Pegg
    2014-02-27

    I got e1000e version 2.5.4 (out-of-tree) rolled into our latest server stack and have about ~50 servers running with it. I haven't seen them crash out, but it doesn't necessarily mean that it solved the problem. I'll let you know if I see evidence that the new driver fixed it.

    As for the driver in the VMs, ethtool -i eth0 reports that the driver is 'vif'.

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-02-27

    Sorry for the thrash, but I just went through the notes and talked to the repro lab as well as our VM guys and here's how we'd like to proceed. We haven't tried a local repro without exact repro steps.

    First, we need you to reproduce the issue in your lab. If you can't reproduce it, we're pretty sure we wouldn't be able to reproduce it either. I'm not sure how much traffic you sent to your system under test, but you need to hammer it with as much traffic as you're seeing in your live systems. One way you may be able to do this is with Chariot. Another way is to run DPDK and just send out as much traffic as it can send to your SUT. Yet another way is to have several traffic generator systems and a fast enough switch.

    Second, after you reproduce the issue, we're going to need you to reproduce this on the latest driver, or at least the latest kernel from the distro you're running.

    Finally, we're going to need the exact setup and repro steps so we can examine the system locally.

    Unfortunately I haven't found a bug in our database that sounds exactly like what you're seeing so without your repro I'm not going to be able to give you a concrete answer.

    I hope that helps.

     
  • Nick Pegg
    Nick Pegg
    2014-04-23

    Sorry for the lack of updates. I still haven't been able to reproduce this bug in the lab.

    However, we have a decent amount of our e1000e-based fleet running driver version 2.5.4 and we haven't seen any crashes in the path month. I'm willing to say that using a later out-of-tree driver did the trick. You can go ahead and close this bug report, and I'll re-open it or open a new bug report if this problem is still around.

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-04-23

    • status: open --> closed
     
<< < 1 2 3 > >> (Page 2 of 3)