#385 e1000e - eth0: Reset adapter unexpectedly - leading to kpanic

closed
None
in-kernel_driver
1
2015-03-02
2013-12-03
Nick Pegg
No

I had recently posted this to LKML/e1000-devel and was asked to open a ticket here.

I've been seeing some servers hit a condition where they receive a
large number of packets (over 500,000 per second, for example) which ultimately
causes a kernel panic due to a null pointer dereference. I've included
the tracebacks below.

I also got the output of gdb showing the line number where the crash occurs, which appears to be because the skb pointer is null. The above code is from kernel version 3.9.2 and e1000e driver version 2.2.14-k.

I have not been able to reproduce this condition in my lab, but out in the field
I've seen this happen with kernel versions 3.7.6 through 3.9.2 (e1000e
driver versions 2.1.4-k through 2.2.14-k), running with Intel 82574L
NICs. The kernel is running as a Xen Dom0.

I've seen previous posts to this mailing list suggesting that this is
a hardware issue (the mitigation being turning TSO/GSO off), however
those tracebacks didn't show the interface getting unexpectedly reset,
causing the null pointer dereference. In my case, is this possibly a problem with
the e1000e driver not gracefully handling the reset?

Please let me know if you need any other information beyond what I've provided.


Nov 16 07:03:19 rx ------------[ cut here ]------------
Nov 16 07:03:19 rx WARNING: at net/sched/sch_generic.c:255
dev_watchdog+0x25b/0x270()
Nov 16 07:03:19 rx Hardware name: X8DT6
Nov 16 07:03:19 rx NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
ebtable_filter 8021q mrp e1000e ptp pps_core
Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Not tainted 3.9.2-1 #1
Nov 16 07:03:19 rx Call Trace:
Nov 16 07:03:19 rx  <IRQ>  [<ffffffff8105070a>] warn_slowpath_common+0x7a/0xc0
Nov 16 07:03:19 rx  [<ffffffff810507f1>] warn_slowpath_fmt+0x41/0x50
Nov 16 07:03:19 rx  [<ffffffff8168e2ab>] dev_watchdog+0x25b/0x270
Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
Nov 16 07:03:19 rx  [<ffffffff81060464>] call_timer_fn+0x44/0x120
Nov 16 07:03:19 rx  [<ffffffff81060a71>] run_timer_softirq+0x241/0x2b0
Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
Nov 16 07:03:19 rx  <EOI>  [<ffffffff813b3722>] ? delay_tsc+0x32/0x80
Nov 16 07:03:19 rx  [<ffffffff813b373a>] ? delay_tsc+0x4a/0x80
Nov 16 07:03:19 rx  [<ffffffff813b36e8>] ? __const_udelay+0x28/0x30
Nov 16 07:03:19 rx  [<ffffffffa0025b0e>] ?
e1000e_read_phy_reg_mdic+0xce/0x120 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0018bd0>] ?
e1000_get_hw_semaphore_82574+0x20/0x40 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0027a75>] ?
e1000e_read_phy_reg_bm2+0x55/0xb0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa00330d6>] ?
e1000e_flush_descriptors+0x96/0x270 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa00181b7>] ?
e1000_check_phy_82574+0x27/0x60 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0034178>] ?
e1000_watchdog_task+0x648/0x830 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182d ]---
Nov 16 07:03:19 rx e1000e 0000:03:00.0 eth0: Reset adapter unexpectedly
Nov 16 07:03:19 rx BUG: unable to handle kernel NULL pointer
dereference at 00000000000000d0
Nov 16 07:03:19 rx IP: [<ffffffffa0031d51>]
e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx PGD 4a6c3067 PUD 4f440067 PMD 0
Nov 16 07:03:19 rx Oops: 0000 [#1] SMP
Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
ebtable_filter 8021q mrp e1000e ptp pps_core
Nov 16 07:03:19 rx CPU 0
Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Tainted: G        W
 3.9.2-1 #1 Supermicro X8DT6/X8DT6
Nov 16 07:03:19 rx RIP: e030:[<ffffffffa0031d51>]
[<ffffffffa0031d51>] e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx RSP: e02b:ffff88008ea03d98  EFLAGS: 00010202
Nov 16 07:03:19 rx RAX: 000000000000001a RBX: ffffc900115f9000 RCX:
ffff88008ea03e64
Nov 16 07:03:19 rx RDX: ffff88008ea03e64 RSI: ffff880002dc6e00 RDI:
ffff8800844026c0
Nov 16 07:03:19 rx RBP: ffff88008ea03e38 R08: 00000000000169e0 R09:
ffffea00000b7180
Nov 16 07:03:19 rx R10: 0000000000020063 R11: 0000000000000000 R12:
0000000000000000
Nov 16 07:03:19 rx R13: ffff880081fb6000 R14: ffff88008016c700 R15:
ffff880081fb6000
Nov 16 07:03:19 rx FS:  00007fde5acb8700(0000)
GS:ffff88008ea00000(0000) knlGS:0000000000000000
Nov 16 07:03:19 rx CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 16 07:03:19 rx CR2: 00000000000000d0 CR3: 0000000071e80000 CR4:
0000000000002660
Nov 16 07:03:19 rx DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
Nov 16 07:03:19 rx DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
Nov 16 07:03:19 rx Process kworker/0:0 (pid: 14268, threadinfo
ffff880045f8a000, task ffff88005c9c0000)
Nov 16 07:03:19 rx Stack:
Nov 16 07:03:19 rx  ffff880080220000 ffff880002dc6a00 000001d70000003c
0000001100020063
Nov 16 07:03:19 rx  000000408ea03e18 ffff880083d7d000 000005dc00000019
ffff88008016c000
Nov 16 07:03:19 rx  ffff880083d7d098 0000000000000000 ffffc900115f9000
ffff88008ea03e64
Nov 16 07:03:19 rx Call Trace:
Nov 16 07:03:19 rx  <IRQ>
Nov 16 07:03:19 rx  [<ffffffffa003354d>] e1000e_poll+0xbd/0x2d0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff81673d7f>] net_rx_action+0xaf/0x220
Nov 16 07:03:19 rx  [<ffffffff810c4e72>] ? handle_edge_irq+0x92/0x140
Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
Nov 16 07:03:19 rx  <EOI>
Nov 16 07:03:19 rx  [<ffffffff81664200>] ? skb_release_data+0x80/0xf0
Nov 16 07:03:19 rx  [<ffffffff8113291b>] ? __slab_free+0x14b/0x380
Nov 16 07:03:19 rx  [<ffffffff810f9dcb>] ? __free_pages_ok+0x8b/0xa0
Nov 16 07:03:19 rx  [<ffffffff810f9df6>] ? free_compound_page+0x16/0x20
Nov 16 07:03:19 rx  [<ffffffff810fdf27>] ? __put_compound_page+0x17/0x20
Nov 16 07:03:19 rx  [<ffffffff81132dcc>] ? kmem_cache_free+0x14c/0x170
Nov 16 07:03:19 rx  [<ffffffff816642f2>] ? __kfree_skb+0x42/0xa0
Nov 16 07:03:19 rx  [<ffffffff816643bc>] ? consume_skb+0x2c/0x70
Nov 16 07:03:19 rx  [<ffffffffa00324d7>] ?
e1000_clean_rx_ring+0xb7/0x2c0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0034bec>] ? e1000e_down+0x10c/0x260 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0037ff4>] ?
e1000e_reinit_locked+0x44/0x60 [e1000e]
Nov 16 07:03:19 rx  [<ffffffffa0038102>] ? e1000_reset_task+0x42/0x7b0 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff8106145a>] ? mod_timer+0x11a/0x1f0
Nov 16 07:03:19 rx  [<ffffffffa00340a6>] ?
e1000_watchdog_task+0x576/0x830 [e1000e]
Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
kthread_freezable_should_stop+0x70/0x70
Nov 16 07:03:19 rx Code: 3b 45 84 0f 8d f1 01 00 00 48 8b 5d b0 4d 89
fd 48 03 5d a8 48 8b 4d b8 ff c0 89 01 0f ae e8 4c 8b 63 08 48 c7 43
08 00 00 00 00 <49> 8b 84 24 d0 00 00 00 48 8b 75 c0 ff 45 cc 0f 18 08
8b 45 cc
Nov 16 07:03:19 rx RIP  [<ffffffffa0031d51>]
e1000_clean_rx_irq+0x101/0x490 [e1000e]
Nov 16 07:03:19 rx  RSP <ffff88008ea03d98>
Nov 16 07:03:19 rx CR2: 00000000000000d0
Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182e ]---
Nov 16 07:03:19 rx Kernel panic - not syncing: Fatal exception in interrupt


(gdb) l *e1000_clean_rx_irq+0x101
0x19d81 is in e1000_clean_rx_irq
(drivers/net/ethernet/intel/e1000e/netdev.c:933).
928                     rmb();  /* read descriptor and rx_buffer_info after status DD */
929
930                     skb = buffer_info->skb;
931                     buffer_info->skb = NULL;
932
933                     prefetch(skb->data - NET_IP_ALIGN);
934
935                     i++;
936                     if (i == rx_ring->count)
937                             i = 0;

Related

Bugs: #1

Discussion

  • Nick Pegg

    Nick Pegg - 2013-12-03
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -22,163 +22,162 @@
    
     --------
    -Nov 16 07:03:19 rx ------------[ cut here ]------------
    -Nov 16 07:03:19 rx WARNING: at net/sched/sch_generic.c:255
    -dev_watchdog+0x25b/0x270()
    -Nov 16 07:03:19 rx Hardware name: X8DT6
    -Nov 16 07:03:19 rx NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
    -Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
    -ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
    -ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
    -ebtable_filter 8021q mrp e1000e ptp pps_core
    -Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Not tainted 3.9.2-1 #1
    -Nov 16 07:03:19 rx Call Trace:
    -Nov 16 07:03:19 rx  <IRQ>  [<ffffffff8105070a>] warn_slowpath_common+0x7a/0xc0
    -Nov 16 07:03:19 rx  [<ffffffff810507f1>] warn_slowpath_fmt+0x41/0x50
    -Nov 16 07:03:19 rx  [<ffffffff8168e2ab>] dev_watchdog+0x25b/0x270
    -Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
    -Nov 16 07:03:19 rx  [<ffffffff81060464>] call_timer_fn+0x44/0x120
    -Nov 16 07:03:19 rx  [<ffffffff81060a71>] run_timer_softirq+0x241/0x2b0
    -Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
    -Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
    -Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
    -Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
    -Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
    -Nov 16 07:03:19 rx  <EOI>  [<ffffffff813b3722>] ? delay_tsc+0x32/0x80
    -Nov 16 07:03:19 rx  [<ffffffff813b373a>] ? delay_tsc+0x4a/0x80
    -Nov 16 07:03:19 rx  [<ffffffff813b36e8>] ? __const_udelay+0x28/0x30
    -Nov 16 07:03:19 rx  [<ffffffffa0025b0e>] ?
    -e1000e_read_phy_reg_mdic+0xce/0x120 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa0018bd0>] ?
    -e1000_get_hw_semaphore_82574+0x20/0x40 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa0027a75>] ?
    -e1000e_read_phy_reg_bm2+0x55/0xb0 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa00330d6>] ?
    -e1000e_flush_descriptors+0x96/0x270 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa00181b7>] ?
    -e1000_check_phy_82574+0x27/0x60 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa0034178>] ?
    -e1000_watchdog_task+0x648/0x830 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
    -Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
    -Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
    -Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
    -Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
    -Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
    -Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    -kthread_freezable_should_stop+0x70/0x70
    -Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
    -Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    -kthread_freezable_should_stop+0x70/0x70
    -Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182d ]---
    -Nov 16 07:03:19 rx e1000e 0000:03:00.0 eth0: Reset adapter unexpectedly
    -Nov 16 07:03:19 rx BUG: unable to handle kernel NULL pointer
    -dereference at 00000000000000d0
    -Nov 16 07:03:19 rx IP: [<ffffffffa0031d51>]
    -e1000_clean_rx_irq+0x101/0x490 [e1000e]
    -Nov 16 07:03:19 rx PGD 4a6c3067 PUD 4f440067 PMD 0
    -Nov 16 07:03:19 rx Oops: 0000 [#1] SMP
    -Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
    -ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
    -ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
    -ebtable_filter 8021q mrp e1000e ptp pps_core
    -Nov 16 07:03:19 rx CPU 0
    -Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Tainted: G        W
    - 3.9.2-1 #1 Supermicro X8DT6/X8DT6
    -Nov 16 07:03:19 rx RIP: e030:[<ffffffffa0031d51>]
    -[<ffffffffa0031d51>] e1000_clean_rx_irq+0x101/0x490 [e1000e]
    -Nov 16 07:03:19 rx RSP: e02b:ffff88008ea03d98  EFLAGS: 00010202
    -Nov 16 07:03:19 rx RAX: 000000000000001a RBX: ffffc900115f9000 RCX:
    -ffff88008ea03e64
    -Nov 16 07:03:19 rx RDX: ffff88008ea03e64 RSI: ffff880002dc6e00 RDI:
    -ffff8800844026c0
    -Nov 16 07:03:19 rx RBP: ffff88008ea03e38 R08: 00000000000169e0 R09:
    -ffffea00000b7180
    -Nov 16 07:03:19 rx R10: 0000000000020063 R11: 0000000000000000 R12:
    -0000000000000000
    -Nov 16 07:03:19 rx R13: ffff880081fb6000 R14: ffff88008016c700 R15:
    -ffff880081fb6000
    -Nov 16 07:03:19 rx FS:  00007fde5acb8700(0000)
    -GS:ffff88008ea00000(0000) knlGS:0000000000000000
    -Nov 16 07:03:19 rx CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    -Nov 16 07:03:19 rx CR2: 00000000000000d0 CR3: 0000000071e80000 CR4:
    -0000000000002660
    -Nov 16 07:03:19 rx DR0: 0000000000000000 DR1: 0000000000000000 DR2:
    -0000000000000000
    -Nov 16 07:03:19 rx DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
    -0000000000000400
    -Nov 16 07:03:19 rx Process kworker/0:0 (pid: 14268, threadinfo
    -ffff880045f8a000, task ffff88005c9c0000)
    -Nov 16 07:03:19 rx Stack:
    -Nov 16 07:03:19 rx  ffff880080220000 ffff880002dc6a00 000001d70000003c
    -0000001100020063
    -Nov 16 07:03:19 rx  000000408ea03e18 ffff880083d7d000 000005dc00000019
    -ffff88008016c000
    -Nov 16 07:03:19 rx  ffff880083d7d098 0000000000000000 ffffc900115f9000
    -ffff88008ea03e64
    -Nov 16 07:03:19 rx Call Trace:
    -Nov 16 07:03:19 rx  <IRQ>
    -Nov 16 07:03:19 rx  [<ffffffffa003354d>] e1000e_poll+0xbd/0x2d0 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffff81673d7f>] net_rx_action+0xaf/0x220
    -Nov 16 07:03:19 rx  [<ffffffff810c4e72>] ? handle_edge_irq+0x92/0x140
    -Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
    -Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
    -Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
    -Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
    -Nov 16 07:03:19 rx  <EOI>
    -Nov 16 07:03:19 rx  [<ffffffff81664200>] ? skb_release_data+0x80/0xf0
    -Nov 16 07:03:19 rx  [<ffffffff8113291b>] ? __slab_free+0x14b/0x380
    -Nov 16 07:03:19 rx  [<ffffffff810f9dcb>] ? __free_pages_ok+0x8b/0xa0
    -Nov 16 07:03:19 rx  [<ffffffff810f9df6>] ? free_compound_page+0x16/0x20
    -Nov 16 07:03:19 rx  [<ffffffff810fdf27>] ? __put_compound_page+0x17/0x20
    -Nov 16 07:03:19 rx  [<ffffffff81132dcc>] ? kmem_cache_free+0x14c/0x170
    -Nov 16 07:03:19 rx  [<ffffffff816642f2>] ? __kfree_skb+0x42/0xa0
    -Nov 16 07:03:19 rx  [<ffffffff816643bc>] ? consume_skb+0x2c/0x70
    -Nov 16 07:03:19 rx  [<ffffffffa00324d7>] ?
    -e1000_clean_rx_ring+0xb7/0x2c0 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa0034bec>] ? e1000e_down+0x10c/0x260 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa0037ff4>] ?
    -e1000e_reinit_locked+0x44/0x60 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffffa0038102>] ? e1000_reset_task+0x42/0x7b0 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffff8106145a>] ? mod_timer+0x11a/0x1f0
    -Nov 16 07:03:19 rx  [<ffffffffa00340a6>] ?
    -e1000_watchdog_task+0x576/0x830 [e1000e]
    -Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
    -Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
    -Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
    -Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
    -Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
    -Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
    -Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    -kthread_freezable_should_stop+0x70/0x70
    -Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
    -Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    -kthread_freezable_should_stop+0x70/0x70
    -Nov 16 07:03:19 rx Code: 3b 45 84 0f 8d f1 01 00 00 48 8b 5d b0 4d 89
    -fd 48 03 5d a8 48 8b 4d b8 ff c0 89 01 0f ae e8 4c 8b 63 08 48 c7 43
    -08 00 00 00 00 <49> 8b 84 24 d0 00 00 00 48 8b 75 c0 ff 45 cc 0f 18 08
    -8b 45 cc
    -Nov 16 07:03:19 rx RIP  [<ffffffffa0031d51>]
    -e1000_clean_rx_irq+0x101/0x490 [e1000e]
    -Nov 16 07:03:19 rx  RSP <ffff88008ea03d98>
    -Nov 16 07:03:19 rx CR2: 00000000000000d0
    -Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182e ]---
    -Nov 16 07:03:19 rx Kernel panic - not syncing: Fatal exception in interrupt
    +    Nov 16 07:03:19 rx ------------[ cut here ]------------
    +    Nov 16 07:03:19 rx WARNING: at net/sched/sch_generic.c:255
    +    dev_watchdog+0x25b/0x270()
    +    Nov 16 07:03:19 rx Hardware name: X8DT6
    +    Nov 16 07:03:19 rx NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
    +    Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
    +    ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
    +    ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
    +    ebtable_filter 8021q mrp e1000e ptp pps_core
    +    Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Not tainted 3.9.2-1 #1
    +    Nov 16 07:03:19 rx Call Trace:
    +    Nov 16 07:03:19 rx  <IRQ>  [<ffffffff8105070a>] warn_slowpath_common+0x7a/0xc0
    +    Nov 16 07:03:19 rx  [<ffffffff810507f1>] warn_slowpath_fmt+0x41/0x50
    +    Nov 16 07:03:19 rx  [<ffffffff8168e2ab>] dev_watchdog+0x25b/0x270
    +    Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
    +    Nov 16 07:03:19 rx  [<ffffffff81060464>] call_timer_fn+0x44/0x120
    +    Nov 16 07:03:19 rx  [<ffffffff81060a71>] run_timer_softirq+0x241/0x2b0
    +    Nov 16 07:03:19 rx  [<ffffffff8168e050>] ? __netdev_watchdog_up+0x80/0x80
    +    Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
    +    Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
    +    Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
    +    Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
    +    Nov 16 07:03:19 rx  <EOI>  [<ffffffff813b3722>] ? delay_tsc+0x32/0x80
    +    Nov 16 07:03:19 rx  [<ffffffff813b373a>] ? delay_tsc+0x4a/0x80
    +    Nov 16 07:03:19 rx  [<ffffffff813b36e8>] ? __const_udelay+0x28/0x30
    +    Nov 16 07:03:19 rx  [<ffffffffa0025b0e>] ?
    +    e1000e_read_phy_reg_mdic+0xce/0x120 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa0018bd0>] ?
    +    e1000_get_hw_semaphore_82574+0x20/0x40 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa0027a75>] ?
    +    e1000e_read_phy_reg_bm2+0x55/0xb0 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa00330d6>] ?
    +    e1000e_flush_descriptors+0x96/0x270 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa00181b7>] ?
    +    e1000_check_phy_82574+0x27/0x60 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa0034178>] ?
    +    e1000_watchdog_task+0x648/0x830 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
    +    Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
    +    Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
    +    Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
    +    Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
    +    Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
    +    Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    +    kthread_freezable_should_stop+0x70/0x70
    +    Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
    +    Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    +    kthread_freezable_should_stop+0x70/0x70
    +    Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182d ]---
    +    Nov 16 07:03:19 rx e1000e 0000:03:00.0 eth0: Reset adapter unexpectedly
    +    Nov 16 07:03:19 rx BUG: unable to handle kernel NULL pointer
    +    dereference at 00000000000000d0
    +    Nov 16 07:03:19 rx IP: [<ffffffffa0031d51>]
    +    e1000_clean_rx_irq+0x101/0x490 [e1000e]
    +    Nov 16 07:03:19 rx PGD 4a6c3067 PUD 4f440067 PMD 0
    +    Nov 16 07:03:19 rx Oops: 0000 [#1] SMP
    +    Nov 16 07:03:19 rx Modules linked in: xt_u32 xt_physdev
    +    ip6table_mangle ip6_tables ebt_comment ebt_set ebt_arp ebt_limit
    +    ebt_ip6 ebt_ip ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding
    +    ebtable_filter 8021q mrp e1000e ptp pps_core
    +    Nov 16 07:03:19 rx CPU 0
    +    Nov 16 07:03:19 rx Pid: 14268, comm: kworker/0:0 Tainted: G        W
    +     3.9.2-1 #1 Supermicro X8DT6/X8DT6
    +    Nov 16 07:03:19 rx RIP: e030:[<ffffffffa0031d51>]
    +    [<ffffffffa0031d51>] e1000_clean_rx_irq+0x101/0x490 [e1000e]
    +    Nov 16 07:03:19 rx RSP: e02b:ffff88008ea03d98  EFLAGS: 00010202
    +    Nov 16 07:03:19 rx RAX: 000000000000001a RBX: ffffc900115f9000 RCX:
    +    ffff88008ea03e64
    +    Nov 16 07:03:19 rx RDX: ffff88008ea03e64 RSI: ffff880002dc6e00 RDI:
    +    ffff8800844026c0
    +    Nov 16 07:03:19 rx RBP: ffff88008ea03e38 R08: 00000000000169e0 R09:
    +    ffffea00000b7180
    +    Nov 16 07:03:19 rx R10: 0000000000020063 R11: 0000000000000000 R12:
    +    0000000000000000
    +    Nov 16 07:03:19 rx R13: ffff880081fb6000 R14: ffff88008016c700 R15:
    +    ffff880081fb6000
    +    Nov 16 07:03:19 rx FS:  00007fde5acb8700(0000)
    +    GS:ffff88008ea00000(0000) knlGS:0000000000000000
    +    Nov 16 07:03:19 rx CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
    +    Nov 16 07:03:19 rx CR2: 00000000000000d0 CR3: 0000000071e80000 CR4:
    +    0000000000002660
    +    Nov 16 07:03:19 rx DR0: 0000000000000000 DR1: 0000000000000000 DR2:
    +    0000000000000000
    +    Nov 16 07:03:19 rx DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
    +    0000000000000400
    +    Nov 16 07:03:19 rx Process kworker/0:0 (pid: 14268, threadinfo
    +    ffff880045f8a000, task ffff88005c9c0000)
    +    Nov 16 07:03:19 rx Stack:
    +    Nov 16 07:03:19 rx  ffff880080220000 ffff880002dc6a00 000001d70000003c
    +    0000001100020063
    +    Nov 16 07:03:19 rx  000000408ea03e18 ffff880083d7d000 000005dc00000019
    +    ffff88008016c000
    +    Nov 16 07:03:19 rx  ffff880083d7d098 0000000000000000 ffffc900115f9000
    +    ffff88008ea03e64
    +    Nov 16 07:03:19 rx Call Trace:
    +    Nov 16 07:03:19 rx  <IRQ>
    +    Nov 16 07:03:19 rx  [<ffffffffa003354d>] e1000e_poll+0xbd/0x2d0 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffff81673d7f>] net_rx_action+0xaf/0x220
    +    Nov 16 07:03:19 rx  [<ffffffff810c4e72>] ? handle_edge_irq+0x92/0x140
    +    Nov 16 07:03:19 rx  [<ffffffff8105893f>] __do_softirq+0xef/0x270
    +    Nov 16 07:03:19 rx  [<ffffffff81058bc5>] irq_exit+0xb5/0xc0
    +    Nov 16 07:03:19 rx  [<ffffffff81435a8f>] xen_evtchn_do_upcall+0x2f/0x40
    +    Nov 16 07:03:19 rx  [<ffffffff817a24fe>] xen_do_hypervisor_callback+0x1e/0x30
    +    Nov 16 07:03:19 rx  <EOI>
    +    Nov 16 07:03:19 rx  [<ffffffff81664200>] ? skb_release_data+0x80/0xf0
    +    Nov 16 07:03:19 rx  [<ffffffff8113291b>] ? __slab_free+0x14b/0x380
    +    Nov 16 07:03:19 rx  [<ffffffff810f9dcb>] ? __free_pages_ok+0x8b/0xa0
    +    Nov 16 07:03:19 rx  [<ffffffff810f9df6>] ? free_compound_page+0x16/0x20
    +    Nov 16 07:03:19 rx  [<ffffffff810fdf27>] ? __put_compound_page+0x17/0x20
    +    Nov 16 07:03:19 rx  [<ffffffff81132dcc>] ? kmem_cache_free+0x14c/0x170
    +    Nov 16 07:03:19 rx  [<ffffffff816642f2>] ? __kfree_skb+0x42/0xa0
    +    Nov 16 07:03:19 rx  [<ffffffff816643bc>] ? consume_skb+0x2c/0x70
    +    Nov 16 07:03:19 rx  [<ffffffffa00324d7>] ?
    +    e1000_clean_rx_ring+0xb7/0x2c0 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa0034bec>] ? e1000e_down+0x10c/0x260 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa0037ff4>] ?
    +    e1000e_reinit_locked+0x44/0x60 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffffa0038102>] ? e1000_reset_task+0x42/0x7b0 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffff8106145a>] ? mod_timer+0x11a/0x1f0
    +    Nov 16 07:03:19 rx  [<ffffffffa00340a6>] ?
    +    e1000_watchdog_task+0x576/0x830 [e1000e]
    +    Nov 16 07:03:19 rx  [<ffffffff81797877>] ? __schedule+0x3a7/0x7c0
    +    Nov 16 07:03:19 rx  [<ffffffff8106c74e>] ? process_one_work+0x16e/0x430
    +    Nov 16 07:03:19 rx  [<ffffffff8106ea3c>] ? worker_thread+0x11c/0x410
    +    Nov 16 07:03:19 rx  [<ffffffff8106e920>] ? manage_workers+0x360/0x360
    +    Nov 16 07:03:19 rx  [<ffffffff810738f6>] ? kthread+0xc6/0xd0
    +    Nov 16 07:03:19 rx  [<ffffffff81003869>] ? xen_end_context_switch+0x19/0x20
    +    Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    +    kthread_freezable_should_stop+0x70/0x70
    +    Nov 16 07:03:19 rx  [<ffffffff817a10bc>] ? ret_from_fork+0x7c/0xb0
    +    Nov 16 07:03:19 rx  [<ffffffff81073830>] ?
    +    kthread_freezable_should_stop+0x70/0x70
    +    Nov 16 07:03:19 rx Code: 3b 45 84 0f 8d f1 01 00 00 48 8b 5d b0 4d 89
    +    fd 48 03 5d a8 48 8b 4d b8 ff c0 89 01 0f ae e8 4c 8b 63 08 48 c7 43
    +    08 00 00 00 00 <49> 8b 84 24 d0 00 00 00 48 8b 75 c0 ff 45 cc 0f 18 08
    +    8b 45 cc
    +    Nov 16 07:03:19 rx RIP  [<ffffffffa0031d51>]
    +    e1000_clean_rx_irq+0x101/0x490 [e1000e]
    +    Nov 16 07:03:19 rx  RSP <ffff88008ea03d98>
    +    Nov 16 07:03:19 rx CR2: 00000000000000d0
    +    Nov 16 07:03:19 rx ---[ end trace f896af6c9f44182e ]---
    +    Nov 16 07:03:19 rx Kernel panic - not syncing: Fatal exception in interrupt
     --------
    
     --------
    -(gdb) l *e1000_clean_rx_irq+0x101
    -0x19d81 is in e1000_clean_rx_irq
    -(drivers/net/ethernet/intel/e1000e/netdev.c:933).
    -928                     rmb();  /* read descriptor and rx_buffer_info
    -after status DD */
    -929
    -930                     skb = buffer_info->skb;
    -931                     buffer_info->skb = NULL;
    -932
    -933                     prefetch(skb->data - NET_IP_ALIGN);
    -934
    -935                     i++;
    -936                     if (i == rx_ring->count)
    -937                             i = 0;
    +    (gdb) l *e1000_clean_rx_irq+0x101
    +    0x19d81 is in e1000_clean_rx_irq
    +    (drivers/net/ethernet/intel/e1000e/netdev.c:933).
    +    928                     rmb();  /* read descriptor and rx_buffer_info after status DD */
    +    929
    +    930                     skb = buffer_info->skb;
    +    931                     buffer_info->skb = NULL;
    +    932
    +    933                     prefetch(skb->data - NET_IP_ALIGN);
    +    934
    +    935                     i++;
    +    936                     if (i == rx_ring->count)
    +    937                             i = 0;
     --------
    
     
  • Todd Fujinaka

    Todd Fujinaka - 2013-12-03
    • assigned_to: Todd Fujinaka
     
  • Todd Fujinaka

    Todd Fujinaka - 2013-12-03

    I'd like to get this reproduced locally. Can you give us more details on your exact hardware and software setup, and exact repro steps such as set up of the system and the link partner?

    Thanks.

     
  • Nick Pegg

    Nick Pegg - 2013-12-03

    I'm not sure what specific hardware/software information you're exactly looking for. Let me know if there's other information or output of a specific command that you want to see.

    The affected servers all have Supermicro X8DT6 motherboards with dual Intel 82574L onboard NICs. The server host that I know was most recently affected has dual Intel L5520 processors. The all of these servers are using LACP to bond eth0 and eth1. On the other end of the links are Cisco Nexus 5548UP switches. Below are ethtool and lspci output for the NICs.

    The newest of the affected servers that I've seen is a Xen 4.1.6 host running Linux kernel 3.9.2 using with the e1000e driver from the kernel source tree (2.2.14-k). All Xen DomUs are PV and connect via bridged networking on the host.

    The traffic which has triggered this out in the wild is usually a large TCP SYN flood against one of the DomUs (output of one of the tcpdumps is below). I've tried unsuccessfully to replicate this in my lab with identical hardware/software and tcpdump sees a SYN packet every 6 microseconds, but I don't know if that's simply not enough or if the other traffic going in/out of the host is compounding the problem.


    # ethtool -i eth0
    driver: e1000e
    version: 2.2.14-k
    firmware-version: 1.9-0
    bus-info: 0000:03:00.0
    # ethtool -i eth1
    driver: e1000e
    version: 2.2.14-k
    firmware-version: 1.9-0
    bus-info: 0000:04:00.0
    

    # lspci -vvv -nn -d 8086:10d3
    03:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
        Subsystem: Super Micro Computer Inc Device [15d9:10d3]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at fade0000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at cf80 [size=32]
        Region 3: Memory at faddc000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [c8] Power Management version 2
            Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
            Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
            Address: 0000000000000000  Data: 0000
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
            DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
            DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                MaxPayload 128 bytes, MaxReadReq 512 bytes
            DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
            LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
                ClockPM- Surprise- LLActRep- BwNot-
            LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
            LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
            Vector table: BAR=3 offset=00000000
            PBA: BAR=3 offset=00002000
        Capabilities: [100 v1] Advanced Error Reporting
            UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
            CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
            CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
            AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-66-a7-44
        Kernel driver in use: e1000e
    
    04:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]
        Subsystem: Super Micro Computer Inc Device [15d9:10d3]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 17
        Region 0: Memory at faee0000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at df80 [size=32]
        Region 3: Memory at faedc000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [c8] Power Management version 2
            Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
            Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
            Address: 0000000000000000  Data: 0000
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
            DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
            DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                MaxPayload 128 bytes, MaxReadReq 512 bytes
            DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
            LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
                ClockPM- Surprise- LLActRep- BwNot-
            LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
            LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
            Vector table: BAR=3 offset=00000000
            PBA: BAR=3 offset=00002000
        Capabilities: [100 v1] Advanced Error Reporting
            UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
            CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
            CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
            AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-66-a7-45
        Kernel driver in use: e1000e
    

    07:03:21.006427 IP 125.64.33.62.764 > x.x.x.x.62764: Flags [S], seq 674729391, win 16384, length 0
    07:03:21.006436 IP 125.64.33.62.764 > x.x.x.x.62764: Flags [S], seq 674729392, win 16384, length 0
    07:03:21.006439 IP 23.110.53.83.917 > x.x.x.x.62764: Flags [S], seq 674763526, win 16384, length 0
    07:03:21.006441 IP 23.110.53.83.917 > x.x.x.x.62764: Flags [S], seq 674763527, win 16384, length 0
    07:03:21.006444 IP 23.110.53.83.917 > x.x.x.x.62764: Flags [S], seq 674763528, win 16384, length 0
    07:03:21.006446 IP 23.110.53.83.917 > x.x.x.x.62764: Flags [S], seq 674763529, win 16384, length 0
    07:03:21.006449 IP 125.64.33.62.764 > x.x.x.x.62764: Flags [S], seq 674729393, win 16384, length 0
    07:03:21.006452 IP 125.64.33.62.764 > x.x.x.x.62764: Flags [S], seq 674729394, win 16384, length 0
    07:03:21.006454 IP 74.208.9.230.201 > x.x.x.x.62764: Flags [S], seq 674727252, win 16384, length 0
    07:03:21.006457 IP 27.54.181.180.37 > x.x.x.x.62764: Flags [S], seq 674749865, win 16384, length 0
    07:03:21.006459 IP 23.110.53.83.917 > x.x.x.x.62764: Flags [S], seq 674763559, win 16384, length 0
    07:03:21.006462 IP 23.110.53.83.917 > x.x.x.x.62764: Flags [S], seq 674763560, win 16384, length 0
    

     
  • Nathan March

    Nathan March - 2013-12-24

    I've hit this bug as well, also on a xen host. Using a supermicro x8dtt-h board with dual onboard nic's, also bonded to a cisco (non-nexus).

     
  • Todd Fujinaka

    Todd Fujinaka - 2013-12-24

    This is still in our repro queue. I'll have mroe details when we actually get it set up in-house.

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-01-27

    This is now up to the top of the queue, but looking more carefully at the issue I'm wondering if the systems that do hang are using an older version of the driver that doesn't properly disable ASPM.

    We'll give it a try, but if we can't reproduce this I think I'm going to close the issue.

     
  • Nick Pegg

    Nick Pegg - 2014-01-28

    I checked the other bug report (#385) where you mention ASPM being a potential problem. I checked with lspci -vvv | grep -i ASPM and everything is disabled. Do you know what driver version is affected by the ASPM bug?

    We're close to pushing out a new software stack, so I'll make sure it gets e1000e v2.5.4 rolled into it to try to rule this out.

     
  • Nick Pegg

    Nick Pegg - 2014-01-29

    It looks like one of my servers ran across this again last night. This particular one had e1000e version 2.2.14-k, from the kernel tree.

    I've attached two files, the traceback from the affected server (traceback.log), and a sample of the tcpdump output showing some of the traffic that triggered this (tcpdump.sample.log).

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-01-29

    So you can get the problem to reproduce, just not in the lab? We were thinking you had reports but weren't actually seeing the problem directly.

    Can you send us all the logs from the system, lspci -vvv, etc?

    Thanks.

     
  • Nick Pegg

    Nick Pegg - 2014-01-29

    Yeah, sorry, I should have mentioned that it's happened multiple times in the field but I've never been able to make it happen in my lab.

    I've attached the output of lspci -vvv as lspci.txt.

    What logs are you interested in specifically? Syslog? Serial console logs? The log file in my last update including the traceback are from the serial console and are the only things in the logs from near the crash up to the server rebooting.

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-01-30

    So my theory was that ASPM wasn't turned off for all the machines in the field, but if you're sure that the lspci is from the failing system then we should look into other possible points of failure. The serial console log, dmesg, lspci -vvv as root, and eeprom dump would be the first things I'd want to look at.

     
  • Nick Pegg

    Nick Pegg - 2014-02-13

    Attached to this update is a tarball of some console logs, lspci -vvv, and eeprom dumps from a multitude of hosts which have recently come across this issue. I apologize that some of the console logs were cut off due to the DoS but some should contain full stack traces.

    Additionally, these were triggered by an NTP reflection attack instead of the usual SYN flood. The rate of packets was still incredibly high, one packet every 2 or 3 microseconds according to the tcpdump.

     
  • Nick Pegg

    Nick Pegg - 2014-02-27

    Todd- Have you gotten any closer to tracking down the source of the problem? Any luck with reproducing this in your lab?

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-02-27

    No luck at all. I think the lab wanted to close the internal bug.

    I completely missed the 2/13 reply, so let me check out the logs right now.

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-02-27

    Have you been able to try the out-of-tree driver from e1000.sourceforge.net? And what driver are you running in your VMs?

     
  • Nick Pegg

    Nick Pegg - 2014-02-27

    I got e1000e version 2.5.4 (out-of-tree) rolled into our latest server stack and have about ~50 servers running with it. I haven't seen them crash out, but it doesn't necessarily mean that it solved the problem. I'll let you know if I see evidence that the new driver fixed it.

    As for the driver in the VMs, ethtool -i eth0 reports that the driver is 'vif'.

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-02-27

    Sorry for the thrash, but I just went through the notes and talked to the repro lab as well as our VM guys and here's how we'd like to proceed. We haven't tried a local repro without exact repro steps.

    First, we need you to reproduce the issue in your lab. If you can't reproduce it, we're pretty sure we wouldn't be able to reproduce it either. I'm not sure how much traffic you sent to your system under test, but you need to hammer it with as much traffic as you're seeing in your live systems. One way you may be able to do this is with Chariot. Another way is to run DPDK and just send out as much traffic as it can send to your SUT. Yet another way is to have several traffic generator systems and a fast enough switch.

    Second, after you reproduce the issue, we're going to need you to reproduce this on the latest driver, or at least the latest kernel from the distro you're running.

    Finally, we're going to need the exact setup and repro steps so we can examine the system locally.

    Unfortunately I haven't found a bug in our database that sounds exactly like what you're seeing so without your repro I'm not going to be able to give you a concrete answer.

    I hope that helps.

     
  • Nick Pegg

    Nick Pegg - 2014-04-23

    Sorry for the lack of updates. I still haven't been able to reproduce this bug in the lab.

    However, we have a decent amount of our e1000e-based fleet running driver version 2.5.4 and we haven't seen any crashes in the path month. I'm willing to say that using a later out-of-tree driver did the trick. You can go ahead and close this bug report, and I'll re-open it or open a new bug report if this problem is still around.

     
  • Todd Fujinaka

    Todd Fujinaka - 2014-04-23
    • status: open --> closed
     
  • Todd Fujinaka

    Todd Fujinaka - 2014-04-23

    I'm glad to hear that the problems have cleared up. Let us know if you have further issues.

     
  • Nitinkumar Balar

    • status: closed --> open
     
  • Nitinkumar Balar

    I am facing the same issue in e1000e driver and it's reproducible can anyone help me?

    Aug 19 01:17:01 1408391221 inet: inetd shutdown succeeded Aug 19 01:17:01
    1408391221 inet: inetd startup succeeded Aug 19 01:17:37 1408391257 kernel:
    ------------[ cut here ]------------ Aug 19 01:17:37 1408391257 kernel:
    WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x111/0x194() Aug 19
    01:17:37 1408391257 kernel: Hardware name: To be filled by O.E.M.
    Aug 19 01:17:37 1408391257 kernel: NETDEV WATCHDOG: eth2 (e1000e): transmit
    queue 0 timed out Aug 19 01:17:37 1408391257 kernel: Modules linked in: ext2
    loop cls_fw sch_sfq af_packet sch_htb imq ipt_MASQUERADE xt_mac arptable_
    filter arp_tables xt_NOTRACK xt_IMQ xt_IPMARK compat_xtables ip6_tables
    xt_connmark xt_set ipt_REDIRECT xt_hashlimit xt_mark xt_tcpu dp xt_multiport
    xt_state ip_set_bitmap_ipmac ip_set_hash_ip ip_set_list_set ip_set_hash_net
    iptable_raw iptable_nat iptable_filter i ptable_mangle ip_set nfnetlink
    ip_tables x_tables pppoe pppox ppp_generic slhc nf_nat_tftp nf_nat_pptp
    nf_nat_proto_gre nf_nat_irc n f_nat_ftp nf_nat nf_conntrack_tftp
    nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_ftp
    sg e1000e nf_conntrack
    _ipv4 nf_conntrack nf_defrag_ipv4 8021q
    Aug 19 01:17:37 1408391257 kernel: Pid: 0, comm: kworker/0:0 Not tainted
    2.6.39.4-1smp #91 Aug 19 01:17:37 1408391257 kernel: Call Trace:
    Aug 19 01:17:37 1408391257 kernel: [<c10267c8>]
    warn_slowpath_common+0x65/0x7a Aug 19 01:17:37 1408391257 kernel:
    [<c119f449>] ? dev_watchdog+0x111/0x194 Aug 19 01:17:37 1408391257 kernel:
    [<c1026841>] warn_slowpath_fmt+0x26/0x2a Aug 19 01:17:37 1408391257 kernel:
    [<c119f449>] dev_watchdog+0x111/0x194 Aug 19 01:17:37 1408391257 kernel:
    [<c1006530>] ? hw_breakpoint_exceptions_notify+0x13/0xde
    Aug 19 01:17:37 1408391257 kernel: [<c102eb7a>]
    run_timer_softirq+0x10a/0x182 Aug 19 01:17:37 1408391257 kernel:
    [<c119f338>] ? netdev_watchdog_up+0x52/0x52 Aug 19 01:17:37 1408391257
    kernel: [<c102abb6>]
    do_softirq+0x6b/0xe5 Aug 19 01:17:37 1408391257
    kernel: [<c102ab4b>] ? __local_bh_enable+0x68/0x68 Aug 19 01:17:37
    1408391257 kernel: <IRQ> [<c102aac7>] ? irq_exit+0x32/0x3b Aug 19 01:17:37
    1408391257 kernel: [<c10149f2>] ? smp_apic_timer_interrupt+0x6e/0x7c
    Aug 19 01:17:37 1408391257 kernel: [<c11fbc82>] ?
    apic_timer_interrupt+0x2a/0x30 Aug 19 01:17:37 1408391257 kernel:
    [<c100779b>] ? default_idle+0x2e/0x43 Aug 19 01:17:37 1408391257 kernel:
    [<c100198c>] ? cpu_idle+0x3a/0x52 Aug 19 01:17:37 1408391257 kernel:
    [<c11f72af>] ? start_secondary+0x190/0x195 Aug 19 01:17:37 1408391257
    kernel: ---[ end trace 8ff04e093ab9f565 ]--- Aug 19 01:17:37 1408391257
    kernel: e1000e 0000:03:00.0: eth2: Reset adapter unexpectedly Aug 19
    01:17:41 1408391261 inet: inetd shutdown succeeded Aug 19 01:17:41
    1408391261 kernel: e1000e 0000:03:00.0: eth2: Timesync Tx Control register
    not set as expected Aug 19 01:17:41 1408391261 inet: inetd startup succeeded

     
    Last edit: Nitinkumar Balar 2014-08-28
  • Todd Fujinaka

    Todd Fujinaka - 2014-08-28
    • status: open --> closed
     
  • Todd Fujinaka

    Todd Fujinaka - 2014-08-28

    If you have the exact same motherboard, kernel, and Ethernet adapter, the solution appears to be to update your driver. If not, please open another issue.

    I'm closing this one.

     

Log in to post a comment.