#389 e1000e: skb->tail > skb->end => BUG

closed
None
in-kernel_driver
1
2014-08-22
2014-01-02
No

We are experiencing a recurrent Linux kernel BUG/panic with the e1000e driver on a Kontron mSP1 COM Express "mini" CPU module and other similar COM Express modules.

The BUG occurs frequently (right after booting and bringing the interface up) but not constantly. I suspect that we enter some kind of race condition.

However it can easily be reproduced by bringing the interface up and down in a loop -and ping-flooding it from another host.

on Kontron (with e1000e):
while true; do ifdown eth0; ifup eth0; done

on other Linux host:
ping -s1024 -w0 -f 192.168.4.1

The BUG/panic will then happen every time after 1-240 seconds.

If the attached patch: skb_put_nopanic.dump.patch is applied, we can see that the panic is caused by skb->tail > skb->end that leads to skb_over_panic() and BUG().

From attached messages:
skbuff: Warning: skb_put l:9250 t:0xdd455c62 e:0xdd453f40 len:9250

Furthermore, I can add, that when the patch is applied, everything seems to work perfectly before and after the situation occurs, and network traffic continues to flow. However, I know that this is only a temporary work-around and not a general solution.

I have been experimenting with various vanilla Linux kernels from kernel.org starting from 3.4 to 3.10.20 and various tweaked configurations -but we are currently clinging on to 3.10.10 for other reasons. I have also experimented with various versions of the e1000e driver from kernel.org and the e1000(e) Sourceforge project. However the problem seems identical for all kernels and driver versions.

I have attached kernel configuration, and various information from the machine, that might help identifying the issue.

The bom.txt file includes information about the (busybox based) userspace, all other output files should be easily identified by their name.

I will be glad to help with further investigations and information and any help or ideas for debugging and problem solving will be greatly appreciated.

Best regards

Morten Østergaard

19 Attachments

Discussion

  • Todd Fujinaka
    Todd Fujinaka
    2014-01-02

    One thing I would like you to check is the Spec Update (the latest is on intel.com, search for "82574 spec update") especially the latest errata 24 and 25. Please make sure your EEPROM is set correctly (I think bit 10 in the NVM is not set) and see if that changes anything.

    We're currently investigating locally.

     
  • Hi

    Thanks for your reply. I have read the 82574 spec update, but I am not entirely sure how to understand the erratas -and exactly which bit it is referring to. Below is my output of "ethtool -e eth0", maybe you can help me to see whether the bit is set. Is it eg. in byte "0x00" at Offset 0x003c or "0xf0" at 0x003d? Or am I entirely wrong here?

    Offset      Values
    ------      ------
    0x0000:     00 e0 4b 42 2b 65 30 03 46 f7 60 10 ff ff ff ff 
    0x0010:     ff ff ff ff 6b 02 00 00 86 80 d3 10 ff ff 5a 80 
    0x0020:     00 00 01 20 74 7e ff ff 00 00 c8 00 00 00 04 27 
    0x0030:     c9 6c 50 21 3e 07 0b 45 47 2d 40 01 00 f0 84 46 
    0x0040:     00 60 80 00 04 0f ff 7f 01 4d ec 92 5c fc 83 f0 
    0x0050:     20 00 83 00 50 00 1f 7d 61 19 f3 03 80 00 ff ff 
    0x0060:     ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
    0x0070:     ff ff ff ff ff ff ff ff ff ff 00 01 ff ff 82 b5 
    

    As far as I understand, changing the bit could then be done like this:

    ethtool -E eth0 magic 0x10d38086 offset 0x3d value 0xf4
    

    But is this what you suggest? and if I may ask, what is the implication of this?

    The value 0x10d38086 is found by "lspci", like this:

    ~ # lspci -k | grep "Class 0200"
    02:00.0 Class 0200: 8086:10d3 e1000e
    ~ # 
    

    Once again thanks for your interest, and I will be glad to help in any way possible.

    Best Regards
    Morten Østergaard

     
    Last edit: Morten Østergaard 2014-01-07
    • Todd Fujinaka
      Todd Fujinaka
      2014-01-07

      I'm not sure what you're talking about. What I see are, "... set bit 10 of word 0x1E in the NVM," and, "Set NVM word 0x0F, bit 1," which I see are already set. So this is probably not related to those issues.

       
  • Todd Fujinaka
    Todd Fujinaka
    2014-01-07

    The other question is whether you are able to install a more "complete" distro and see if the issue still occurs there. It would be helpful, as it would be much easier to debug with a complete toolchain.

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-01-15

    We are not seeing this issue reproduce locally on other hardware. Can you try on other hardware and can you ship us the failing hardware?

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-01-21

    I think we're going to try to get the hardware. Can you tell me exactly what you're running?

    Thanks.

     
  • Hi

    Unfortunately I am not allowed to ship the hardware, as all our current units are all used for prototype hardware and development activities.

    I have also not been able to install a standard distro on the units yet -but I will continue this task, when time and hardware are available. Preliminary tests by my colleagues, using different live-usb-stick images haven't show the same bug. However the kernel config have been very different, and they have not yet tried reproducing with the config attached above. I will however see to do this as soon as possible.

    When you tried reproducing the bug, did you use the same kernel version (3.10.10, vanilla kernel.org, no patches) and the config above?

    We have seen the bug on more than 10 physical CPU modules, and I have tested the issues on several Kontron modules with various part numbers, and the results are identical. Using a Portwell P-com B21A Module in the same setups will also give similar results.

    We use a custom made carrier board for our prototypes, but again, if I test on this -or on the recommended "Evaluation Carrier" from Kontron, the results are identical.

    Here are the details of our hardware setup:

    COM Express module:
    Vendor: Kontron
    Model: COMe-mSP1; Atom Z510 1.1GHz, 512MB DDR2, 512MB SSD
    Part-#: 34001-5151-11-1

    Evaluation Carrier:
    Vendor: Kontron
    Model: COMe Eval Carrier T10
    Part-#: 34101-0000-00-1

    Thanks again for your interest, please tell me if I can do more to help

    Best Regards
    Morten Østergaard

     
    Last edit: Morten Østergaard 2014-01-22
  • Todd Fujinaka
    Todd Fujinaka
    2014-01-22

    We don't have that much Atom-based hardware available as we've been focused on the larger servers locally. This looks like it could be a bug with either 32-bit or the Atom in the kernel network stack, but it's hard to say without a repro.

    We're trying to get some of the Kontron boards in-house for testing.

    The only other thing I'd suggest is to remember to use the make rules to edit your kernel config. You can edit the config by hand and get conflicting options set that way. As I recall, there was a posted configuration on the web that only half set-up the wake features and had problems. This probably has nothing to do with your problems, but I jsut wanted to point that out.

     
    Last edit: Todd Fujinaka 2014-01-22
  • Todd Fujinaka
    Todd Fujinaka
    2014-01-23

    We had another customer doing his own debugging on this issue, and he said it only happened when "CONFIG_CMA=y". Can you try turning that off? On his kernel it was still marked as EXPERIMENTAL.

     
  • Although I could not reproduce the crash on my ThinkPad X230 so far, the race detection tools [1] have found a couple of potential races that could be related to this.

    OS: ROSA Fresh R2 GNOME, x86-64
    Kernel: 3.10.27

    Hardware (from lspci -vnn):

    --------------
    00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection [8086:1502] (rev 04)
        Subsystem: Lenovo Device [17aa:21f3]
        Flags: bus master, fast devsel, latency 0, IRQ 43
        Memory at f2500000 (32-bit, non-prefetchable) [size=128K]
        Memory at f253b000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at 5080 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] PCI Advanced Features
        Kernel driver in use: e1000e
        Kernel modules: e1000e
    --------------
    

    The reported potential races are between e1000e_poll() and e1000e_down(). I did "ping flood" to the system under analysis and changed MTU there at the same time. I suspect, this may also happen with just making the interface up/down like suggested above but I haven't observed the problems in that scenario yet.

    From the report of the race detector:
    --------------
    (R1) Possible data race during read of size 2 at 0xffff8800a200e3a2: {{{
       T11 (L{}):
        #0  e1000_clean_rx_irq_ps (netdev.c:1289)
        #1  e1000e_poll (netdev.c:2654)
      Concurrent write(s) happened at (OR AFTER) these points:
       T39 (L{L3}):
        #0  e1000_clean_rx_ring (netdev.c:1701)
        #1  e1000e_down (netdev.c:4032)
        #2  e1000_change_mtu (netdev.c:5688)
      Location 0xffff8800a200e3a2 is 34 bytes inside a block 
      starting at 0xffff8800a200e380 of size 128 allocated by T1 from heap:
        #0  kmalloc (slub_def.h:170)
        #0  kzalloc (slab.h:519)
        #0  e1000_alloc_queues (netdev.c:2621)
        #0  e1000_sw_init (netdev.c:4096)
        #0  e1000_probe (netdev.c:6606)
      Locks involved in this report: {L3} # this is rtnl_lock
    }}}
    Thread T11: IRQ on CPU #2
    Thread T39: ifconfig
    
    (R2) Possible data race during read of size 8 at 0xffffc900047ef7e0: {{{
       T5 (L{}):
        #0  e1000_alloc_rx_buffers (netdev.c:653)
        #1  e1000_clean_rx_irq (netdev.c:1041)
        #2  e1000e_poll (netdev.c:2654)
      Concurrent write(s) happened at (OR AFTER) these points:
       T43 (L{L3}):
        #0  e1000_clean_rx_ring (netdev.c:1677)
        #1  e1000e_down (netdev.c:4032)
        #2  e1000_change_mtu (netdev.c:5688)
      Location 0xffffc900047ef7e0 is 10208 bytes inside a block 
      starting at 0xffffc900047ed000 of size 10240 allocated by T3 from heap:
        #0  e1000e_setup_rx_resources (netdev.c:2341)
        #1  e1000_open (netdev.c:4273)
      Locks involved in this report: {L3} # this is rtnl_lock
    }}}
    Thread T5:  IRQ on CPU #1
    Thread T43: ifconfig
    --------------
    

    If e1000e_poll() may run concurrently with e1000_clean_rx_ring, the driver may end up cleaning the same skb twice and get use-after-free.

    Consider the potential race R1 (technically, it is on the accesses to next_to_clean field).

    e1000_clean_rx_ring() cleans the rx_ring, frees the appropriate skbs (netdev.c:1675):

    if (buffer_info->skb) {
        dev_kfree_skb(buffer_info->skb);
        buffer_info->skb = NULL;
    }
    

    and then sets next_to_clean to 0, (netdev.c:1701):

    rx_ring->next_to_clean = 0;
    

    Suppose, e1000_clean_rx_irq_ps() executes at the same time, gets next_to_clean, gets the appropriate skb after dev_kfree_skb() has been called for it by e1000_clean_rx_ring() but before 'buffer_info->skb = NULL' is executed (netdev.c:1289).

    i = rx_ring->next_to_clean;
    rx_desc = E1000_RX_DESC_PS(*rx_ring, i);
    staterr = le32_to_cpu(rx_desc->wb.middle.status_error);
    buffer_info = &rx_ring->buffer_info[i];
    
    while (staterr & E1000_RXD_STAT_DD) {
        if (*work_done >= work_to_do)
            break;
        (*work_done)++;
        skb = buffer_info->skb;
    

    Then e1000_clean_rx_irq_ps() processes the skb that may contain garbage now and skb_put() could result in skb_over_panic().

    The story is similar for the potential race R2.

    e1000_alloc_rx_buffers() gets the skb pointer and processes it (netdev.c:653):

    skb = buffer_info->skb;
    if (skb) {
        skb_trim(skb, 0);
        goto map_skb;
    }
    

    but the memory might have been freed already by e1000_clean_rx_ring() (netdev.c:1675):

    if (buffer_info->skb) {
        dev_kfree_skb(buffer_info->skb);
        buffer_info->skb = NULL;
    }
    

    If the pointer is obtained before it has been set to NULL, there may be a trouble.

    e1000e_down() calls napi_synchronize(&adapter->napi) (e1000e/netdev.c:4019) before e1000_clean_rx_ring() but that alone does not prevent poll() from being called concurrently with e1000_clean_rx_ring. It just waits for the currently scheduled poll() callbacks to finish. I cannot see in the code, what else may prevent execution of poll() then.

    Hope this info helps investigate the problem.

    [1] http://code.google.com/p/kernel-strider/

     
    • Todd Fujinaka
      Todd Fujinaka
      2014-02-05

      Please open a separate bug for this. 82579 != 82574.

       
  • Todd Fujinaka
    Todd Fujinaka
    2014-05-19

    We can't reproduce this issue, so we're going to have to close it.

    Please re-open if you can ship us hardware or see it on something other than your custom hardware.

     
  • Todd Fujinaka
    Todd Fujinaka
    2014-05-19

    • status: open --> closed