From: Dylan A. S. <dy...@dy...> - 2005-01-27 00:47:55
We're running into what looks to be a bug in e1000 with NAPI. The stack trace from a 2.6.10 (actually Fedora's 2.6.10-1.741_FC3) kernel with 5.5.4-k2-NAPI is below. We also see the problem on a few other 2.6 kernels with 5.3.19-k2-NAPI (patched by FC3) and 5.6.10.1-k2-NAPI (unpatched). It happens both on x86_64 and i686.

The BUG occurs in the:

    BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));

check in netif_rx_complete called from e1000_clean.

All machines we test on have multiple e1000 (82546EB) cards. In the most repeatable case (where the problem usually happens within 30 minutes of bringing the machine up) there are three e1000 cards (two fiber, one quad copper). All machines are also SMP (we have not tested this on UP yet).

All of the interfaces on these machines are being used for bridging (using the linux bridge). We also have our own kernel module (packet_proxy_mod below) that acts as a bridge filter (using netfilter) which may be contributing to the problem in some way, though it works without problem in a number of other environments.

We have gone over both the e1000 driver and our own module but haven't found what could be causing this. (We plan to test without our module, just bridging, when we get a chance.) The situation seems clear enough: if e1000_clean is being called, the device must be on the poll_list, and so __LINK_STATE_RX_SCHED must be set. netif_rx_complete should be safe to call from the poll.

Any ideas?

Thanks,
:-Dylan

[root@mrs4 ~]#
------------[ cut here ]------------
kernel BUG at include/linux/netdevice.h:863!
invalid operand: 0000 [#1]
SMP
Modules linked in: packet_proxy_mod binfmt_misc dm_mod e1000 tg3 floppy sd_mod
CPU: 0
EIP: 0060:[<f8845417>] Tainted: P VLI
EFLAGS: 00010046 (2.6.10-rs3)
EIP is at e1000_clean+0xb3/0xbd [e1000]
eax: 00000006 ebx: 00000283 ecx: f7c3db7c edx: f8892680
esi: 00000001 edi: f7c3d800 ebp: 00000040 esp: c0351f48
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c0350000 task=c02cfb40)
Stack: f7c3da20 00000001 f7c3d900 f7c3d800 c2c3101c c2c31000 c023e5b4 000a8905
       000003e7 00000001 c0346f98 c037fbc0 00000000 c011c1a2 0000000a 00000046
       c0350000 c0383120 003e3007 c011c23a c0350008 c01042de c0102c8e c0350008
Call Trace:
 [<c023e5b4>] net_rx_action+0x70/0xf2
 [<c011c1a2>] __do_softirq+0x62/0xcd
 [<c011c23a>] do_softirq+0x2d/0x35
 [<c01042de>] do_IRQ+0x1e/0x24
 [<c0102c8e>] common_interrupt+0x1a/0x20
 [<c010070d>] mwait_idle+0x25/0x4a
 [<c01006da>] cpu_idle+0x2e/0x3c
 [<c035286a>] start_kernel+0x170/0x1af
 [<c0352316>] unknown_bootoption+0x0/0x1b1
Code: 9d 8b 04 24 e8 58 fe ff ff 31 d2 83 c4 08 89 d0 5b 5e 5f 5d c3 8b 47 24 ba 01 00 00 00 a8 02 74 a9 83 c4 08 89 d0 5b 5e 5f 5d c3 <0f> 0b 5f 03 4d 00 85 f8 eb 9f 55 57 31 ff 56 53 83 ec 24 89 44
<0>Kernel panic - not syncing: Fatal exception in interrupt
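
For anyone following the reasoning about __LINK_STATE_RX_SCHED, here is a minimal user-space sketch of the invariant the BUG_ON is checking. It is only a model, not the kernel code: every name prefixed with model_ is hypothetical, the bit number is arbitrary, and a plain bool stands in for membership on the poll_list. It shows just one sequence that would trip the check (a second completion for a single schedule) and is not meant as a diagnosis of the crash above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit position picked for this model only; the kernel defines its own. */
#define MODEL_RX_SCHED_MASK (1UL << 0)

/* Hypothetical stand-in for the parts of struct net_device that matter here. */
struct model_dev {
    atomic_ulong state;   /* plays the role of dev->state */
    bool on_poll_list;    /* plays the role of being linked on the poll_list */
};

static void model_init(struct model_dev *d)
{
    atomic_init(&d->state, 0UL);
    d->on_poll_list = false;
}

/*
 * Interrupt side: claim the RX_SCHED bit atomically and, only if we were
 * the one to set it, put the device on the poll list.
 * Returns true if this call scheduled the device.
 */
static bool model_rx_schedule(struct model_dev *d)
{
    unsigned long old = atomic_fetch_or(&d->state, MODEL_RX_SCHED_MASK);

    if (old & MODEL_RX_SCHED_MASK)
        return false;            /* already scheduled, nothing to do */
    d->on_poll_list = true;
    return true;
}

/*
 * Poll side: the device is expected to be scheduled when this runs.
 * Returns false at the point where the kernel's BUG_ON would fire.
 */
static bool model_rx_complete(struct model_dev *d)
{
    if (!(atomic_load(&d->state) & MODEL_RX_SCHED_MASK))
        return false;            /* BUG_ON(!test_bit(...)) in the real code */
    d->on_poll_list = false;
    atomic_fetch_and(&d->state, ~MODEL_RX_SCHED_MASK);
    return true;
}
```

In this model, one schedule followed by one complete succeeds and leaves the device off the list with the bit clear. A second model_rx_complete() without a schedule in between returns false, which corresponds to the failing check in netif_rx_complete.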