From: Mark H. <ma...@os...> - 2004-05-18 14:23:45
|
I was running a modified tipc benchmark program that used 64 processes
instead of 32. I also had my kernel compiled with page alloc debug
turned on (memory allocations are unmapped when free to catch bad access
as soon as possible). It was part way through the 16K size when the
panic happened. It was on the server side.

Any thoughts on using the sourceforge bugzilla to keep track of current
bugs?

Mark.

[root@cl019 root]#
Unable to handle kernel paging request at virtual address f1067fe4
 printing eip:
f8a8617e
*pde = 00585067
*pte = 31067163
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
CPU: 1
EIP: 0060:[<f8a8617e>] Not tainted
EFLAGS: 00010206 (2.6.6-rc3)
EIP is at tipc_recv_msg+0x15e/0x880 [tipc]
eax: 00000000 ebx: f1067f50 ecx: f104a7f8 edx: f105df50
esi: efb89e18 edi: f4d8fbf8 ebp: f01cda70 esp: f01cda18
ds: 007b es: 007b ss: 0068
Process server_tipc (pid: 1439, threadinfo=f01cc000 task=f01f3a60)
Stack: f632fbf8 00000000 00000086 c1620ce0 00000000 00000086 f01cda3c c01338cb
       00000000 00002ae8 0000abe6 efb89e18 effcbf50 f104a7f8 00000001 00000000
       f01cda90 00000286 c1620ce0 f8aaf400 effcbf50 c054f970 f01cda80 f8aa4319
Call Trace:
 [<c01338cb>] kernel_text_address+0x3b/0x50
 [<f8aa4319>] recv_msg+0x39/0x50 [tipc]
 [<c0375e92>] netif_receive_skb+0x172/0x1b0
 [<c0375f54>] process_backlog+0x84/0x120
 [<c0376070>] net_rx_action+0x80/0x120
 [<c0124c38>] __do_softirq+0xb8/0xc0
 [<c0124c75>] do_softirq+0x35/0x40
 [<c0107cf5>] do_IRQ+0x175/0x230
 [<c0375e92>] netif_receive_skb+0x172/0x1b0
 [<c0105ce0>] common_interrupt+0x18/0x20
 [<c0221e39>] __copy_user_zeroing_intel+0x19/0xb0
 [<c0221fc2>] __copy_from_user_ll+0x72/0x80
 [<c0222088>] copy_from_user+0x58/0x80
 [<f8a8530f>] link_send_sections_long+0x30f/0xad0 [tipc]
 [<f8a824de>] link_schedule_port+0xfe/0x1b0 [tipc]
 [<f8a84d39>] link_send_sections_fast+0x559/0x820 [tipc]
 [<c0124c38>] __do_softirq+0xb8/0xc0
 [<f8a96312>] tipc_send+0x92/0x9d0 [tipc]
 [<c03f186f>] schedule+0x37f/0x7a0
 [<c0370bd5>] kfree_skbmem+0x25/0x30
 [<f8aa1946>] recv_msg+0x2b6/0x560 [tipc]
 [<f8aa12f0>] send_packet+0x90/0x180 [tipc]
 [<c011b140>] default_wake_function+0x0/0x20
 [<c036cbde>] sock_sendmsg+0x8e/0xb0
 [<c01193f8>] kernel_map_pages+0x28/0x64
 [<c036c9ba>] sockfd_lookup+0x1a/0x80
 [<c036e101>] sys_sendto+0xe1/0x100
 [<c0128fd2>] del_timer_sync+0x42/0x140
 [<c036d4a9>] sock_poll+0x29/0x30
 [<c01193f8>] kernel_map_pages+0x28/0x64
 [<c036e156>] sys_send+0x36/0x40
 [<c036e9ae>] sys_socketcall+0x12e/0x240
 [<c0105373>] syscall_call+0x7/0xb

Code: 8b 83 94 00 00 00 48 0f 85 f9 00 00 00 8d 73 44 8b 46 0c 8b
 <0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing

--
Mark Haverkamp <ma...@os...> |
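One way to read the register dump in this oops is to compare the faulting address with the general-purpose registers and with the first instruction on the Code: line. The sketch below does that arithmetic for the values printed above. It assumes the Code: bytes begin at the faulting instruction, which the oops output does not guarantee, and it says nothing about which structure field sits at the resulting offset. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the oops above (32-bit x86). */
#define FAULT_ADDR 0xf1067fe4u
#define REG_EBX    0xf1067f50u

/* First six bytes of the "Code:" line. */
static const uint8_t code_bytes[6] = { 0x8b, 0x83, 0x94, 0x00, 0x00, 0x00 };

/* Distance from a base register value to the faulting address. */
static uint32_t fault_offset(uint32_t fault_addr, uint32_t base)
{
    return fault_addr - base;
}

/*
 * Decode opcode 0x8b (MOV r32, r/m32) whose ModRM byte has mod=10b,
 * i.e. a base register plus a 32-bit little-endian displacement.
 * On success returns 1 and stores the displacement and the r/m
 * register number (3 == EBX); returns 0 for any other byte shape.
 */
static int decode_mov_disp32(const uint8_t *b, uint32_t *disp, unsigned *rm)
{
    assert(b && disp && rm);
    if (b[0] != 0x8b || (b[1] >> 6) != 2)
        return 0;
    if ((b[1] & 7u) == 4)       /* r/m=100b means a SIB byte follows */
        return 0;
    *rm = b[1] & 7u;
    *disp = (uint32_t)b[2] | ((uint32_t)b[3] << 8) |
            ((uint32_t)b[4] << 16) | ((uint32_t)b[5] << 24);
    return 1;
}
```

With the values from this trace, fault_offset(FAULT_ADDR, REG_EBX) and the decoded displacement agree, which is consistent with a read through the pointer held in ebx landing on an unmapped page.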
From: Jon M. <jon...@er...> - 2004-05-18 17:39:56
|
Hi Mark,
I haven't tried with 64 processes yet, but I will try to reproduce
and troubleshoot this problem when I have time. Right now I am
spending some time on making the lock handling at port/socket level
more symmetric and easier to follow. It will have some
performance implications, but I don't think the impact will be major.

I must admit that I am not familiar with Bugzilla. Is this the base
for the bug reporting/tracking system we already have for each
project, or is it something else? The bug report system has only
been used sporadically, as you may have noticed, but I have no
problem with starting to use it more systematically; indeed, I think
we will have to if TIPC makes it into the kernel.

/Jon

Mark Haverkamp wrote:

>I was running a modified tipc benchmark program that used 64 processes
>instead of 32. I also had my kernel compiled with page alloc debug
>turned on (memory allocations are unmapped when free to catch bad access
>as soon as possible). It was part way through the 16K size when the
>panic happened. It was on the server side.
>
>Any thoughts on using the sourceforge bugzilla to keep track of current
>bugs?
>
>Mark.
>
>[ ... ]
|
From: Mark H. <ma...@os...> - 2004-05-19 16:25:48
|
On Tue, 2004-05-18 at 10:39, Jon Maloy wrote:
> Hi Mark,
> I haven't tried with 64 processes yet, but I will try to reproduce
> and trouble-shoot this problem when I have time. Right now I am
> spending some time on making the lock handling at port/socket level
> more symmetric and easier to follow. It will have some
> performance implications, but it will not have any major impact,
> I think.
>
> I must admit that I am not familiar with Bugzilla. Is this the base
> for the bug reporting/tracking system we already have for each
> project, or is it something else ? The bug report system has only
> been used sporadically, as you may have noticed, but I have no
> problems with starting to use it more systematically; - I think
> indeed we will have to if TIPC makes it into the kernel.
>
> /Jon
>
> Mark Haverkamp wrote:
>
> [ ... ]

After I installed the latest tipc this morning I got another crash
similar to the last one, except this time the pointer is NULL. Looking at
the disassembly, it is calling buf_busy in tipc_recv_msg, I think at
line 1624.

[root@cl019 root]#
Unable to handle kernel NULL pointer dereference at virtual address 00000008
 printing eip:
f8e43204
*pde = 00000000
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
CPU: 0
EIP: 0060:[<f8e43204>] Not tainted
EFLAGS: 00010246 (2.6.6-rc3)
EIP is at tipc_recv_msg+0x174/0x8a0 [tipc]
eax: 00000000 ebx: f650af50 ecx: 000091bc edx: f4903f50
esi: f650af94 edi: f3f24bf8 ebp: c04fbea0 esp: c04fbe48
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c04fa000 task=c04621c0)
Stack: f5b52bf8 00000000 f7fffe60 ee376000 0000005a 0000005a 00000246 f650af50
       0000000c 000091bc 0000920b ef0f8e18 f475cf50 f48767f8 00000001 00000000
       f743ba20 0000000b c04fbeb0 f8e6cdc0 f475cf50 c054f970 c04fbeb0 f8e61959
Call Trace:
 [<f8e61959>] recv_msg+0x39/0x50 [tipc]
 [<c0375e92>] netif_receive_skb+0x172/0x1b0
 [<c0375f54>] process_backlog+0x84/0x120
 [<c0376070>] net_rx_action+0x80/0x120
 [<c0124c38>] __do_softirq+0xb8/0xc0
 [<c0124c75>] do_softirq+0x35/0x40
 [<c0107cf5>] do_IRQ+0x175/0x230
 [<c0103040>] default_idle+0x0/0x40
 [<c0105ce0>] common_interrupt+0x18/0x20
 [<c0103040>] default_idle+0x0/0x40
 [<c0103070>] default_idle+0x30/0x40
 [<c0103106>] cpu_idle+0x46/0x50
 [<c04fc9aa>] start_kernel+0x18a/0x1d0
 [<c04fc520>] unknown_bootoption+0x0/0x130

--
Mark Haverkamp <ma...@os...> |
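The two reports in this thread differ in their first line: "paging request" for f1067fe4 and "NULL pointer dereference" for 00000008. As far as I recall, the i386 fault handler picks the message by checking whether the faulting address lies within the first page, so a tiny address such as 00000008 usually means a member at that offset was read through a NULL base pointer. The sketch below illustrates that rule; the 4 KiB page size and all names are assumptions made for the example, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_I386 4096u    /* assumed: 4 KiB pages */

enum fault_kind { FAULT_NULL_DEREF, FAULT_PAGING_REQUEST };

/* Addresses inside the first page are reported as NULL dereferences. */
static enum fault_kind classify_fault(uint32_t addr)
{
    return addr < PAGE_SIZE_I386 ? FAULT_NULL_DEREF : FAULT_PAGING_REQUEST;
}

/*
 * For an access through a NULL base pointer, the faulting address is
 * simply the offset of the member that was touched.
 */
static uint32_t null_member_offset(uint32_t addr)
{
    assert(classify_fault(addr) == FAULT_NULL_DEREF);
    return addr;
}
```

Applied to this thread, classify_fault(0x00000008) gives FAULT_NULL_DEREF with a member offset of 8, while classify_fault(0xf1067fe4) gives FAULT_PAGING_REQUEST.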
From: Jon M. <jon...@er...> - 2004-05-19 17:02:19
|
Hmm, this doesn't look good.
The send queue is probably inconsistent, i.e. (crs == 0) and
(next_out != 0), since a valid buffer pointer can hardly cause this
crash. (buf_busy() calls skb_shared(), which does atomic_read() on
skb->users, so no further pointer accesses are involved.)

In older versions of TIPC we had a configurable check of link
consistency. You could re-introduce some simplified version of it,
run it at each send/recv, and dump the link (link_print()) when the
check fails.

I suspect a buffer overrun, where the UB->next pointer has been
overwritten by some earlier packet sending. Time to suspect the
bundling send function again....

/Jon

Mark Haverkamp wrote:

> After I Installed the latest tipc this morning I got another crash
> similar to the last except this time the pointer is NULL. Looking at the
> disassembly, it is calling buf_busy in tipc_recv_msg. I think at line
> 1624.
>
> [ ... ]
|
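A simplified consistency check of the kind Jon describes could look roughly like the sketch below. It is not the check from older TIPC versions: the structures and field names (first_out, last_out, next_out, out_queue_size) are hypothetical stand-ins chosen for illustration, and hooking the check into the send/recv paths and calling link_print() on failure is left out. Note also that walking a queue whose next pointer has been overwritten with garbage may itself fault, so this only catches structural inconsistencies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real TIPC structures differ. */
struct buf {
    struct buf *next;
};

struct link_sketch {
    struct buf *first_out;    /* head of the send queue */
    struct buf *last_out;     /* tail of the send queue */
    struct buf *next_out;     /* next buffer to transmit, or NULL */
    unsigned out_queue_size;  /* recorded number of queued buffers */
};

/* Returns 1 if the send queue looks consistent, 0 otherwise. */
static int link_queue_consistent(const struct link_sketch *l)
{
    const struct buf *b, *tail = NULL;
    unsigned n = 0;
    int next_out_seen;

    assert(l != NULL);
    next_out_seen = (l->next_out == NULL);

    /* Empty queue: every other field must agree that it is empty. */
    if (l->first_out == NULL)
        return l->next_out == NULL && l->last_out == NULL &&
               l->out_queue_size == 0;

    for (b = l->first_out; b != NULL; b = b->next) {
        if (++n > l->out_queue_size)
            return 0;         /* longer than recorded, or a cycle */
        if (b == l->next_out)
            next_out_seen = 1;
        tail = b;
    }
    return n == l->out_queue_size && tail == l->last_out && next_out_seen;
}
```

The state Jon suspects, an empty queue head together with a non-NULL next_out, is rejected by the first branch; a caller would dump the link at that point.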