From: Mark H. <ma...@os...> - 2004-06-08 15:37:05
|
I ran my 4 node test yesterday with a lock around access to the quarantine_head in buf_safe_discard. It didn't hang this time but after about 14 hours or so two of the machines got something like this: net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): TIPC: Lost Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A TIPC: Lost contact with <1.1.17> bad: scheduling while atomic! TIPC: Established Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 Debug: sleeping function called from invalid context at mm/slab.c:1994 in_atomic():1, irqs_disabled():0 [<c010618e>] dump_stack+0x1e/0x30 [<c011e0c9>] __might_sleep+0x99/0xb0 [<c014bcdf>] kmem_cache_alloc+0x21f/0x230 [<c03786a3>] alloc_skb+0x23/0xf0 [<c037795e>] sock_alloc_send_pskb+0xce/0x1f0 [<c0377aae>] sock_alloc_send_skb+0x2e/0x40 [<c03dfe69>] unix_stream_sendmsg+0x199/0x3f0 [<c0374a3d>] sock_aio_write+0xbd/0xe0 [<c0165cd7>] do_sync_write+0x87/0xc0 [<c0165df9>] vfs_write+0xe9/0x120 [<c0165ecf>] sys_write+0x3f/0x60 [<c0105363>] syscall_call+0x7/0xb bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c010538a>] work_resched+0x5/0x16 bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c03f95ce>] schedule_timeout+0x6e/0xc0 [<c01941c5>] ep_poll+0x135/0x1b0 [<c0192e8b>] sys_epoll_wait+0xab/0xb0 [<c0105363>] syscall_call+0x7/0xb bad: scheduling while atomic! [<c010618e>] dump_stack+0x1e/0x30 [<c03f8d84>] schedule+0x6b4/0x6c0 [<c011d0cd>] sys_sched_yield+0x5d/0x90 [<c01741c3>] coredump_wait+0x43/0xb0 [<c0174398>] do_coredump+0x168/0x271 [<c012e1a7>] get_signal_to_deliver+0x287/0x510 [<c0105126>] do_signal+0xb6/0xf0 [<c01051bb>] do_notify_resume+0x5b/0x5d [<c01053ae>] work_notifysig+0x13/0x15 Kernel panic: Aiee, killing interrupt handler! In interrupt handler - not syncing I'm not sure what to make of this. I don't see TIPC on the stack, but who knows. I'll try page alloc debug to see if there is some re-using of free memory going on. Mark -- Mark Haverkamp <ma...@os...> |
From: Jon M. <jon...@er...> - 2004-06-08 18:00:33
|
The dropped messages are rather confusing. They seem to have destination address <1.1.13>, but have somehow ended up on <1.1.19> according to the dump. Maybe this is ok, since they are multicast messages, but only if they were carried as broadcast messages over the network. I think the correct destination address should be <1.1.0> if that is the case, but I haven't studied the implementation well enough to know how it works here. It is also obvious that net_route_named_msg() in net.c should allow a second lookup even of multicast messages, not only of named messages as it does now, so this is a bug that must be corrected (I will fix it). But I can not see any relation to the crash here. Did anything happen to node <1.1.17>, or is the lost/re-established link a result of the dropped messages ? /Jon Mark Haverkamp wrote: >I ran my 4 node test yesterday with a lock around access to the >quarantine_head in buf_safe_discard. It didn't hang this time but after >about 14 hours or so two of the machines got something like this: > > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >TIPC: Lost Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A >TIPC: Lost contact with <1.1.17> >bad: scheduling while atomic! >TIPC: Established Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >Debug: sleeping function called from invalid context at mm/slab.c:1994 >in_atomic():1, irqs_disabled():0 > [<c010618e>] dump_stack+0x1e/0x30 > [<c011e0c9>] __might_sleep+0x99/0xb0 > [<c014bcdf>] kmem_cache_alloc+0x21f/0x230 > [<c03786a3>] alloc_skb+0x23/0xf0 > [<c037795e>] sock_alloc_send_pskb+0xce/0x1f0 > [<c0377aae>] sock_alloc_send_skb+0x2e/0x40 > [<c03dfe69>] unix_stream_sendmsg+0x199/0x3f0 > [<c0374a3d>] sock_aio_write+0xbd/0xe0 > [<c0165cd7>] do_sync_write+0x87/0xc0 > [<c0165df9>] vfs_write+0xe9/0x120 > [<c0165ecf>] sys_write+0x3f/0x60 > [<c0105363>] syscall_call+0x7/0xb > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c03f95ce>] schedule_timeout+0x6e/0xc0 > [<c01941c5>] ep_poll+0x135/0x1b0 > [<c0192e8b>] sys_epoll_wait+0xab/0xb0 > [<c0105363>] syscall_call+0x7/0xb > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c011d0cd>] sys_sched_yield+0x5d/0x90 > [<c01741c3>] coredump_wait+0x43/0xb0 > [<c0174398>] do_coredump+0x168/0x271 > [<c012e1a7>] get_signal_to_deliver+0x287/0x510 > [<c0105126>] do_signal+0xb6/0xf0 > [<c01051bb>] do_notify_resume+0x5b/0x5d > [<c01053ae>] work_notifysig+0x13/0x15 > >Kernel panic: Aiee, killing interrupt handler! >In interrupt handler - not syncing > > >I'm not sure what to make of this. I don't see TIPC on the stack, but >who knows. I'll try page alloc debug to see if there is some re-using >of free memory going on. > >Mark > > |
From: Jon M. <jon...@er...> - 2004-06-08 18:45:48
|
I took a little closer look at the recvbcast code, and notice a couple of things: First, the code does consistently use buf_safe_discard() when it seems to be sufficient with buf_discard(). This function is more expensive to use, but should not cause any problems if it were correctly implemented. Unfortunately, it is not. I have forgotten to protect the quarantine queue with a lock, and this may quite well cause havoc in the both this buffer queue and elsewhere. My guess is that the very strange messages we see in the dump in reality are invalid, -maybe a mix of different messages. Otherwise I can not explain the destination port number zero in the messages, which seems impossible if one follows the call chain bcast_port_recv_msg()->nameseq_deliver_msg()-> port_recv_msg()->net_route_msg()->net_route_named_msg(). An extra lock for the quarantine queue is needed, and this will hopefully fix the problem, but buf_safe_discard() should anyway be changed to buf_discard() if there is no particular reason for using it. /jon Mark Haverkamp wrote: >I ran my 4 node test yesterday with a lock around access to the >quarantine_head in buf_safe_discard. It didn't hang this time but after >about 14 hours or so two of the machines got something like this: > > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >TIPC: Lost Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A >TIPC: Lost contact with <1.1.17> >bad: scheduling while atomic! >TIPC: Established Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >Debug: sleeping function called from invalid context at mm/slab.c:1994 >in_atomic():1, irqs_disabled():0 > [<c010618e>] dump_stack+0x1e/0x30 > [<c011e0c9>] __might_sleep+0x99/0xb0 > [<c014bcdf>] kmem_cache_alloc+0x21f/0x230 > [<c03786a3>] alloc_skb+0x23/0xf0 > [<c037795e>] sock_alloc_send_pskb+0xce/0x1f0 > [<c0377aae>] sock_alloc_send_skb+0x2e/0x40 > [<c03dfe69>] unix_stream_sendmsg+0x199/0x3f0 > [<c0374a3d>] sock_aio_write+0xbd/0xe0 > [<c0165cd7>] do_sync_write+0x87/0xc0 > [<c0165df9>] vfs_write+0xe9/0x120 > [<c0165ecf>] sys_write+0x3f/0x60 > [<c0105363>] syscall_call+0x7/0xb > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c03f95ce>] schedule_timeout+0x6e/0xc0 > [<c01941c5>] ep_poll+0x135/0x1b0 > [<c0192e8b>] sys_epoll_wait+0xab/0xb0 > [<c0105363>] syscall_call+0x7/0xb > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c011d0cd>] sys_sched_yield+0x5d/0x90 > [<c01741c3>] coredump_wait+0x43/0xb0 > [<c0174398>] do_coredump+0x168/0x271 > [<c012e1a7>] get_signal_to_deliver+0x287/0x510 > [<c0105126>] do_signal+0xb6/0xf0 > [<c01051bb>] do_notify_resume+0x5b/0x5d > [<c01053ae>] work_notifysig+0x13/0x15 > >Kernel panic: Aiee, killing interrupt handler! >In interrupt handler - not syncing > > >I'm not sure what to make of this. I don't see TIPC on the stack, but >who knows. I'll try page alloc debug to see if there is some re-using >of free memory going on. > >Mark > > |
From: Mark H. <ma...@os...> - 2004-06-08 20:23:30
|
On Tue, 2004-06-08 at 11:45, Jon Maloy wrote: > I took a little closer look at the recvbcast code, and notice > a couple of things: > First, the code does consistently use buf_safe_discard() when it seems > to be sufficient with buf_discard(). This function is more > expensive to use, but should not cause any problems if it > were correctly implemented. > Unfortunately, it is not. I have forgotten to protect the quarantine > queue with a lock, and this may quite well cause havoc in the > both this buffer queue and elsewhere. My guess is that the very > strange messages we see in the dump in reality are invalid, > -maybe a mix of different messages. Otherwise I can not > explain the destination port number zero in the messages, which > seems impossible if one follows the call chain > bcast_port_recv_msg()->nameseq_deliver_msg()-> > port_recv_msg()->net_route_msg()->net_route_named_msg(). > > An extra lock for the quarantine queue is needed, and this will hopefully > fix the problem, but buf_safe_discard() should anyway be changed to > buf_discard() if there is no particular reason for using it. The code that I was testing had a lock on the quarantine queue. One thing that may be the cause of problems in this case was that I did have page alloc debug turned on after all. It uses a whole page regardless of the allocation size as a debug tool. We may have just run out of pages. I am trying out the test once again without the page alloc debug compiled into the kernel. Mark. > > /jon > > Mark Haverkamp wrote: > > >I ran my 4 node test yesterday with a lock around access to the > >quarantine_head in buf_safe_discard. It didn't hang this time but after > >about 14 hours or so two of the machines got something like this: > > > > > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >TIPC: Lost Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > >TIPC: Lost contact with <1.1.17> > >bad: scheduling while atomic! > >TIPC: Established Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >Debug: sleeping function called from invalid context at mm/slab.c:1994 > >in_atomic():1, irqs_disabled():0 > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c011e0c9>] __might_sleep+0x99/0xb0 > > [<c014bcdf>] kmem_cache_alloc+0x21f/0x230 > > [<c03786a3>] alloc_skb+0x23/0xf0 > > [<c037795e>] sock_alloc_send_pskb+0xce/0x1f0 > > [<c0377aae>] sock_alloc_send_skb+0x2e/0x40 > > [<c03dfe69>] unix_stream_sendmsg+0x199/0x3f0 > > [<c0374a3d>] sock_aio_write+0xbd/0xe0 > > [<c0165cd7>] do_sync_write+0x87/0xc0 > > [<c0165df9>] vfs_write+0xe9/0x120 > > [<c0165ecf>] sys_write+0x3f/0x60 > > [<c0105363>] syscall_call+0x7/0xb > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c03f95ce>] schedule_timeout+0x6e/0xc0 > > [<c01941c5>] ep_poll+0x135/0x1b0 > > [<c0192e8b>] sys_epoll_wait+0xab/0xb0 > > [<c0105363>] syscall_call+0x7/0xb > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c011d0cd>] sys_sched_yield+0x5d/0x90 > > [<c01741c3>] coredump_wait+0x43/0xb0 > > [<c0174398>] do_coredump+0x168/0x271 > > [<c012e1a7>] get_signal_to_deliver+0x287/0x510 > > [<c0105126>] do_signal+0xb6/0xf0 > > [<c01051bb>] do_notify_resume+0x5b/0x5d > > [<c01053ae>] work_notifysig+0x13/0x15 > > > >Kernel panic: Aiee, killing interrupt handler! > >In interrupt handler - not syncing > > > > > >I'm not sure what to make of this. I don't see TIPC on the stack, but > >who knows. I'll try page alloc debug to see if there is some re-using > >of free memory going on. > > > >Mark > > > > -- Mark Haverkamp <ma...@os...> |
From: Mark H. <ma...@os...> - 2004-06-09 17:25:01
|
On Tue, 2004-06-08 at 13:21, Mark Haverkamp wrote: > On Tue, 2004-06-08 at 11:45, Jon Maloy wrote: > > I took a little closer look at the recvbcast code, and notice > > a couple of things: > > First, the code does consistently use buf_safe_discard() when it seems > > to be sufficient with buf_discard(). This function is more > > expensive to use, but should not cause any problems if it > > were correctly implemented. > > Unfortunately, it is not. I have forgotten to protect the quarantine > > queue with a lock, and this may quite well cause havoc in the > > both this buffer queue and elsewhere. My guess is that the very > > strange messages we see in the dump in reality are invalid, > > -maybe a mix of different messages. Otherwise I can not > > explain the destination port number zero in the messages, which > > seems impossible if one follows the call chain > > bcast_port_recv_msg()->nameseq_deliver_msg()-> > > port_recv_msg()->net_route_msg()->net_route_named_msg(). > > > > An extra lock for the quarantine queue is needed, and this will hopefully > > fix the problem, but buf_safe_discard() should anyway be changed to > > buf_discard() if there is no particular reason for using it. > > The code that I was testing had a lock on the quarantine queue. One > thing that may be the cause of problems in this case was that I did have > page alloc debug turned on after all. It uses a whole page regardless > of the allocation size as a debug tool. We may have just run out of > pages. I am trying out the test once again without the page alloc debug > compiled into the kernel. I still had two of the machines panic about 4 AM today even with page alloc debug compiled out. Here is the code I have been using with the lock for the quarantine list. It should probably be checked in anyway since it does seem to fix the hang that I was seeing. cvs diff -u buf.c Index: buf.c =================================================================== RCS file: /cvsroot/tipc/source/unstable/net/tipc/buf.c,v retrieving revision 1.12 diff -u -r1.12 buf.c --- buf.c 5 May 2004 19:09:03 -0000 1.12 +++ buf.c 9 Jun 2004 17:18:46 -0000 @@ -106,10 +106,13 @@ * queue instead. */ +static spinlock_t qb_lock = SPIN_LOCK_UNLOCKED; void buf_safe_discard(struct sk_buff *buf) { - struct sk_buff *qbuf = quarantine_head; + struct sk_buff *qbuf; + spin_lock_bh(&qb_lock); + qbuf = quarantine_head; while (qbuf) { struct sk_buff *next = buf_next(qbuf); if (buf_busy(qbuf)) @@ -118,6 +121,7 @@ qbuf = next; } quarantine_head = qbuf; + spin_unlock_bh(&qb_lock); if (!buf) return; @@ -126,12 +130,14 @@ return; } buf_set_next(buf, 0); + spin_lock_bh(&qb_lock); if (!quarantine_head) { quarantine_head = quarantine_tail = buf; } else { buf_set_next(quarantine_tail, buf); quarantine_tail = buf; } + spin_unlock_bh(&qb_lock); } void buf_stop(void) ---------------------------------------------------------------------------------------------------- Also, how do I interpret the output from tipc when the trouble happens. e.g.: net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): -- Mark Haverkamp <ma...@os...> |
From: Jon M. <jon...@er...> - 2004-06-09 17:56:54
|
You can check it in. I have still not merged with my code, so send me a note when it is ready. Thanks /jon Mark Haverkamp wrote: On Tue, 2004-06-08 at 13:21, Mark Haverkamp wrote: On Tue, 2004-06-08 at 11:45, Jon Maloy wrote: I took a little closer look at the recvbcast code, and notice a couple of things: First, the code does consistently use buf_safe_discard() when it seems to be sufficient with buf_discard(). This function is more expensive to use, but should not cause any problems if it were correctly implemented. Unfortunately, it is not. I have forgotten to protect the quarantine queue with a lock, and this may quite well cause havoc in the both this buffer queue and elsewhere. My guess is that the very strange messages we see in the dump in reality are invalid, -maybe a mix of different messages. Otherwise I can not explain the destination port number zero in the messages, which seems impossible if one follows the call chain bcast_port_recv_msg()->nameseq_deliver_msg()-> port_recv_msg()->net_route_msg()->net_route_named_msg(). An extra lock for the quarantine queue is needed, and this will hopefully fix the problem, but buf_safe_discard() should anyway be changed to buf_discard() if there is no particular reason for using it. The code that I was testing had a lock on the quarantine queue. One thing that may be the cause of problems in this case was that I did have page alloc debug turned on after all. It uses a whole page regardless of the allocation size as a debug tool. We may have just run out of pages. I am trying out the test once again without the page alloc debug compiled into the kernel. I still had two of the machines panic about 4 AM today even with page alloc debug compiled out. Here is the code I have been using with the lock for the quarantine list. It should probably be checked in anyway since it does seem to fix the hang that I was seeing. cvs diff -u buf.c Index: buf.c =================================================================== RCS file: /cvsroot/tipc/source/unstable/net/tipc/buf.c,v retrieving revision 1.12 diff -u -r1.12 buf.c --- buf.c 5 May 2004 19:09:03 -0000 1.12 +++ buf.c 9 Jun 2004 17:18:46 -0000 @@ -106,10 +106,13 @@ * queue instead. */ +static spinlock_t qb_lock = SPIN_LOCK_UNLOCKED; void buf_safe_discard(struct sk_buff *buf) { - struct sk_buff *qbuf = quarantine_head; + struct sk_buff *qbuf; + spin_lock_bh(&qb_lock); + qbuf = quarantine_head; while (qbuf) { struct sk_buff *next = buf_next(qbuf); if (buf_busy(qbuf)) @@ -118,6 +121,7 @@ qbuf = next; } quarantine_head = qbuf; + spin_unlock_bh(&qb_lock); if (!buf) return; @@ -126,12 +130,14 @@ return; } buf_set_next(buf, 0); + spin_lock_bh(&qb_lock); if (!quarantine_head) { quarantine_head = quarantine_tail = buf; } else { buf_set_next(quarantine_tail, buf); quarantine_tail = buf; } + spin_unlock_bh(&qb_lock); } void buf_stop(void) ------------------------------------------------------------------------ ---------------------------- Also, how do I interpret the output from tipc when the trouble happens. e.g.: net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0 ):PRND(1001008):ORIG(1001008:1678278664)::DEST(1001011:0): |
From: Mark H. <ma...@os...> - 2004-06-09 18:05:04
|
On Wed, 2004-06-09 at 10:56, Jon Maloy wrote: > You can check it in. I have still not merged with my code, > so send me a note when it is ready. OK, It's checked in. Mark. -- Mark Haverkamp <ma...@os...> |