From: Jon M. <jon...@er...> - 2004-06-08 18:45:48
I took a closer look at the recvbcast code and noticed a couple of things.

First, the code consistently uses buf_safe_discard() where buf_discard() seems to be sufficient. buf_safe_discard() is more expensive to use, but it should not cause any problems if it were correctly implemented. Unfortunately, it is not: I forgot to protect the quarantine queue with a lock, and this may well cause havoc both in this buffer queue and elsewhere.

My guess is that the very strange messages we see in the dump are in reality invalid, maybe a mix of different messages. Otherwise I cannot explain the destination port number zero in the messages, which seems impossible if one follows the call chain
bcast_port_recv_msg()->nameseq_deliver_msg()->port_recv_msg()->net_route_msg()->net_route_named_msg().

An extra lock for the quarantine queue is needed, and this will hopefully fix the problem. In any case, buf_safe_discard() should be changed to buf_discard() unless there is a particular reason for using it.

/jon

Mark Haverkamp wrote:

>I ran my 4 node test yesterday with a lock around access to the
>quarantine_head in buf_safe_discard.
It didn't hang this time but after >about 14 hours or so two of the machines got something like this: > > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): 
>net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): 
>net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): 
>net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): >TIPC: Lost Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A >TIPC: Lost contact with <1.1.17> >bad: scheduling while atomic! 
>TIPC: Established Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >Debug: sleeping function called from invalid context at mm/slab.c:1994 >in_atomic():1, irqs_disabled():0 > [<c010618e>] dump_stack+0x1e/0x30 > [<c011e0c9>] __might_sleep+0x99/0xb0 > [<c014bcdf>] kmem_cache_alloc+0x21f/0x230 > [<c03786a3>] alloc_skb+0x23/0xf0 > [<c037795e>] sock_alloc_send_pskb+0xce/0x1f0 > [<c0377aae>] sock_alloc_send_skb+0x2e/0x40 > [<c03dfe69>] unix_stream_sendmsg+0x199/0x3f0 > [<c0374a3d>] sock_aio_write+0xbd/0xe0 > [<c0165cd7>] do_sync_write+0x87/0xc0 > [<c0165df9>] vfs_write+0xe9/0x120 > [<c0165ecf>] sys_write+0x3f/0x60 > [<c0105363>] syscall_call+0x7/0xb > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c010538a>] work_resched+0x5/0x16 > >bad: scheduling while atomic! > [<c010618e>] dump_stack+0x1e/0x30 > [<c03f8d84>] schedule+0x6b4/0x6c0 > [<c03f95ce>] schedule_timeout+0x6e/0xc0 > [<c01941c5>] ep_poll+0x135/0x1b0 > [<c0192e8b>] sys_epoll_wait+0xab/0xb0 > [<c0105363>] syscall_call+0x7/0xb > >bad: scheduling while atomic! 
> [<c010618e>] dump_stack+0x1e/0x30
> [<c03f8d84>] schedule+0x6b4/0x6c0
> [<c011d0cd>] sys_sched_yield+0x5d/0x90
> [<c01741c3>] coredump_wait+0x43/0xb0
> [<c0174398>] do_coredump+0x168/0x271
> [<c012e1a7>] get_signal_to_deliver+0x287/0x510
> [<c0105126>] do_signal+0xb6/0xf0
> [<c01051bb>] do_notify_resume+0x5b/0x5d
> [<c01053ae>] work_notifysig+0x13/0x15
>
>Kernel panic: Aiee, killing interrupt handler!
>In interrupt handler - not syncing
>
>
>I'm not sure what to make of this. I don't see TIPC on the stack, but
>who knows. I'll try page alloc debug to see if there is some re-using
>of free memory going on.
>
>Mark
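
For readers following the thread, the missing lock on the quarantine queue described at the top of this message can be illustrated with a small user-space sketch. This is not the TIPC code: the names qbuf, quarantine_discard(), quarantine_drain() and run_concurrent_discards() are hypothetical, and a pthread mutex stands in for whatever lock the kernel code would use (in softirq context that would normally be a spinlock rather than a sleeping lock). The only point is that every access to the shared list head happens with the lock held.

```c
/* User-space sketch only; all names are hypothetical, not TIPC's. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct qbuf {
    struct qbuf *next;
};

static struct qbuf *quarantine_head;    /* shared list of deferred buffers */
static pthread_mutex_t quarantine_lock = PTHREAD_MUTEX_INITIALIZER;

/* Defer freeing a buffer: link it into the quarantine list under the lock. */
static void quarantine_discard(struct qbuf *b)
{
    pthread_mutex_lock(&quarantine_lock);
    b->next = quarantine_head;
    quarantine_head = b;
    pthread_mutex_unlock(&quarantine_lock);
}

/* Detach the whole list under the lock, then free it outside the lock.
 * Returns the number of buffers released. */
static size_t quarantine_drain(void)
{
    struct qbuf *list;
    size_t n = 0;

    pthread_mutex_lock(&quarantine_lock);
    list = quarantine_head;
    quarantine_head = NULL;
    pthread_mutex_unlock(&quarantine_lock);

    while (list != NULL) {
        struct qbuf *next = list->next;
        free(list);
        list = next;
        n++;
    }
    return n;
}

/* Each worker discards *count freshly allocated buffers. */
static void *discard_worker(void *arg)
{
    size_t count = *(size_t *)arg;

    for (size_t i = 0; i < count; i++) {
        struct qbuf *b = malloc(sizeof(*b));
        assert(b != NULL);
        quarantine_discard(b);
    }
    return NULL;
}

/* Run nthreads concurrent discarders, then drain the queue.
 * Returns the number of buffers that were found in the queue. */
static size_t run_concurrent_discards(int nthreads, size_t per_thread)
{
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)nthreads);
    int started = 0;

    assert(tids != NULL);
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, discard_worker, &per_thread) == 0)
            started++;
    }
    assert(started == nthreads);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return quarantine_drain();
}
```

Calling run_concurrent_discards(4, 10000) should drain exactly 40000 buffers. If the lock and unlock calls in quarantine_discard() are removed, two threads can read the same head value and one insertion is lost or the list is corrupted, which is the kind of damage suspected above.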