From: Mark H. <ma...@os...> - 2004-06-08 20:23:30
On Tue, 2004-06-08 at 11:45, Jon Maloy wrote:
> I took a little closer look at the recvbcast code, and notice
> a couple of things:
> First, the code does consistently use buf_safe_discard() when it seems
> to be sufficient with buf_discard(). This function is more
> expensive to use, but should not cause any problems if it
> were correctly implemented.
> Unfortunately, it is not. I have forgotten to protect the quarantine
> queue with a lock, and this may quite well cause havoc in the
> both this buffer queue and elsewhere. My guess is that the very
> strange messages we see in the dump in reality are invalid,
> -maybe a mix of different messages. Otherwise I can not
> explain the destination port number zero in the messages, which
> seems impossible if one follows the call chain
> bcast_port_recv_msg()->nameseq_deliver_msg()->
> port_recv_msg()->net_route_msg()->net_route_named_msg().
>
> An extra lock for the quarantine queue is needed, and this will hopefully
> fix the problem, but buf_safe_discard() should anyway be changed to
> buf_discard() if there is no particular reason for using it.

The code that I was testing had a lock on the quarantine queue. One
thing that may be the cause of problems in this case was that I did have
page alloc debug turned on after all. It uses a whole page regardless
of the allocation size as a debug tool. We may have just run out of
pages. I am trying out the test once again without the page alloc debug
compiled into the kernel.

Mark.

>
> /jon
>
> Mark Haverkamp wrote:
>
> >I ran my 4 node test yesterday with a lock around access to the
> >quarantine_head in buf_safe_discard.
It didn't hang this time but after > >about 14 hours or so two of the machines got something like this: > > > > > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > 
>net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > 
>net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001011):ORIG(1001011:1642938376)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > 
>net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > >net->drop_nam:DAT0:MCST:REROUTED(1):HZ(44):SZ(713):SQNO(0):ACK(0):BACK(0):PRND(1001012):ORIG(1001012:937762824)::DEST(1001013:0): > 
>TIPC: Lost Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > >TIPC: Lost contact with <1.1.17> > >bad: scheduling while atomic! > >TIPC: Established Link <1.1.19:eth1-1.1.17:eth1> on Network Plane A > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >Debug: sleeping function called from invalid context at mm/slab.c:1994 > >in_atomic():1, irqs_disabled():0 > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c011e0c9>] __might_sleep+0x99/0xb0 > > [<c014bcdf>] kmem_cache_alloc+0x21f/0x230 > > [<c03786a3>] alloc_skb+0x23/0xf0 > > [<c037795e>] sock_alloc_send_pskb+0xce/0x1f0 > > [<c0377aae>] sock_alloc_send_skb+0x2e/0x40 > > [<c03dfe69>] unix_stream_sendmsg+0x199/0x3f0 > > [<c0374a3d>] sock_aio_write+0xbd/0xe0 > > [<c0165cd7>] do_sync_write+0x87/0xc0 > > [<c0165df9>] vfs_write+0xe9/0x120 > > [<c0165ecf>] sys_write+0x3f/0x60 > > [<c0105363>] syscall_call+0x7/0xb > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! > > [<c010618e>] dump_stack+0x1e/0x30 > > [<c03f8d84>] schedule+0x6b4/0x6c0 > > [<c010538a>] work_resched+0x5/0x16 > > > >bad: scheduling while atomic! 
> > [<c010618e>] dump_stack+0x1e/0x30
> > [<c03f8d84>] schedule+0x6b4/0x6c0
> > [<c03f95ce>] schedule_timeout+0x6e/0xc0
> > [<c01941c5>] ep_poll+0x135/0x1b0
> > [<c0192e8b>] sys_epoll_wait+0xab/0xb0
> > [<c0105363>] syscall_call+0x7/0xb
> >
> >bad: scheduling while atomic!
> > [<c010618e>] dump_stack+0x1e/0x30
> > [<c03f8d84>] schedule+0x6b4/0x6c0
> > [<c011d0cd>] sys_sched_yield+0x5d/0x90
> > [<c01741c3>] coredump_wait+0x43/0xb0
> > [<c0174398>] do_coredump+0x168/0x271
> > [<c012e1a7>] get_signal_to_deliver+0x287/0x510
> > [<c0105126>] do_signal+0xb6/0xf0
> > [<c01051bb>] do_notify_resume+0x5b/0x5d
> > [<c01053ae>] work_notifysig+0x13/0x15
> >
> >Kernel panic: Aiee, killing interrupt handler!
> >In interrupt handler - not syncing
> >
> >
> >I'm not sure what to make of this. I don't see TIPC on the stack, but
> >who knows. I'll try page alloc debug to see if there is some re-using
> >of free memory going on.
> >
> >Mark
> >

-- 
Mark Haverkamp <ma...@os...>
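
For reference, here is a rough user-space sketch of what "a lock around access to the quarantine_head" could look like. It is not the TIPC code: struct buf, quarantine_push() and quarantine_flush() are hypothetical names made up for illustration, and a pthread mutex stands in for whatever lock the kernel code would use (presumably a spinlock). The only point is that both the insert and the detach of the list head happen under the same lock, and that the buffers are freed after the lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer; NOT the real TIPC/sk_buff layout. */
struct buf {
	struct buf *next;
};

/* quarantine_head is the name used in the thread; the list layout is assumed. */
static struct buf *quarantine_head;
static pthread_mutex_t quarantine_lock = PTHREAD_MUTEX_INITIALIZER;

/* Put a buffer on the quarantine list. The head is only touched under the lock. */
static void quarantine_push(struct buf *b)
{
	assert(b != NULL);
	pthread_mutex_lock(&quarantine_lock);
	b->next = quarantine_head;
	quarantine_head = b;
	pthread_mutex_unlock(&quarantine_lock);
}

/*
 * Detach the whole list under the lock, then free the buffers with the
 * lock released. Returns the number of buffers freed.
 */
static size_t quarantine_flush(void)
{
	struct buf *list;
	size_t freed = 0;

	pthread_mutex_lock(&quarantine_lock);
	list = quarantine_head;
	quarantine_head = NULL;
	pthread_mutex_unlock(&quarantine_lock);

	while (list != NULL) {
		struct buf *next = list->next;

		free(list);
		list = next;
		freed++;
	}
	return freed;
}
```

Detaching the list first keeps the locked section short, so nothing that might block or take long runs while the lock is held.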