From: Mark H. <ma...@os...> - 2004-05-11 22:24:14
|
On Thu, 2004-05-06 at 16:37, Jon Maloy wrote: > I think your analysis is correct, but I don't know where this omission > happens. The problem I am trying to solve may be related, > - when I run parallel links between two nodes comunication on > one link sometimes seem to stop under heavy load. > > With a little luck we are looking for the same bug. > > /Jon > Jon, I was thinking about this today and it occurred to me that spin_lock_bh doesn't prevent interrupts from happening. If true, we can get in a deadlock situation when a CPU has the node lock, an ethernet interrupt happens causing tipc_recv_msg to get called. One of the first things that tipc_recv_msg does is try to get the node lock. This seems to be a possible explanation for the spin hang on the node lock. Does this make sense to you? Mark. > > Mark Haverkamp wrote: > > >Jon, > > > >Daniel and I have been seeing a tipc hang for 3 or 4 weeks when we kill > >a running application in a certain order. > > > >While running the tipc benchmark program we can get tipc to hang the > >computer by killing the client while it has the 32 processes running. > >Although, to get the hang, I have to have tried to run some management > >port accesses which are stalled due to congestion. After doing some > >tracing, I have narrowed it down to an exiting process spinning while > >trying to get the node lock. Our assumption is that some other process > >hasn't released the lock by accident, although its not obvious where. I > >have included the stack dump from the sysrq P console command. > > > >SysRq : Show Regs > > > >Pid: 2001, comm: client_tipc_tp > >EIP: 0060:[<f8a913d9>] CPU: 0 > >EIP is at .text.lock.link+0xd7/0x3ce [tipc] > > EFLAGS: 00000286 Not tainted (2.6.6-rc3) > >EAX: f7c8ef6c EBX: 00000000 ECX: 01001011 EDX: 00000013 > >ESI: f7c8eee0 EDI: f359a000 EBP: f359bcf8 DS: 007b ES: 007b > >CR0: 8005003b CR2: 080e2ce8 CR3: 0053d000 CR4: 000006d0 > >Call Trace: > > [<c0126a38>] __do_softirq+0xb8/0xc0 > > [<f8a9818b>] net_route_msg+0x48b/0x4ad [tipc] > > [<c015b3a1>] __pte_chain_free+0x81/0x90 > > [<f8a99e6e>] port_send_proto_msg+0x1ae/0x2d0 [tipc] > > [<f8a9af73>] port_abort_peer+0x83/0x90 [tipc] > > [<f8a999a1>] tipc_deleteport+0x181/0x2a0 [tipc] > > [<f8aa7ae2>] release+0x72/0x130 [tipc] > > [<c0378ff9>] sock_release+0x99/0xf0 > > [<c0379a16>] sock_close+0x36/0x50 > > [<c016740d>] __fput+0x12d/0x140 > > [<c0165857>] filp_close+0x57/0x90 > > [<c0123adc>] put_files_struct+0x7c/0xf0 > > [<c0124b1c>] do_exit+0x26c/0x600 > > [<c012cc05>] __dequeue_signal+0xf5/0x1b0 > > [<c0125057>] do_group_exit+0x107/0x190 > > [<c012cced>] dequeue_signal+0x2d/0x90 > > [<c012f14c>] get_signal_to_deliver+0x28c/0x590 > > [<c0105286>] do_signal+0xb6/0xf0 > > [<c037a736>] sys_send+0x36/0x40 > > [<c037af8e>] sys_socketcall+0x12e/0x240 > > [<c010531b>] do_notify_resume+0x5b/0x5d > > [<c010554a>] work_notifysig+0x13/0x15 > > > >You can see that the process is trying to exit. I have traced the EIP to > >the spin_lock_bh(&node->lock) in link_lock_select from a disassembly of > >link.o. > > > >Any ideas on this? > > > >Mark. > > > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by Sleepycat Software > Learn developer strategies Cisco, Motorola, Ericsson & Lucent use to deliver > higher performing products faster, at low TCO. > http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3 > _______________________________________________ > TIPC-discussion mailing list > TIP...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion -- Mark Haverkamp <ma...@os...> |