From: Jon M. <jon...@er...> - 2004-05-12 14:15:58
|
Mark, I think you can easily verify your hypothesis by replacing spin_lock_bh() with a spin_trylock_bh() in tipc_recv_msg(), and then just discard the packet and continue the loop if it fails. (If this happens the packet will be retransmitted anyway.) It is worth a try. /Jon Mark Haverkamp wrote: On Thu, 2004-05-06 at 16:37, Jon Maloy wrote: I think your analysis is correct, but I don't know where this omission happens. The problem I am trying to solve may be related, - when I run parallel links between two nodes comunication on one link sometimes seem to stop under heavy load. With a little luck we are looking for the same bug. /Jon Jon, I was thinking about this today and it occurred to me that spin_lock_bh doesn't prevent interrupts from happening. If true, we can get in a deadlock situation when a CPU has the node lock, an ethernet interrupt happens causing tipc_recv_msg to get called. One of the first things that tipc_recv_msg does is try to get the node lock. This seems to be a possible explanation for the spin hang on the node lock. Does this make sense to you? Mark. Mark Haverkamp wrote: Jon, Daniel and I have been seeing a tipc hang for 3 or 4 weeks when we kill a running application in a certain order. While running the tipc benchmark program we can get tipc to hang the computer by killing the client while it has the 32 processes running. Although, to get the hang, I have to have tried to run some management port accesses which are stalled due to congestion. After doing some tracing, I have narrowed it down to an exiting process spinning while trying to get the node lock. Our assumption is that some other process hasn't released the lock by accident, although its not obvious where. I have included the stack dump from the sysrq P console command. SysRq : Show Regs Pid: 2001, comm: client_tipc_tp EIP: 0060:[<f8a913d9>] CPU: 0 EIP is at .text.lock.link+0xd7/0x3ce [tipc] EFLAGS: 00000286 Not tainted (2.6.6-rc3) EAX: f7c8ef6c EBX: 00000000 ECX: 01001011 EDX: 00000013 ESI: f7c8eee0 EDI: f359a000 EBP: f359bcf8 DS: 007b ES: 007b CR0: 8005003b CR2: 080e2ce8 CR3: 0053d000 CR4: 000006d0 Call Trace: [<c0126a38>] __do_softirq+0xb8/0xc0 [<f8a9818b>] net_route_msg+0x48b/0x4ad [tipc] [<c015b3a1>] __pte_chain_free+0x81/0x90 [<f8a99e6e>] port_send_proto_msg+0x1ae/0x2d0 [tipc] [<f8a9af73>] port_abort_peer+0x83/0x90 [tipc] [<f8a999a1>] tipc_deleteport+0x181/0x2a0 [tipc] [<f8aa7ae2>] release+0x72/0x130 [tipc] [<c0378ff9>] sock_release+0x99/0xf0 [<c0379a16>] sock_close+0x36/0x50 [<c016740d>] __fput+0x12d/0x140 [<c0165857>] filp_close+0x57/0x90 [<c0123adc>] put_files_struct+0x7c/0xf0 [<c0124b1c>] do_exit+0x26c/0x600 [<c012cc05>] __dequeue_signal+0xf5/0x1b0 [<c0125057>] do_group_exit+0x107/0x190 [<c012cced>] dequeue_signal+0x2d/0x90 [<c012f14c>] get_signal_to_deliver+0x28c/0x590 [<c0105286>] do_signal+0xb6/0xf0 [<c037a736>] sys_send+0x36/0x40 [<c037af8e>] sys_socketcall+0x12e/0x240 [<c010531b>] do_notify_resume+0x5b/0x5d [<c010554a>] work_notifysig+0x13/0x15 You can see that the process is trying to exit. I have traced the EIP to the spin_lock_bh(&node->lock) in link_lock_select from a disassembly of link.o. Any ideas on this? Mark. ------------------------------------------------------- This SF.Net email is sponsored by Sleepycat Software Learn developer strategies Cisco, Motorola, Ericsson & Lucent use to deliver higher performing products faster, at low TCO. http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3 <http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3> _______________________________________________ TIPC-discussion mailing list TIP...@li... <mailto:TIP...@li...> https://lists.sourceforge.net/lists/listinfo/tipc-discussion <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> |