From: Jon M. <jon...@er...> - 2004-05-06 23:37:40
I think your analysis is correct, but I don't know where this omission
happens. The problem I am trying to solve may be related: when I run
parallel links between two nodes, communication on one link sometimes
seems to stop under heavy load. With a little luck we are looking for
the same bug.

/Jon

Mark Haverkamp wrote:

>Jon,
>
>Daniel and I have been seeing a tipc hang for 3 or 4 weeks when we kill
>a running application in a certain order.
>
>While running the tipc benchmark program we can get tipc to hang the
>computer by killing the client while it has the 32 processes running.
>Although, to get the hang, I have to have tried to run some management
>port accesses which are stalled due to congestion. After doing some
>tracing, I have narrowed it down to an exiting process spinning while
>trying to get the node lock. Our assumption is that some other process
>hasn't released the lock by accident, although it's not obvious where. I
>have included the stack dump from the sysrq P console command.
>
>SysRq : Show Regs
>
>Pid: 2001, comm: client_tipc_tp
>EIP: 0060:[<f8a913d9>] CPU: 0
>EIP is at .text.lock.link+0xd7/0x3ce [tipc]
> EFLAGS: 00000286 Not tainted (2.6.6-rc3)
>EAX: f7c8ef6c EBX: 00000000 ECX: 01001011 EDX: 00000013
>ESI: f7c8eee0 EDI: f359a000 EBP: f359bcf8 DS: 007b ES: 007b
>CR0: 8005003b CR2: 080e2ce8 CR3: 0053d000 CR4: 000006d0
>Call Trace:
> [<c0126a38>] __do_softirq+0xb8/0xc0
> [<f8a9818b>] net_route_msg+0x48b/0x4ad [tipc]
> [<c015b3a1>] __pte_chain_free+0x81/0x90
> [<f8a99e6e>] port_send_proto_msg+0x1ae/0x2d0 [tipc]
> [<f8a9af73>] port_abort_peer+0x83/0x90 [tipc]
> [<f8a999a1>] tipc_deleteport+0x181/0x2a0 [tipc]
> [<f8aa7ae2>] release+0x72/0x130 [tipc]
> [<c0378ff9>] sock_release+0x99/0xf0
> [<c0379a16>] sock_close+0x36/0x50
> [<c016740d>] __fput+0x12d/0x140
> [<c0165857>] filp_close+0x57/0x90
> [<c0123adc>] put_files_struct+0x7c/0xf0
> [<c0124b1c>] do_exit+0x26c/0x600
> [<c012cc05>] __dequeue_signal+0xf5/0x1b0
> [<c0125057>] do_group_exit+0x107/0x190
> [<c012cced>] dequeue_signal+0x2d/0x90
> [<c012f14c>] get_signal_to_deliver+0x28c/0x590
> [<c0105286>] do_signal+0xb6/0xf0
> [<c037a736>] sys_send+0x36/0x40
> [<c037af8e>] sys_socketcall+0x12e/0x240
> [<c010531b>] do_notify_resume+0x5b/0x5d
> [<c010554a>] work_notifysig+0x13/0x15
>
>You can see that the process is trying to exit. I have traced the EIP to
>the spin_lock_bh(&node->lock) in link_lock_select from a disassembly of
>link.o.
>
>Any ideas on this?
>
>Mark.
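
To illustrate the kind of omission suspected above (some path taking the node lock and never releasing it), here is a minimal userspace sketch in C. It is not TIPC code and makes no claim about where the real bug is. A pthread mutex stands in for node->lock, and every name (demo_node, demo_send_buggy, demo_send_fixed, demo_lock_leaked) is hypothetical. The early return on congestion is only one assumed way a lock could be left held.

```c
/* Userspace sketch only: a pthread mutex stands in for the kernel's
 * node->lock, and every name here is hypothetical, not TIPC code. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct demo_node {
    pthread_mutex_t lock;   /* stand-in for node->lock */
    bool congested;         /* stand-in for a congested link */
    int sent;               /* messages accepted so far */
};

static void demo_node_init(struct demo_node *n, bool congested)
{
    pthread_mutex_init(&n->lock, NULL);
    n->congested = congested;
    n->sent = 0;
}

/* Buggy variant: the early return on congestion skips the unlock. */
static int demo_send_buggy(struct demo_node *n)
{
    pthread_mutex_lock(&n->lock);
    if (n->congested)
        return -EBUSY;      /* BUG: lock is still held on this path */
    n->sent++;
    pthread_mutex_unlock(&n->lock);
    return 0;
}

/* Fixed variant: a single exit path, so the unlock cannot be skipped. */
static int demo_send_fixed(struct demo_node *n)
{
    int rc = 0;

    pthread_mutex_lock(&n->lock);
    if (n->congested) {
        rc = -EBUSY;
        goto out;
    }
    n->sent++;
out:
    pthread_mutex_unlock(&n->lock);
    return rc;
}

/* Returns true if the lock was left held. Uses trylock so the check
 * returns immediately instead of waiting on a lock nobody will release. */
static bool demo_lock_leaked(struct demo_node *n)
{
    if (pthread_mutex_trylock(&n->lock) == EBUSY)
        return true;
    pthread_mutex_unlock(&n->lock);
    return false;
}
```

After demo_send_buggy() fails on a congested node, demo_lock_leaked() reports the lock as still held, so any later blocking acquire would wait forever. With a kernel spinlock the equivalent symptom would be a CPU spinning in the lock path, which is what the trace above appears to show. Auditing each return path between lock and unlock is one way to look for such an omission.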