Re: [Tipc-discussion] Re: hang while deleting ports


On Thu, 2004-05-06 at 16:37, Jon Maloy wrote:
> I think your analysis is correct, but I don't know where this omission
> happens. The problem I am trying to solve may be related:
> when I run parallel links between two nodes, communication on
> one link sometimes seems to stop under heavy load.
> 
> With a little luck we are looking for the same bug.
> 
> /Jon
> 

Jon,

I was thinking about this today and it occurred to me that spin_lock_bh
doesn't prevent hardware interrupts from happening.  If that's true, we
can deadlock when a CPU holds the node lock and an ethernet interrupt
arrives on that same CPU, causing tipc_recv_msg to be called.  One of
the first things tipc_recv_msg does is try to take the node lock.  That
seems to be a possible explanation for the spin hang on the node lock.
Does this make sense to you?
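
To make the scenario concrete, here is a rough user-space sketch of it,
not TIPC code.  It assumes the receive path really can be entered from an
interrupt on the CPU that already holds the lock, which is the open
question above.  The names node_lock, fake_recv_msg and the two irq_*
helpers are made up for the illustration, and the spin is bounded so that
the demo terminates instead of hanging the way the kernel would.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the node lock: a non-recursive spinlock. */
static atomic_flag node_lock = ATOMIC_FLAG_INIT;

/* Try once to take the lock; returns true if we got it. */
static bool node_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&node_lock,
                                              memory_order_acquire);
}

static void node_unlock(void)
{
    atomic_flag_clear_explicit(&node_lock, memory_order_release);
}

/*
 * Models a receive handler that needs the node lock.  A real spinlock
 * would spin forever; here we give up after max_spins attempts and
 * report whether the lock was ever acquired.
 */
static bool fake_recv_msg(int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (node_trylock()) {
            node_unlock();
            return true;
        }
    }
    return false;
}

/*
 * The "interrupt" runs on top of the lock holder, on the same CPU.
 * The holder cannot continue until the handler returns, so the lock
 * is never released while the handler spins.
 */
static bool irq_while_lock_held(void)
{
    bool got_lock;

    (void)node_trylock();            /* process context takes the lock */
    got_lock = fake_recv_msg(1000);  /* handler spins and never wins   */
    node_unlock();                   /* reached only because of the bound */
    return got_lock;
}

/* Same handler, but nobody holds the lock when it runs. */
static bool irq_while_lock_free(void)
{
    return fake_recv_msg(1000);
}
```

In this model irq_while_lock_held() returns false and
irq_while_lock_free() returns true.  In the kernel, the usual way to
close this kind of window is to take the lock with interrupts disabled
(spin_lock_irqsave) wherever it can also be taken from interrupt context;
whether that applies here depends on where tipc_recv_msg is called from.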

Mark.

> 
> Mark Haverkamp wrote:
> 
> >Jon,
> >
> >Daniel and I have been seeing a tipc hang for 3 or 4 weeks when we kill
> >a running application in a certain order.
> >
> >While running the tipc benchmark program we can get tipc to hang the
> >computer by killing the client while it has the 32 processes running.
> >To get the hang, though, I first have to attempt some management
> >port accesses, which then stall due to congestion.  After doing some
> >tracing, I have narrowed it down to an exiting process spinning while
> >trying to get the node lock.  Our assumption is that some other process
> >failed to release the lock by mistake, although it's not obvious where.  I
> >have included the stack dump from the SysRq P console command.
> >
> >SysRq : Show Regs
> >                  
> >Pid: 2001, comm:       client_tipc_tp
> >EIP: 0060:[<f8a913d9>] CPU: 0
> >EIP is at .text.lock.link+0xd7/0x3ce [tipc]
> > EFLAGS: 00000286    Not tainted  (2.6.6-rc3)
> >EAX: f7c8ef6c EBX: 00000000 ECX: 01001011 EDX: 00000013
> >ESI: f7c8eee0 EDI: f359a000 EBP: f359bcf8 DS: 007b ES: 007b
> >CR0: 8005003b CR2: 080e2ce8 CR3: 0053d000 CR4: 000006d0
> >Call Trace:
> > [<c0126a38>] __do_softirq+0xb8/0xc0
> > [<f8a9818b>] net_route_msg+0x48b/0x4ad [tipc]
> > [<c015b3a1>] __pte_chain_free+0x81/0x90
> > [<f8a99e6e>] port_send_proto_msg+0x1ae/0x2d0 [tipc]
> > [<f8a9af73>] port_abort_peer+0x83/0x90 [tipc]
> > [<f8a999a1>] tipc_deleteport+0x181/0x2a0 [tipc]
> > [<f8aa7ae2>] release+0x72/0x130 [tipc]
> > [<c0378ff9>] sock_release+0x99/0xf0
> > [<c0379a16>] sock_close+0x36/0x50
> > [<c016740d>] __fput+0x12d/0x140
> > [<c0165857>] filp_close+0x57/0x90
> > [<c0123adc>] put_files_struct+0x7c/0xf0
> > [<c0124b1c>] do_exit+0x26c/0x600
> > [<c012cc05>] __dequeue_signal+0xf5/0x1b0
> > [<c0125057>] do_group_exit+0x107/0x190
> > [<c012cced>] dequeue_signal+0x2d/0x90
> > [<c012f14c>] get_signal_to_deliver+0x28c/0x590
> > [<c0105286>] do_signal+0xb6/0xf0
> > [<c037a736>] sys_send+0x36/0x40
> > [<c037af8e>] sys_socketcall+0x12e/0x240
> > [<c010531b>] do_notify_resume+0x5b/0x5d
> > [<c010554a>] work_notifysig+0x13/0x15
> >
> >You can see that the process is trying to exit. I have traced the EIP to
> >the spin_lock_bh(&node->lock) in link_lock_select from a disassembly of
> >link.o.
> >
> >Any ideas on this?
> >
> >Mark.
> >  
> >
> 
-- 
Mark Haverkamp <ma...@os...>