From: Mark H. <ma...@os...> - 2004-05-06 22:41:27
Jon,

Daniel and I have been seeing a tipc hang for 3 or 4 weeks when we kill
a running application in a certain order.

While running the tipc benchmark program we can get tipc to hang the
computer by killing the client while it has the 32 processes running.
To get the hang, though, I also have to have attempted some management
port accesses that are stalled due to congestion. After doing some
tracing, I have narrowed it down to an exiting process spinning while
trying to get the node lock. Our assumption is that some other process
has accidentally failed to release the lock, although it's not obvious
where. I have included the stack dump from the sysrq P console command.

SysRq : Show Regs

Pid: 2001, comm: client_tipc_tp
EIP: 0060:[<f8a913d9>] CPU: 0
EIP is at .text.lock.link+0xd7/0x3ce [tipc]
 EFLAGS: 00000286 Not tainted (2.6.6-rc3)
EAX: f7c8ef6c EBX: 00000000 ECX: 01001011 EDX: 00000013
ESI: f7c8eee0 EDI: f359a000 EBP: f359bcf8 DS: 007b ES: 007b
CR0: 8005003b CR2: 080e2ce8 CR3: 0053d000 CR4: 000006d0
Call Trace:
 [<c0126a38>] __do_softirq+0xb8/0xc0
 [<f8a9818b>] net_route_msg+0x48b/0x4ad [tipc]
 [<c015b3a1>] __pte_chain_free+0x81/0x90
 [<f8a99e6e>] port_send_proto_msg+0x1ae/0x2d0 [tipc]
 [<f8a9af73>] port_abort_peer+0x83/0x90 [tipc]
 [<f8a999a1>] tipc_deleteport+0x181/0x2a0 [tipc]
 [<f8aa7ae2>] release+0x72/0x130 [tipc]
 [<c0378ff9>] sock_release+0x99/0xf0
 [<c0379a16>] sock_close+0x36/0x50
 [<c016740d>] __fput+0x12d/0x140
 [<c0165857>] filp_close+0x57/0x90
 [<c0123adc>] put_files_struct+0x7c/0xf0
 [<c0124b1c>] do_exit+0x26c/0x600
 [<c012cc05>] __dequeue_signal+0xf5/0x1b0
 [<c0125057>] do_group_exit+0x107/0x190
 [<c012cced>] dequeue_signal+0x2d/0x90
 [<c012f14c>] get_signal_to_deliver+0x28c/0x590
 [<c0105286>] do_signal+0xb6/0xf0
 [<c037a736>] sys_send+0x36/0x40
 [<c037af8e>] sys_socketcall+0x12e/0x240
 [<c010531b>] do_notify_resume+0x5b/0x5d
 [<c010554a>] work_notifysig+0x13/0x15

You can see that the process is trying to exit. I have traced the EIP
to the spin_lock_bh(&node->lock) in link_lock_select from a disassembly
of link.o.

Any ideas on this?

Mark.
--
Mark Haverkamp <ma...@os...>
From: Jon M. <jon...@er...> - 2004-05-06 23:37:40
I think your analysis is correct, but I don't know where this omission
happens. The problem I am trying to solve may be related: when I run
parallel links between two nodes, communication on one link sometimes
seems to stop under heavy load.

With a little luck we are looking for the same bug.

/Jon
From: Mark H. <ma...@os...> - 2004-05-11 22:24:14
On Thu, 2004-05-06 at 16:37, Jon Maloy wrote:
> I think your analysis is correct, but I don't know where this omission
> happens. The problem I am trying to solve may be related: when I run
> parallel links between two nodes, communication on one link sometimes
> seems to stop under heavy load.
>
> With a little luck we are looking for the same bug.
>
> /Jon

Jon,

I was thinking about this today and it occurred to me that spin_lock_bh
doesn't prevent interrupts from happening. If true, we can get into a
deadlock situation when a CPU holds the node lock and an ethernet
interrupt arrives, causing tipc_recv_msg to get called. One of the
first things that tipc_recv_msg does is try to get the node lock. This
seems to be a possible explanation for the spin hang on the node lock.

Does this make sense to you?

Mark.
--
Mark Haverkamp <ma...@os...>
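For illustration, the deadlock Mark is describing looks roughly like the
sketch below. The lock and function names are placeholders, not the actual
TIPC code; the point is only that spin_lock_bh() disables bottom halves but
not hard interrupts, so a handler that takes the same lock on the same CPU
spins forever.

    /* Sketch of the suspected deadlock -- names are placeholders, not TIPC code. */
    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    static spinlock_t node_lock = SPIN_LOCK_UNLOCKED;  /* 2.6.6-era initializer */

    /* Process context, e.g. an exiting task tearing down a port: */
    static void exit_path(void)
    {
            spin_lock_bh(&node_lock);   /* stops softirqs/tasklets, NOT hard irqs */
            /* ... NIC raises a hard interrupt on this CPU right here ... */
            spin_unlock_bh(&node_lock);
    }

    /* Hard interrupt context, e.g. the ethernet driver handing the frame
     * to the receive routine (2.6.6-era handler signature): */
    static irqreturn_t rx_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            spin_lock(&node_lock);      /* spins forever: the holder is the task
                                         * this interrupt preempted on this CPU */
            /* ... deliver the packet ... */
            spin_unlock(&node_lock);
            return IRQ_HANDLED;
    }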
From: Jon M. <jon...@er...> - 2004-05-12 14:15:58
Mark,

I think you can easily verify your hypothesis by replacing
spin_lock_bh() with a spin_trylock_bh() in tipc_recv_msg(), and
then just discard the packet and continue the loop if it fails.
(If this happens the packet will be retransmitted anyway.)

It is worth a try.

/Jon
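A rough sketch of the change Jon is suggesting, assuming a receive loop of
roughly this shape; the struct layout and the node_for_buf()/process_buf()
helpers are hypothetical stand-ins, not the real tipc_recv_msg() internals.
spin_trylock_bh() returns nonzero only when the lock was acquired, so a
failure simply drops the buffer.

    /* Sketch only -- the struct and helpers below are hypothetical stand-ins. */
    #include <linux/spinlock.h>
    #include <linux/skbuff.h>

    struct node {                        /* stand-in for the TIPC node structure */
            spinlock_t lock;
    };

    extern struct node *node_for_buf(struct sk_buff *buf);        /* hypothetical */
    extern void process_buf(struct node *n, struct sk_buff *buf); /* hypothetical */

    static void recv_loop_sketch(struct sk_buff_head *queue)
    {
            struct sk_buff *buf;

            while ((buf = skb_dequeue(queue)) != NULL) {
                    struct node *n = node_for_buf(buf);

                    if (!spin_trylock_bh(&n->lock)) {
                            /* Lock busy (possibly held by the context we
                             * interrupted): drop instead of spinning; the
                             * link layer will retransmit the packet. */
                            kfree_skb(buf);
                            continue;
                    }
                    process_buf(n, buf);
                    spin_unlock_bh(&n->lock);
            }
    }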
From: Mark H. <ma...@os...> - 2004-05-12 14:21:12
On Wed, 2004-05-12 at 07:15, Jon Maloy wrote:
> Mark,
> I think you can easily verify your hypothesis by replacing
> spin_lock_bh() with a spin_trylock_bh() in tipc_recv_msg(), and
> then just discard the packet and continue the loop if it fails.
> (If this happens the packet will be retransmitted anyway.)
> It is worth a try.

I'll give it a try.

Out of curiosity, why are you using the *_bh versions of spinlock?

--
Mark Haverkamp <ma...@os...>
From: Jon M. <jon...@er...> - 2004-05-12 14:55:10
Well, we do have bottom halves (timer and signal tasklets) that may
interfere with message sending/reception if executed, even on UPs.
I am sure there are cases when the _bh() version is not really needed,
but I wanted to keep it safe when I developed it. Something we can look
into later, maybe.

/jon
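As a generic illustration of the distinction (not TIPC-specific code):
spin_lock_bh() only keeps softirq/tasklet code off the lock on the local
CPU, while the _irqsave variant also excludes hard interrupt handlers,
which is what matters if the same lock can be taken from interrupt context.

    #include <linux/spinlock.h>

    static spinlock_t lock = SPIN_LOCK_UNLOCKED;

    /* Protects against tasklets/timers (softirqs) on this CPU,
     * but not against a hard interrupt handler taking 'lock': */
    static void bh_protected(void)
    {
            spin_lock_bh(&lock);
            /* ... critical section ... */
            spin_unlock_bh(&lock);
    }

    /* Also protects against hard interrupts on this CPU: */
    static void irq_protected(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&lock, flags);
            /* ... critical section ... */
            spin_unlock_irqrestore(&lock, flags);
    }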
From: Jon M. <jon...@er...> - 2004-05-11 23:22:49
I am not sure. In your first analysis of the problem you found that it
was a dying process that was causing the processor to hang while trying
to grab the node lock, which was then presumably held by the
tipc_recv_msg interrupt, or a tasklet. That is the opposite scenario of
what you describe here.

Also, this was no different in the old implementation: we shared the
same lock between user level/tasklet code and interrupts without any
problems. If you are right, why don't we have deadlocks all the time
then?

I think I will re-read Rusty Lynch's "unreliable guide" once more...

Regards
/Jon
From: Mark H. <ma...@os...> - 2004-05-12 14:18:49
On Tue, 2004-05-11 at 16:22, Jon Maloy wrote:
> I am not sure. In your first analysis of the problem you found that it
> was a dying process that was causing the processor to hang while trying
> to grab the node lock, which was then presumably held by the
> tipc_recv_msg interrupt, or a tasklet.
> That is the opposite scenario of what you describe here.

I see the hang both ways.

> Also, this was no different in the old implementation: we shared the
> same lock between user level/tasklet code and interrupts without any
> problems. If you are right, why don't we have deadlocks all the time
> then?

That is a good point, but it does seem a possibility.

> I think I will re-read Rusty Lynch's "unreliable guide" once more...

That is Rusty Russell.

--
Mark Haverkamp <ma...@os...>