From: Mark H. <ma...@os...> - 2004-05-12 18:24:38
Here is what looks to be happening with the spin lock deadlock. I replaced all the spin_lock_bh calls with a wrapper that tries to get the lock for a while and then prints out a debug message if it can't get the lock.

As an experiment, I changed the spin_lock_bh in link_wakeup_ports to a trylock and exited if it couldn't get the lock. I am now not able to get the deadlock.

CPU 0:
release --
    tipc_delete_port (get port lock) --
        port_abort_peer --
            port_send_proto_msg --
                net_route_msg --
                    link_send (get node lock) -- (hung spinning)

CPU 1:
common_interrupt --
    do_softirq --
        net_rx_action --
            netif_receive_skb --
                recv_msg (tipc eth) --
                    tipc_recv_msg (get node lock) --
                        link_wakeup_ports (get port lock) -- (hung spinning)

Stack dumps:

port lock timeout
Call Trace:
 [<f8a837ab>] link_wakeup_ports+0x9b/0x230 [tipc]
 [<f8a87c2e>] tipc_recv_msg+0x7fe/0x8c0 [tipc]
 [<c014949d>] __kmalloc+0x19d/0x250
 [<f8aa5db9>] recv_msg+0x39/0x50 [tipc]
 [<c0375af2>] netif_receive_skb+0x172/0x1b0
 [<c0375bb4>] process_backlog+0x84/0x120
 [<c0375cd0>] net_rx_action+0x80/0x120
 [<c0124bc8>] __do_softirq+0xb8/0xc0
 [<c0124c05>] do_softirq+0x35/0x40
 [<c0107ce5>] do_IRQ+0x175/0x230
 [<c0105ce0>] common_interrupt+0x18/0x20
 [<c0221c91>] copy_from_user+0x1/0x80
 [<f8a866bf>] link_send_sections_long+0x30f/0xb30 [tipc]
 [<c0221694>] __delay+0x14/0x20
 [<f8a8366f>] link_schedule_port+0x13f/0x1e0 [tipc]
 [<f8a860f5>] link_send_sections_fast+0x5b5/0x870 [tipc]
 [<c011b12a>] __wake_up_common+0x3a/0x60
 [<f8a97bf2>] tipc_send+0x92/0x9d0 [tipc]
 [<c011d736>] __mmdrop+0x36/0x50
 [<c03f15b7>] schedule+0x467/0x7a0
 [<f8aa33e6>] recv_msg+0x2b6/0x560 [tipc]
 [<f8aa2d90>] send_packet+0x90/0x180 [tipc]
 [<c011b0d0>] default_wake_function+0x0/0x20
 [<c036c83e>] sock_sendmsg+0x8e/0xb0
 [<f8aa5db9>] recv_msg+0x39/0x50 [tipc]
 [<c01435ba>] buffered_rmqueue+0xfa/0x220
 [<c036c61a>] sockfd_lookup+0x1a/0x80
 [<c036dd61>] sys_sendto+0xe1/0x100
 [<c0128f62>] del_timer_sync+0x42/0x140
 [<c036d109>] sock_poll+0x29/0x30
 [<c017884b>] do_pollfd+0x5b/0xa0
 [<c036ddb6>] sys_send+0x36/0x40
 [<c036e60e>] sys_socketcall+0x12e/0x240
 [<c0105373>] syscall_call+0x7/0xb

&node->lock lock timeout
Call Trace:
 [<f8a8549a>] link_send+0xda/0x2a0 [tipc]
 [<f8a92cee>] net_route_msg+0x41e/0x43d [tipc]
 [<f8a949c2>] port_send_proto_msg+0x1a2/0x2a0 [tipc]
 [<f8a95983>] port_abort_peer+0x83/0x90 [tipc]
 [<f8a9458f>] tipc_deleteport+0x19f/0x280 [tipc]
 [<f8aa25b2>] release+0x72/0x130 [tipc]
 [<c036c76b>] sock_release+0x7b/0xc0
 [<c036d176>] sock_close+0x36/0x50
 [<c016315a>] __fput+0x10a/0x120
 [<c0161597>] filp_close+0x57/0x90
 [<c0121dbc>] put_files_struct+0x7c/0xf0
 [<c0122d5a>] do_exit+0x23a/0x5a0
 [<c012aa35>] __dequeue_signal+0xf5/0x1b0
 [<c0123240>] do_group_exit+0xe0/0x150
 [<c012ab1d>] dequeue_signal+0x2d/0x90
 [<c012cbef>] get_signal_to_deliver+0x26f/0x510
 [<c0105136>] do_signal+0xb6/0xf0
 [<c036ddb6>] sys_send+0x36/0x40
 [<c036e60e>] sys_socketcall+0x12e/0x240
 [<c01051cb>] do_notify_resume+0x5b/0x5d
 [<c01053be>] work_notifysig+0x13/0x15

--
Mark Haverkamp <ma...@os...>
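[Editor's note: the scenario above is the classic ABBA ordering problem: the send/close path takes the port lock and then the node lock, while the receive path takes the node lock and then the port lock. The sketch below is purely illustrative -- simplified lock and function names, not the actual TIPC code -- and shows how a spin_trylock_bh on the inner lock, as in Mark's experiment, breaks the cycle by backing out instead of spinning forever.]

/* Illustrative sketch only: simplified stand-ins for the TIPC locks,
 * not the real link_send()/link_wakeup_ports() code. */
#include <linux/spinlock.h>

static spinlock_t node_lock = SPIN_LOCK_UNLOCKED;   /* 2.6-era static initializer */
static spinlock_t port_lock = SPIN_LOCK_UNLOCKED;

/* CPU 0 path: port lock first, then node lock (the send/abort side). */
static void send_side(void)
{
	spin_lock_bh(&port_lock);
	spin_lock_bh(&node_lock);      /* spins forever if CPU 1 holds it and waits for port_lock */
	/* ... build and route the protocol message ... */
	spin_unlock_bh(&node_lock);
	spin_unlock_bh(&port_lock);
}

/* CPU 1 path: node lock first, then port lock (the receive side).
 * Using a trylock here and giving up breaks the ABBA cycle; the
 * pending ports are simply woken at the next message reception. */
static void receive_side(void)
{
	spin_lock_bh(&node_lock);
	if (spin_trylock_bh(&port_lock)) {
		/* ... wake up waiting ports ... */
		spin_unlock_bh(&port_lock);
	}
	spin_unlock_bh(&node_lock);
}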
From: Jon M. <jon...@er...> - 2004-05-12 19:10:00
Yeah, that's a classical one. I also think your solution is ok; the pending ports will be awakened at next message reception, so no harm done.

Thanks /jon

Mark Haverkamp wrote:

> Here is what looks to be happening with the spin lock deadlock. I
> replaced all the spin_lock_bh calls with a wrapper that tries to get the
> lock for a while then prints out a debug message if it can't get the
> lock.
>
> As an experiment, I changed the spin_lock_bh in link_wakeup_ports to a
> trylock and exited if it couldn't get the lock. I am now not able to
> get the deadlock.
>
> [...]
From: Mark H. <ma...@os...> - 2004-05-12 20:26:01
On Wed, 2004-05-12 at 12:09, Jon Maloy wrote:
> Yeah, that's a classical one. I also think your solution is ok; the
> pending ports will be awakened at next message reception, so
> no harm done.
>
> Thanks /jon

OK, here is what I plan to check in. Take a look. I fixed a few other things that seemed wrong. See comments by each diff.

Mark.

p.s. here is another problem I've been seeing: If I do management port access during congestion, the I/O never completes even when the congestion is over.

----------------------------------------
cvs diff -u media.c reg.c link.c
--------------------------------------------------

I think that this should be an unlock since it is locked a few lines above.

Index: media.c
===================================================================
RCS file: /cvsroot/tipc/source/unstable/net/tipc/media.c,v
retrieving revision 1.13
diff -u -r1.13 media.c
--- media.c	6 May 2004 15:35:31 -0000	1.13
+++ media.c	12 May 2004 20:10:30 -0000
@@ -221,7 +221,7 @@
 		bearer_schedule_unlocked(this, link);
 		res = 0;
 	}
-	spin_lock_bh(&this->publ.lock);
+	spin_unlock_bh(&this->publ.lock);
 	return res;
 }

---------------------------------------------------

Daniel found this one. ref_lock_deref can grab an invalid object pointer if the lock is released too soon. This waits until the ref number is changed, so it won't match in ref_lock_deref after it gets the lock.

Index: reg.c
===================================================================
RCS file: /cvsroot/tipc/source/unstable/net/tipc/reg.c,v
retrieving revision 1.8
diff -u -r1.8 reg.c
--- reg.c	5 May 2004 18:39:37 -0000	1.8
+++ reg.c	12 May 2004 20:10:31 -0000
@@ -167,7 +166,6 @@
 	assert(entry->object != 0);
 	assert(entry->data.reference == ref_nb);
 	entry->object = 0;
-	spin_unlock_bh(&entry->lock);
 	if (ref_table.first_free == 0)
 		ref_table.first_free = index;
 	else
@@ -175,6 +174,7 @@
 		ref_table.entries[ref_table.last_free].data.next_plus_upper |= index;
 	ref_table.last_free = index;
 	entry->data.next_plus_upper = (ref_nb & ~index_mask) + index_mask + 1;
+	spin_unlock_bh(&entry->lock);
 	write_unlock_bh(&ref_lock);
 }

------------------------------------------------

link_schedule_port took the global port_lock/port lock in a different order than everywhere else. Added the spin_trylock_bh in link_wakeup_ports. Added an unlock in link_recv_fragment on an error exit. Fixed a couple of places where checking the this pointer looked wrong.

Index: link.c
===================================================================
RCS file: /cvsroot/tipc/source/unstable/net/tipc/link.c,v
retrieving revision 1.24
diff -u -r1.24 link.c
--- link.c	7 May 2004 23:16:03 -0000	1.24
+++ link.c	12 May 2004 20:10:32 -0000
@@ -440,10 +440,14 @@
 int link_schedule_port(struct link *this, uint origport,uint sz)
 {
-	struct port *port = port_lock_deref(origport);
-	if (!port)
-		return TIPC_CONGESTION;
+	struct port *port;
+	spin_lock_bh(&port_lock);
+	port = port_lock_deref(origport);
+	if (!port) {
+		spin_unlock_bh(&port_lock);
+		return TIPC_CONGESTION;
+	}
 	if (!port->wakeup)
 		goto exit;
 	if (!list_empty(&port->wait_list))
@@ -453,8 +457,8 @@
 	port->waiting_pkts = 1 + sz/link_max_pkt(this);
 	list_add_tail(&port->wait_list, &this->waiting_ports);
 exit:
-	spin_unlock_bh(&port_lock);
 	spin_unlock_bh(port->publ.lock);
+	spin_unlock_bh(&port_lock);
 	return TIPC_CONGESTION;
 }

@@ -467,7 +471,8 @@
 		win = 100000;
 	if (win <= 0)
 		return;
-	spin_lock_bh(&port_lock);
+	if (!spin_trylock_bh(&port_lock))
+		return;
 	if (link_congested(this))
 		goto exit;
 	list_for_each_entry_safe(port, tp, &this->waiting_ports, wait_list) {
@@ -2365,6 +2370,7 @@
 	if (msg_size(imsg) > TIPC_MAX_MSG_SIZE + LONG_H_SIZE){
 		msg_print(CONS,fragm,"<REC<Oversized: ");
 		buf_discard(fbuf);
+		spin_unlock_bh(&this->owner->lock);
 		return;
 	}
 	buf = buf_acquire(msg_size(imsg));
@@ -2623,7 +2629,7 @@
 	struct link *this;
 	read_lock_bh(&net_lock);
 	this = link_find_by_name(name);
-	if (!this){
+	if (this){
 		this->exp_msg_count = LINK_BLOCKED;
 		spin_unlock_bh(&this->owner->lock);
 		res = TIPC_OK;
@@ -2640,7 +2646,7 @@
 	struct link *this;
 	read_lock_bh(&net_lock);
 	this = link_find_by_name(name);
-	if (!this){
+	if (this){
 		this->exp_msg_count = 0;
 		spin_unlock_bh(&this->owner->lock);
 		res = TIPC_OK;

--
Mark Haverkamp <ma...@os...>
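[Editor's note: the reg.c change above is about not dropping the per-entry lock until the entry has been invalidated. A stripped-down sketch of the idea follows; the structures and function names are hypothetical stand-ins, not the real TIPC reference table, and the entry lock is assumed to have been initialized with spin_lock_init() elsewhere.]

/* Illustrative sketch of the reg.c fix: keep the entry spinlock held
 * until the reference number has been made stale, so a concurrent
 * deref that was spinning on the lock sees a mismatch and fails
 * instead of returning a freed object. Hypothetical types only. */
#include <linux/spinlock.h>

struct my_ref_entry {
	spinlock_t lock;
	void *object;
	unsigned int reference;
};

static void *my_ref_deref(struct my_ref_entry *e, unsigned int ref)
{
	spin_lock_bh(&e->lock);
	if (e->object && e->reference == ref)
		return e->object;          /* caller now holds e->lock */
	spin_unlock_bh(&e->lock);
	return NULL;
}

static void my_ref_discard(struct my_ref_entry *e, unsigned int next_ref)
{
	spin_lock_bh(&e->lock);
	e->object = NULL;
	e->reference = next_ref;   /* invalidate BEFORE unlocking, so a waiter
				    * in my_ref_deref() cannot match the old ref */
	spin_unlock_bh(&e->lock);
}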
From: Jon M. <jon...@er...> - 2004-05-12 21:09:49
Hi,
All your corrections look ok. The (!this) bug was a little embarrassing...
/jon

Mark Haverkamp wrote:

> OK, here is what I plan to check in. Take a look. I fixed a few other
> things that seemed wrong. See comments by each diff.
>
> Mark.
>
> p.s. here is another problem I've been seeing: If I do management port
> access during congestion, the I/O never completes even when the
> congestion is over.

This may be related to message priorities. When the internal manager tries to respond to a message during congestion he cannot be blocked, as we can do with sockets/processes, so I set its priority to TIPC_NON_REJECTABLE. Is it possible that tipc_send2name() etc. still return TIPC_CONGESTION in such cases? This should not happen, but if the importance priority for some reason does not make it into the message, or is not handled properly along the data path, who knows?

Hint: Check the return values of tipc_send2XX() in mng_respond(), that may give a clue.

> ----------------------------------------
> cvs diff -u media.c reg.c link.c
> --------------------------------------------------
>
> I think that this should be an unlock since it is locked a few lines above.
>
> Index: media.c
> ===================================================================
> RCS file: /cvsroot/tipc/source/unstable/net/tipc/media.c,v
> retrieving revision 1.13
> diff -u -r1.13 media.c
> --- media.c	6 May 2004 15:35:31 -0000	1.13
> +++ media.c	12 May 2004 20:10:30 -0000
> @@ -221,7 +221,7 @@
>  		bearer_schedule_unlocked(this, link);
>  		res = 0;
>  	}
> -	spin_lock_bh(&this->publ.lock);
> +	spin_unlock_bh(&this->publ.lock);
>  	return res;
>  }

Definitely a bug.

> ---------------------------------------------------
>
> Daniel found this one. ref_lock_deref can grab an invalid object pointer
> if the lock is released too soon. This waits until the ref number is
> changed, so it won't match in ref_lock_deref after it gets the lock.
>
> Index: reg.c
> ===================================================================
> RCS file: /cvsroot/tipc/source/unstable/net/tipc/reg.c,v
> retrieving revision 1.8
> diff -u -r1.8 reg.c
> --- reg.c	5 May 2004 18:39:37 -0000	1.8
> +++ reg.c	12 May 2004 20:10:31 -0000
> @@ -167,7 +166,6 @@
>  	assert(entry->object != 0);
>  	assert(entry->data.reference == ref_nb);
>  	entry->object = 0;
> -	spin_unlock_bh(&entry->lock);
>  	if (ref_table.first_free == 0)
>  		ref_table.first_free = index;
>  	else
> @@ -175,6 +174,7 @@
>  		ref_table.entries[ref_table.last_free].data.next_plus_upper |= index;
>  	ref_table.last_free = index;
>  	entry->data.next_plus_upper = (ref_nb & ~index_mask) + index_mask + 1;
> +	spin_unlock_bh(&entry->lock);
>  	write_unlock_bh(&ref_lock);
>  }

Also a bug.

> ------------------------------------------------
>
> link_schedule_port took the global port_lock/port lock in a different
> order than everywhere else. Added the spin_trylock_bh in
> link_wakeup_ports. Added an unlock in link_recv_fragment on an error
> exit. Fixed a couple of places where checking the this pointer looked
> wrong.
>
> Index: link.c
> ===================================================================
> [...]
From: Jon M. <jon...@er...> - 2004-05-12 21:11:44
PS: There were more comments further down in my response, just in case you may have missed them.
/jon

Mark Haverkamp wrote:

> OK, here is what I plan to check in. Take a look. I fixed a few other
> things that seemed wrong. See comments by each diff.
>
> [...]
From: Mark H. <ma...@os...> - 2004-05-13 15:11:09
On Wed, 2004-05-12 at 13:25, Mark Haverkamp wrote:
> On Wed, 2004-05-12 at 12:09, Jon Maloy wrote:
> > Yeah, that's a classical one. I also think your solution is ok; the
> > pending ports will be awakened at next message reception, so
> > no harm done.
> >
> > Thanks /jon

I found another one. I got another hang yesterday after the current deadlock fix. I re-added my spinlock debug code and found out that we're getting a deadlock between the node lock and the tipc_port lock. It looks like the port timeout handler is running on one CPU and a recv_msg is running on the other. I suppose as a workaround, we could make all the spin lock access in wakeup be conditional, but that will probably just make the problem show up somewhere else. There should probably be an analysis of code paths to determine how the locks interact with each other. I have noticed that there is at least one place where three locks are required. This can cause problems like we've seen, when different code paths need multiple locks, unless there is some sort of back-off method to ensure no deadlocks.

For instance, if a function needs both the node lock and tipc_port lock, something like this could be done:

- - - - - -
again:
    spinlock(node_lock);
    spintrylock(tipc_port_lock)
    if failed to acquire, then release node_lock and goto again.
- - - - - - - -

Unfortunately, if one of the locks was acquired before the function was entered, releasing that lock may allow some state to change that the caller was assuming wouldn't change, and that could cause problems too. So I'm not sure that this method could work in the link_wakeup_ports function. The node lock is held on entry, and if the tipc_port lock can't be acquired, you probably can't just release the node lock and wait for it again, since that could invalidate the tipc_recv_msg function's assumptions about what the node lock was protecting.
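[Editor's note: a minimal C rendering of the back-off pattern Mark sketches above, with generic lock names rather than the real TIPC locks. It is subject to exactly the caveat he raises: the retry only works if the caller does not depend on state the outer lock was protecting across the drop-and-retake.]

/* Illustrative only: retry/back-off acquisition of two locks. */
#include <linux/spinlock.h>

static spinlock_t node_lock = SPIN_LOCK_UNLOCKED;       /* 2.6-era initializer */
static spinlock_t tipc_port_lock = SPIN_LOCK_UNLOCKED;

static void do_work_needing_both_locks(void)
{
again:
	spin_lock_bh(&node_lock);
	if (!spin_trylock_bh(&tipc_port_lock)) {
		/* Back off: drop the outer lock so the other CPU can make
		 * progress, then retry from the top. Anything the node lock
		 * was protecting may have changed by the time we reacquire it. */
		spin_unlock_bh(&node_lock);
		goto again;
	}

	/* ... work that needs both locks ... */

	spin_unlock_bh(&tipc_port_lock);
	spin_unlock_bh(&node_lock);
}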
Anyway, here is the raw trace from the debug:

&node->lock lock timeout
Call Trace:
 [<f8a8543a>] link_send+0xda/0x2a0 [tipc]
 [<f8a92c8e>] net_route_msg+0x41e/0x43d [tipc]
 [<f8a94962>] port_send_proto_msg+0x1a2/0x2a0 [tipc]
 [<f8aa1b10>] k_signal+0x60/0x170 [tipc]
 [<f8a956be>] port_timeout+0xbe/0x200 [tipc]
 [<f8aa1ccc>] timeout_receiver+0xac/0x110 [tipc]
 [<f8aa1a9f>] receive_signal+0x14f/0x160 [tipc]
 [<c011ab51>] scheduler_tick+0x111/0x690
 [<c0124ec3>] tasklet_action+0x73/0xc0
 [<c0124bc8>] __do_softirq+0xb8/0xc0
 [<c0124c05>] do_softirq+0x35/0x40
 [<c011526d>] smp_apic_timer_interrupt+0xdd/0x150
 [<c0105d62>] apic_timer_interrupt+0x1a/0x20
 [<c01119c0>] delay_pit+0x20/0x30
 [<c0221694>] __delay+0x14/0x20
 [<f8a8685f>] link_send_sections_long+0x50f/0xb30 [tipc]
 [<c01152d1>] smp_apic_timer_interrupt+0x141/0x150
 [<f8a8366f>] link_schedule_port+0x13f/0x1e0 [tipc]
 [<f8a86095>] link_send_sections_fast+0x5b5/0x870 [tipc]
 [<f8a97b92>] tipc_send+0x92/0x9d0 [tipc]
 [<c03f14cf>] schedule+0x37f/0x7a0
 [<f8aa3386>] recv_msg+0x2b6/0x560 [tipc]
 [<f8aa2d30>] send_packet+0x90/0x180 [tipc]
 [<c011b0d0>] default_wake_function+0x0/0x20
 [<c036c83e>] sock_sendmsg+0x8e/0xb0
 [<f8a8784e>] tipc_recv_msg+0x47e/0x8c0 [tipc]
 [<c01435ba>] buffered_rmqueue+0xfa/0x220
 [<c036c61a>] sockfd_lookup+0x1a/0x80
 [<c036dd61>] sys_sendto+0xe1/0x100
 [<c0128f62>] del_timer_sync+0x42/0x140
 [<c036d109>] sock_poll+0x29/0x30
 [<c017884b>] do_pollfd+0x5b/0xa0
 [<c036ddb6>] sys_send+0x36/0x40
 [<c036e60e>] sys_socketcall+0x12e/0x240
 [<c0105373>] syscall_call+0x7/0xb

port->publ lock timeout
Call Trace:
 [<f8a8383a>] link_wakeup_ports+0x12a/0x1d0 [tipc]
 [<f8a87bce>] tipc_recv_msg+0x7fe/0x8c0 [tipc]
 [<c014949d>] __kmalloc+0x19d/0x250
 [<f8aa5d59>] recv_msg+0x39/0x50 [tipc]
 [<c0375af2>] netif_receive_skb+0x172/0x1b0
 [<c0375bb4>] process_backlog+0x84/0x120
 [<c0375cd0>] net_rx_action+0x80/0x120
 [<c0124bc8>] __do_softirq+0xb8/0xc0
 [<c0124c05>] do_softirq+0x35/0x40
 [<c0107ce5>] do_IRQ+0x175/0x230
 [<c0105ea5>] nmi_stack_correct+0x1e/0x2e
 [<c0105ce0>] common_interrupt+0x18/0x20

--
Mark Haverkamp <ma...@os...>
From: Daniel M. <da...@os...> - 2004-05-13 16:12:44
On Thu, 2004-05-13 at 08:10, Mark Haverkamp wrote:
> I found another one. I got another hang yesterday after the current
> deadlock fix. I re-added my spinlock debug code and found out that
> we're getting a deadlock between the node lock and the tipc_port lock.
> It looks like the port timeout handler is running on one CPU and a
> recv_msg is running on the other. I suppose as a workaround, we could
> make all the spin lock access in wakeup be conditional, but that will
> probably just make the problem show up somewhere else. There should
> probably be an analysis of code paths and determine how the locks
> interact with each other.

I agree. The locking hierarchy should be documented. Even if this is just comments in a source file, it needs to be done. It should also include exactly what the lock is protecting.

> I have noticed that there is at least one place where three locks are
> required. This can cause problems like we've seen when different code
> paths need multiple locks unless there is some sort of back off method
> to insure no deadlocks.

We also need to analyze whether the multiple locks are actually giving us any performance and/or parallelism benefits. If we have to take multiple locks in a common path, that might be causing worse performance and more deadlock potential.

Daniel
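[Editor's note: the documentation Daniel asks for could be as small as a block comment in a header. The example below is hypothetical; the ordering shown is only inferred from this thread (receive path: node lock then port lock; link_schedule_port: global port_lock then per-port lock), not taken from the actual TIPC sources.]

/*
 * Hypothetical locking-hierarchy comment of the kind suggested above.
 * Ordering is inferred from this discussion only.
 *
 * Lock ordering (take top-down, release in reverse):
 *
 *   1. net_lock         (rwlock)   - protects node and bearer tables
 *   2. node->lock       (spinlock) - protects one node's links and queues
 *   3. port_lock        (spinlock) - protects the global port list
 *   4. port->publ.lock  (spinlock) - protects one port's state
 *
 * General principle (per Jon): avoid holding a node lock and a port
 * lock at the same time; the send path should drop the port lock
 * before routing a message into a link.
 */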
From: Jon M. <jon...@er...> - 2004-05-13 18:10:12
See below.
/jon

Daniel McNeil wrote:

> On Thu, 2004-05-13 at 08:10, Mark Haverkamp wrote:
> > [...] There should probably be an analysis of code paths and determine
> > how the locks interact with each other.
>
> I agree. The locking hierarchy should be documented. Even if this is
> just comments in a source file, it needs to be done. It should also
> include exactly what the lock is protecting.

Agree completely. I will try to do my part.

> > I have noticed that there is at least one place where three locks are
> > required. This can cause problems like we've seen when different code
> > paths need multiple locks unless there is some sort of back off method
> > to insure no deadlocks.
>
> We also need to analyze if the multiple locks are actually giving us
> any performance and/or parallelism benefits. If we have to take
> multiple locks in a common path, that might be causing worse
> performance and more deadlock potential.

We could experiment on that. The one-big-lock approach that we had earlier was heavily criticized, and caused a lot of problems, but so far measurements suggest it was more efficient when running the benchmark between two nodes. When we start to run real systems I am pretty convinced that the finer granularity will pay off, as we will have thousands of ports and dozens of nodes being accessed in parallel.

In the previous version we had seven lock/unlock operations per message transmitted:
Send side: 1x big_lock/unlock when the message was created and sent, plus 2x buffer locks when the buffer was allocated and released.
Receive side: ditto, plus 1x grab/release of big_lock when the message is read to user space.
I don't think it is possible to make it much better.

In the current version we have ten:
Send side: 1x net_lock (read) + 1x node lock + 2x buffer lock. (No tipc_port lock!)
Receive side: 1x net_lock (read) + 1x node lock + 1x tipc_port lock + 1x tipc_port lock when the message is read (in "advance_queue()") + 2x buffer lock.

I don't think an increase from seven to ten lock accesses adds any significant overhead to the data-phase path, while the advantages are obvious. The real problem is not performance, but potential deadlock situations in other parts of the code. Those can only be solved by avoiding, as far as possible, holding the different locks simultaneously, as I suggested in my previous mail. There may be one or two more cases I have missed, but if we find any, this is the solution we should look for.

Regards /Jon

> Daniel
From: Jon M. <jon...@er...> - 2004-05-13 17:27:47
Hi Mark,
The general principle I had in mind when designing this was to not keep a node lock and the port lock at the same time, but in some cases this looks unavoidable, such as in the wakeup function you corrected yesterday. In other cases I have just not been aware of the potential problem, even when it is easy to fix.

The current problem is in fact one of the latter type, as it is completely unnecessary to call port_send_protocol_msg() with the port lock held. What we need here is a small redesign to resolve the whole problem. Everywhere port_send_protocol_msg() is called (in port_timeout(), port_recv_protocol_msg() and a few other places) we should:

1) Build the message (we rename port_send_protocol_msg() to e.g. port_build_protocol_msg() and let it return an sk_buff*).
2) Update the port's sequence number, when applicable.
3) Release the port lock.
4) Send the buffer through net_route_msg().

There is one call where this is difficult, in port_abort_self(), since that is itself called with the lock held, but on the other hand it only sends a message to the port itself, not to any link, so no node lock will be taken.

This should resolve the deadlock problem between port and node locks once and for all, I hope. The rest is about proper documentation.

Regards /Jon

Mark Haverkamp wrote:

> I found another one. I got another hang yesterday after the current
> deadlock fix. I re-added my spinlock debug code and found out that
> we're getting a deadlock between the node lock and the tipc_port lock.
> It looks like the port timeout handler is running on one CPU and a
> recv_msg is running on the other.
>
> [...]
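[Editor's note: a sketch of the four-step restructuring Jon proposes above. All types and helper names are hypothetical stand-ins, not the actual TIPC port.c code; the point is only the ordering -- build the message and bump the sequence number under the port lock, drop the lock, then route the buffer (which may take a node lock) with no port lock held.]

/* Illustrative only: hypothetical names, not the real TIPC code. */
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct my_port {                       /* stand-in for struct port */
	spinlock_t lock;               /* assumed initialized elsewhere */
	u32 sent_seqno;
};

static struct sk_buff *my_build_protocol_msg(struct my_port *p)
{
	/* Placeholder for the proposed port_build_protocol_msg(). */
	return alloc_skb(64, GFP_ATOMIC);
}

static void my_route_msg(struct sk_buff *buf)
{
	/* Placeholder: the real code would hand the buffer to net_route_msg(),
	 * which may take a node lock. Here we just free it. */
	kfree_skb(buf);
}

static void my_port_timeout(struct my_port *p)
{
	struct sk_buff *buf;

	spin_lock_bh(&p->lock);
	buf = my_build_protocol_msg(p);   /* step 1: build under the port lock */
	p->sent_seqno++;                  /* step 2: update the sequence number */
	spin_unlock_bh(&p->lock);         /* step 3: release the port lock */

	if (buf)
		my_route_msg(buf);        /* step 4: route with no port lock held */
}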
From: Jon M. <jon...@er...> - 2004-05-13 19:46:45
Just to avoid misunderstandings: I am not working on the changes I suggested. I am struggling with trying to understand the usage of netdevice notifications, one of the issues Stephen H. called a showstopper in his review.

Btw, does anybody know if NETDEV_DOWN/UP means that the carrier disappeared/reappeared (I hope so), or does it simply mean that somebody did "ifconfig ethX down"? (I think so.) Is there any description of how this is supposed to be used?

/Jon

Jon Maloy wrote:

> Hi Mark,
> The general principle I had in mind when designing this was to not keep
> a node lock and the port lock at the same time, but in some cases this
> looks unavoidable, such as in the wakeup function you corrected
> yesterday.
>
> [...]
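[Editor's note: Jon's question can also be probed empirically. The sketch below is a standalone test module with hypothetical names, assuming the 2.6-era notifier signature where the callback's third argument is the struct net_device itself (later kernels wrap it in netdev_notifier_info). As generally understood, NETDEV_UP/NETDEV_DOWN track administrative state (dev_open()/dev_close(), i.e. "ifconfig ethX up/down"), while carrier transitions typically show up as NETDEV_CHANGE and can be checked with netif_carrier_ok().]

/* Minimal netdevice-notifier test module (illustrative only). */
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/notifier.h>

static int my_netdev_event(struct notifier_block *nb,
			   unsigned long event, void *ptr)
{
	struct net_device *dev = (struct net_device *)ptr;

	switch (event) {
	case NETDEV_UP:
		printk(KERN_INFO "netdev-test: %s administratively up\n", dev->name);
		break;
	case NETDEV_DOWN:
		printk(KERN_INFO "netdev-test: %s administratively down\n", dev->name);
		break;
	case NETDEV_CHANGE:
		printk(KERN_INFO "netdev-test: %s carrier %s\n", dev->name,
		       netif_carrier_ok(dev) ? "on" : "off");
		break;
	}
	return NOTIFY_DONE;
}

static struct notifier_block my_netdev_notifier = {
	.notifier_call = my_netdev_event,
};

static int __init my_init(void)
{
	return register_netdevice_notifier(&my_netdev_notifier);
}

static void __exit my_exit(void)
{
	unregister_netdevice_notifier(&my_netdev_notifier);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");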