From: Mark H. <ma...@os...> - 2004-06-07 16:58:34
I have code running on 4 nodes using multicast to distribute messages between the nodes. After some hours of sending and receiving, one or more of my nodes will hang. The last time, 3 of 4 machines were hung and I was able to get a dump from one of them.

This one seems to indicate that there may be a spin lock deadlock in buf_safe_discard. It shows up twice in this stack dump. It looks like the first buf_safe_discard gets interrupted while holding the lock. The second buf_safe_discard seems to be called from link_recv_proto_msg (the address pointed to in tipc_recv_msg is just after the call to link_recv_proto_msg).

SysRq : Show Regs

Pid: 1599, comm: event_server
EIP: 0060:[<f8e69cdf>] CPU: 0
EIP is at buf_safe_discard+0x6f/0x270 [tipc]
 EFLAGS: 00000246    Not tainted  (2.6.7-rc2)
EAX: ef329bf8 EBX: ef328f50 ECX: 0b6b03c9 EDX: 00000000
ESI: ef328f94 EDI: ef326f50 EBP: efb9db48 DS: 007b ES: 007b
CR0: 8005003b CR2: 4206f5e0 CR3: 35326000 CR4: 000006c0
 [<c01032d5>] show_regs+0x145/0x170
 [<c026b541>] __handle_sysrq+0x71/0x100
 [<c02824bc>] receive_chars+0x12c/0x280
 [<c02829c6>] serial8250_interrupt+0x176/0x1d0
 [<c010785b>] handle_IRQ_event+0x3b/0x70
 [<c0107cc1>] do_IRQ+0xe1/0x230
 [<c0105cd0>] common_interrupt+0x18/0x20
 [<f8e50398>] tipc_recv_msg+0x788/0x8a0 [tipc]
 [<f8e6e2f9>] recv_msg+0x39/0x50 [tipc]
 [<c037e052>] netif_receive_skb+0x172/0x1b0
 [<c037e114>] process_backlog+0x84/0x120
 [<c037e230>] net_rx_action+0x80/0x120
 [<c0126068>] __do_softirq+0xb8/0xc0
 [<c01260a5>] do_softirq+0x35/0x40
 [<f8e69d23>] buf_safe_discard+0xb3/0x270 [tipc]
 [<f8e66723>] nameseq_deliver+0x83/0x420 [tipc]
 [<f8e66ced>] bcast_port_recv+0x4d/0x80 [tipc]
 [<f8e67e65>] tipc_forward_buf2nameseq+0x1c5/0x270 [tipc]
 [<f8e681eb>] tipc_multicast+0x2db/0x4e0 [tipc]
 [<f8e6b20a>] send_msg+0x18a/0x210 [tipc]
 [<c03747ce>] sock_sendmsg+0x8e/0xb0
 [<c0375dc1>] sys_sendto+0xe1/0x100
 [<c03766ba>] sys_socketcall+0x17a/0x240
 [<c0105363>] syscall_call+0x7/0xb

-- 
Mark Haverkamp <ma...@os...>
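
To make the suspected sequence concrete, here is a minimal user-space sketch in C of the pattern described above: a non-recursive lock is taken in the send path, receive processing then runs on the same CPU before the lock is released, and it tries to take the same lock again. This only models the hypothesis; it is not TIPC code. Every name in it (toy_lock, toy_discard, toy_softirq_receive, deadlocks_detected) is hypothetical, and where a real spinlock would spin forever the toy version just counts the event.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a non-recursive spinlock on one CPU. */
struct toy_lock {
    bool held;
};

static struct toy_lock discard_lock;   /* plays the role of the lock in the discard path */
static int deadlocks_detected;         /* times a real spinlock would have hung */

/* Returns false in the situation where a real spin_lock() would spin forever. */
static bool toy_lock_acquire(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void toy_lock_release(struct toy_lock *l)
{
    l->held = false;
}

static void toy_discard(bool interrupted);

/* Models receive processing that ends up discarding a buffer itself. */
static void toy_softirq_receive(void)
{
    toy_discard(false);
}

/*
 * Models the discard routine. If 'interrupted' is true, receive
 * processing runs on the same CPU while the lock is still held.
 */
static void toy_discard(bool interrupted)
{
    if (!toy_lock_acquire(&discard_lock)) {
        /* Same CPU already holds the lock: the real code would hang here. */
        deadlocks_detected++;
        return;
    }
    if (interrupted)
        toy_softirq_receive();
    toy_lock_release(&discard_lock);
}
```

Calling toy_discard(true) records one would-be hang, because the nested call finds the lock already held by the same CPU; calling toy_discard(false) completes normally.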