Re: [SSI-devel] SSI-1.2.2-FC shm oops
From: Laura R. <lau...@hp...> - 2005-04-13 21:07:08
Hi Roger,

I got your other dump info. From looking at that info, can you dump the following:

kdb> md 0xd3730480
kdb> md 0xd8ad9000
kdb> md 0xd7edd100

laura

Roger Tsang wrote:

> kdb bt with function args... I am waiting for my serial cable, so if
> you need more just ask. I haven't rebooted that node yet.
>
> shm_close+0xb0 (0xd3730480, 0xd3730400, 0xe000, 0x0, 0xd3730500)
> exit_mmap+0x166 (0xd376c180, 0xd376c180, 0xd376c180)
> mmput+0x4c (0xd376c180, 0xd376f400, 0x7, 0x7, 0xd373e000)
> ....
>
> On 4/13/05, Roger Tsang <rog...@gm...> wrote:
>
>>I'm still getting the shm oops with the patch.
>>
>>kernel oops invalid operand
>>process httpd
>>shm_close 0xb0 [kernel]
>>exit_mmap 0x166
>>mmput 0x4c
>>do_exit 0xe9
>>do_group_exit 0x32
>>get_signal_to_deliver 0x2b4
>>restore_i387_fxsave 0xaf
>>do_signal 0x4f
>>restore_sigcontext 0x458
>>sys_sigreturn 0x106
>>signal_return 0x14
>>
>>
>>On 4/12/05, Laura Ramirez <lau...@hp...> wrote:
>>
>>>Hi Roger,
>>>
>>>I'll check in the fix then. This fix only deals with the shm structures
>>>so it shouldn't be related to the unixnm.c panic at all.
>>>
>>>laura
>>>
>>>Roger Tsang wrote:
>>>
>>>>Laura,
>>>>
>>>>Thanks. I've incorporated your patch into my kernel recompile last
>>>>night and the cluster seems to be running fine so far after two
>>>>failovers, once on each initnode. I hope this is not related to the
>>>>unixnm.c oops because I assumed that was due to kernel networking
>>>>options packet socket.
>>>>
>>>>-Roger
>>>>
>>>>On Apr 11, 2005 8:54 PM, Laura Ramirez <lau...@hp...> wrote:
>>>>
>>>>
>>>>>Hi Roger,
>>>>>
>>>>>Looking at the shm nodedown code, I saw some locking that didn't look
>>>>>right. I have attached a patch file with a fix. I don't know if this
>>>>>will fix your shm panic, but if you want to give it a try,
>>>>>please let me know how it goes.
>>>>>(use -p0 to apply patch)
>>>>>
>>>>>Also, is it possible to get a netdump image, if it does panic again,
>>>>>or if you get a kdb prompt dump the following:
>>>>>kdb> bt
>>>>>
>>>>>kdb> md shm_ids
>>>>>
>>>>>kdb> md cfs_shm_node_mnts
>>>>>
>>>>>thanks
>>>>>
>>>>>laura
>>>>>
>>>>>Roger Tsang wrote:
>>>>>
>>>>>
>>>>>>Hi, I'm using SSI-1.2.2-FC2 9smp with Lustre-1.2.4 patch. node2 is a
>>>>>>failover node, so I failed over to this node once and after a few
>>>>>>hours running as the only node in the cluster I got the following.
>>>>>>
>>>>>>-Roger
>>>>>>
>>>>>>Apr 11 19:25:42 node2 kernel: ------------[ cut here ]------------
>>>>>>Apr 11 19:25:42 node2 kernel: kernel BUG at shm.c:232!
>>>>>>Apr 11 19:25:42 node2 kernel: invalid operand: 0000
>>>>>>Apr 11 19:25:42 node2 kernel: ipt_REJECT ipt_multiport ipt_state
>>>>>>ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32
>>>>>>sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
>>>>>>Apr 11 19:25:42 node2 kernel: CPU: 0
>>>>>>Apr 11 19:25:42 node2 kernel: EIP: 0060:[<c01c9a90>] Not tainted
>>>>>>Apr 11 19:25:42 node2 kernel: EFLAGS: 00010246
>>>>>>Apr 11 19:25:42 node2 kernel:
>>>>>>Apr 11 19:25:42 node2 kernel: EIP is at shm_close [kernel] 0xb0
>>>>>>(2.4.22-1.2199.nptl_ssi_9smp)
>>>>>>Apr 11 19:25:42 node2 kernel: eax: d8ad9000 ebx: c05f0fb0 ecx:
>>>>>>c05f0fb0 edx: 00000000
>>>>>>Apr 11 19:25:42 node2 kernel: esi: 02000000 edi: bd311000 ebp:
>>>>>>d3fcde64 esp: d3fcde5c
>>>>>>Apr 11 19:25:42 node2 kernel: ds: 0068 es: 0068 ss: 0068
>>>>>>Apr 11 19:25:42 node2 kernel: Process httpd (pid: 132848, stackpage=d3fcd000)
>>>>>>Apr 11 19:25:42 node2 kernel: Call Trace:
>>>>>>Apr 11 19:25:42 node2 kernel: [<c0139c86>] exit_mmap [kernel] 0x166 (0xd3fcde68)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c011f3cc>] mmput [kernel] 0x4c (0xd3fcde90)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c0125319>] do_exit [kernel] 0xe9 (0xd3fcdea4)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c01256e2>] do_group_exit [kernel] 0x32
>>>>>>(0xd3fcdec4)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c012ecf4>] get_signal_to_deliver
>>>>>>[kernel] 0x2b4 (0xd3fcded8)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c01142ff>] restore_i387_fxsave
>>>>>>[kernel] 0xaf (0xd3fcdee8)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c010b91f>] do_signal [kernel] 0x4f (0xd3fcdf1c)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c0109a68>] restore_sigcontext [kernel]
>>>>>>0x458 (0xd3fcdf28)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c0109ed6>] sys_sigreturn [kernel]
>>>>>>0x106 (0xd3fcdf90)
>>>>>>Apr 11 19:25:42 node2 kernel: [<c010bb10>] signal_return [kernel] 0x14
>>>>>>(0xd3fcdfc0)
>>>>>>Apr 11 19:25:42 node2 kernel:
>>>>>>Apr 11 19:25:42 node2 kernel: Code: 0f 0b e8 00 ea db 38 c0 eb a5 8d
>>>>>>b6 00 00 00 00 a1 c4 0f 5f
>>>>>>Apr 11 19:27:17 node2 kernel: ------------[ cut here ]------------
>>>>>>Apr 11 19:27:17 node2 kernel: kernel BUG at shm.c:169!
>>>>>>Apr 11 19:27:17 node2 kernel: invalid operand: 0000
>>>>>>Apr 11 19:27:17 node2 kernel: ipt_REJECT ipt_multiport ipt_state
>>>>>>ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32
>>>>>>sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
>>>>>>Apr 11 19:27:17 node2 kernel: CPU: 0
>>>>>>Apr 11 19:27:17 node2 kernel: EIP: 0060:[<c01c9930>] Not tainted
>>>>>>Apr 11 19:27:17 node2 kernel: EFLAGS: 00010246
>>>>>>Apr 11 19:27:17 node2 kernel:
>>>>>>Apr 11 19:27:17 node2 kernel: EIP is at shm_open [kernel] 0x60
>>>>>>(2.4.22-1.2199.nptl_ssi_9smp)
>>>>>>Apr 11 19:27:17 node2 kernel: eax: d8ad9000 ebx: d40e3c80 ecx:
>>>>>>bf000000 edx: 00000000
>>>>>>Apr 11 19:27:17 node2 kernel: esi: d5d40600 edi: 00000000 ebp:
>>>>>>d4955ec4 esp: d4955ec4
>>>>>>Apr 11 19:27:17 node2 kernel: ds: 0068 es: 0068 ss: 0068
>>>>>>Apr 11 19:27:17 node2 kernel: Process httpd (pid: 132813, stackpage=d4955000)
>>>>>>Apr 11 19:27:17 node2 kernel: Call Trace:
>>>>>>Apr 11 19:27:17 node2 kernel: [<c011f8a9>] copy_mm [kernel] 0x389 (0xd4955ec8)
>>>>>>Apr 11 19:27:17 node2 kernel: [<c01201f9>] __copy_process [kernel]
>>>>>>0x399 (0xd4955f04)
>>>>>>Apr 11 19:27:17 node2 kernel: [<c0120a22>] __do_fork [kernel] 0x52 (0xd4955f4c)
>>>>>>Apr 11 19:27:17 node2 kernel: [<c0107e65>] sys_clone [kernel] 0x45 (0xd4955f9c)
>>>>>>Apr 11 19:27:17 node2 kernel: [<c010bad7>] system_call [kernel] 0x33
>>>>>>(0xd4955fc0)
>>>>>>Apr 11 19:27:17 node2 kernel:
>>>>>>Apr 11 19:27:17 node2 kernel: Code: 0f 0b a9 00 ea db 38 c0 eb d2 8d
>>>>>>b6 00 00 00 00 a1 c4 0f 5f
>>>>>>
>>>>>
>>>>>Index: ipc/shm.c
>>>>>===================================================================
>>>>>RCS file: /cvsroot/ssic-linux/openssi/kernel/ipc/shm.c,v
>>>>>retrieving revision 1.2.2.25
>>>>>diff -u -p -r1.2.2.25 shm.c
>>>>>--- ipc/shm.c 17 Dec 2004 22:21:13 -0000 1.2.2.25
>>>>>+++ ipc/shm.c 12 Apr 2005 00:28:05 -0000
>>>>>@@ -1235,12 +1235,14 @@ ipc_shm_nodedown(clusternode_t node)
>>>>>             }
>>>>>         }
>>>>>         else {
>>>>>+            int id = shp->id;
>>>>>+            ipc_get_locks(id, &shm_ids, 1);
>>>>>             shp->shm_flags |= SHM_DEST;
>>>>>             if (shp->shm_nattch == 0) {
>>>>>-                ipc_get_locks(shp->id, &shm_ids, 1);
>>>>>                 ssi_local_destroy(shp);
>>>>>-                up(&shm_ids.sem);
>>>>>+                id = 0;
>>>>>             }
>>>>>+            ipc_drop_locks(id, &shm_ids, 1);
>>>>>         }
>>>>>     }
>>>>> }
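
For readers following the patch, the locking change it makes can be sketched in user space. This is only an illustration, not OpenSSI code: `struct segment`, `SEG_DEST`, and `segment_nodedown` are hypothetical names, and a pthread mutex stands in for the locks taken by `ipc_get_locks`. The point it shows is the one the diff makes: the lock is taken before the destroy flag is set, so the flag update and the attach-count check happen under the same lock, and the lock is released on every path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SEG_DEST 0x1u   /* hypothetical flag, stands in for SHM_DEST */

/* Hypothetical stand-in for a shared memory segment descriptor. */
struct segment {
    pthread_mutex_t lock;   /* stands in for the ipc locks */
    unsigned flags;
    int nattch;             /* number of current attachments */
    bool destroyed;
};

/*
 * Mark a segment for destruction and destroy it if nothing is attached.
 * The lock is held across both the flag update and the attach-count
 * check, and is dropped on every path out of the function.
 * Returns true if the segment was destroyed.
 */
bool segment_nodedown(struct segment *seg)
{
    bool destroyed = false;

    assert(seg != NULL);

    pthread_mutex_lock(&seg->lock);
    seg->flags |= SEG_DEST;
    if (seg->nattch == 0) {
        seg->destroyed = true;   /* stands in for ssi_local_destroy() */
        destroyed = true;
    }
    pthread_mutex_unlock(&seg->lock);

    return destroyed;
}
```

In the patch itself, `id` is reset to 0 after `ssi_local_destroy(shp)` before `ipc_drop_locks` is called, presumably so that the per-segment lock of an already destroyed segment is not touched again. The sketch does not model that detail.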