Re: [SSI-devel] SSI-1.2.2-FC shm oops
From: Laura R. <lau...@hp...> - 2005-04-13 20:42:06
Hi Roger,

Is this only happening after a failover? Is it possible to get a netdump
image? If not can you dump the following info from kdb?

kdb> bt
kdb> md shm_ids
kdb> md cfs_shm_node_mnts

thanks
laura

Roger Tsang wrote:
> I'm still getting the shm oops with the patch.
>
> kernel oops invalid operand
> process httpd
> shm_close 0xb0 [kernel]
> exit_mmap 0x166
> mmput 0x4c
> do_exit 0xe9
> do_group_exit 0x32
> set_signal_to_deliver 0x2b4
> restore_i387_fxsave 0xaf
> do_signal 0x4f
> restore_sigcontext 0x458
> sys_sigreturn 0x106
> signal_return 0x14
>
>
> On 4/12/05, Laura Ramirez <lau...@hp...> wrote:
>
>>Hi Roger,
>>
>>I'll check in the fix then. This fix only deals with the shm structures
>>so it shouldnt be related to the unixnm.c panic at all.
>>
>>laura
>>
>>Roger Tsang wrote:
>>
>>>Laura,
>>>
>>>Thanks. I've incorporated your patch into my kernel recompile last
>>>night and the cluster seems to be running fine so far after two
>>>failovers, once on each initnode. I hope this is not related to the
>>>unixnm.c oops because I assumed that was due to kernel networking
>>>options packet socket.
>>>
>>>-Roger
>>>
>>>On Apr 11, 2005 8:54 PM, Laura Ramirez <lau...@hp...> wrote:
>>>
>>>
>>>>Hi Roger,
>>>>
>>>>Looking at the shm nodedown code, i saw some locking that didnt look
>>>>right. I have attached a patch file with a fix. I dont know if this
>>>>will fix your shm panic, but if you want to give it a try,
>>>>please let me know how it goes. (use -p0 to apply patch)
>>>>
>>>>Also, is it possible to get a netdump image, if it does panic again,
>>>>or if you get a kdb prompt dump the following:
>>>>kdb> bt
>>>>
>>>>kdb> md shm_ids
>>>>
>>>>kdb> md cfs_shm_node_mnts
>>>>
>>>>thanks
>>>>
>>>>laura
>>>>
>>>>Roger Tsang wrote:
>>>>
>>>>
>>>>>Hi I'm using SSI-1.2.2-FC2 9smp with Lustre-1.2.4 patch. node2 is a
>>>>>failover node, so I failed over to this node once and after a few
>>>>>hours running as the only node in the cluster I got the following.
>>>>>
>>>>>-Roger
>>>>>
>>>>>Apr 11 19:25:42 node2 kernel: ------------[ cut here ]------------
>>>>>Apr 11 19:25:42 node2 kernel: kernel BUG at shm.c:232!
>>>>>Apr 11 19:25:42 node2 kernel: invalid operand: 0000
>>>>>Apr 11 19:25:42 node2 kernel: ipt_REJECT ipt_multiport ipt_state
>>>>>ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32
>>>>>sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
>>>>>Apr 11 19:25:42 node2 kernel: CPU: 0
>>>>>Apr 11 19:25:42 node2 kernel: EIP: 0060:[<c01c9a90>] Not tainted
>>>>>Apr 11 19:25:42 node2 kernel: EFLAGS: 00010246
>>>>>Apr 11 19:25:42 node2 kernel:
>>>>>Apr 11 19:25:42 node2 kernel: EIP is at shm_close [kernel] 0xb0
>>>>>(2.4.22-1.2199.nptl_ssi_9smp)
>>>>>Apr 11 19:25:42 node2 kernel: eax: d8ad9000 ebx: c05f0fb0 ecx:
>>>>>c05f0fb0 edx: 00000000
>>>>>Apr 11 19:25:42 node2 kernel: esi: 02000000 edi: bd311000 ebp:
>>>>>d3fcde64 esp: d3fcde5c
>>>>>Apr 11 19:25:42 node2 kernel: ds: 0068 es: 0068 ss: 0068
>>>>>Apr 11 19:25:42 node2 kernel: Process httpd (pid: 132848, stackpage=d3fcd000)
>>>>>Apr 11 19:25:42 node2 kernel: Call Trace:
>>>>>Apr 11 19:25:42 node2 kernel: [<c0139c86>] exit_mmap [kernel] 0x166 (0xd3fcde68)
>>>>>Apr 11 19:25:42 node2 kernel: [<c011f3cc>] mmput [kernel] 0x4c (0xd3fcde90)
>>>>>Apr 11 19:25:42 node2 kernel: [<c0125319>] do_exit [kernel] 0xe9 (0xd3fcdea4)
>>>>>Apr 11 19:25:42 node2 kernel: [<c01256e2>] do_group_exit [kernel] 0x32
>>>>>(0xd3fcdec4)
>>>>>Apr 11 19:25:42 node2 kernel: [<c012ecf4>] get_signal_to_deliver
>>>>>[kernel] 0x2b4 (0xd3fcded8)
>>>>>Apr 11 19:25:42 node2 kernel: [<c01142ff>] restore_i387_fxsave
>>>>>[kernel] 0xaf (0xd3fcdee8)
>>>>>Apr 11 19:25:42 node2 kernel: [<c010b91f>] do_signal [kernel] 0x4f (0xd3fcdf1c)
>>>>>Apr 11 19:25:42 node2 kernel: [<c0109a68>] restore_sigcontext [kernel]
>>>>>0x458 (0xd3fcdf28)
>>>>>Apr 11 19:25:42 node2 kernel: [<c0109ed6>] sys_sigreturn [kernel]
>>>>>0x106 (0xd3fcdf90)
>>>>>Apr 11 19:25:42 node2 kernel: [<c010bb10>] signal_return [kernel] 0x14
>>>>>(0xd3fcdfc0)
>>>>>Apr 11 19:25:42 node2 kernel:
>>>>>Apr 11 19:25:42 node2 kernel: Code: 0f 0b e8 00 ea db 38 c0 eb a5 8d
>>>>>b6 00 00 00 00 a1 c4 0f 5f
>>>>>Apr 11 19:27:17 node2 kernel: ------------[ cut here ]------------
>>>>>Apr 11 19:27:17 node2 kernel: kernel BUG at shm.c:169!
>>>>>Apr 11 19:27:17 node2 kernel: invalid operand: 0000
>>>>>Apr 11 19:27:17 node2 kernel: ipt_REJECT ipt_multiport ipt_state
>>>>>ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32
>>>>>sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
>>>>>Apr 11 19:27:17 node2 kernel: CPU: 0
>>>>>Apr 11 19:27:17 node2 kernel: EIP: 0060:[<c01c9930>] Not tainted
>>>>>Apr 11 19:27:17 node2 kernel: EFLAGS: 00010246
>>>>>Apr 11 19:27:17 node2 kernel:
>>>>>Apr 11 19:27:17 node2 kernel: EIP is at shm_open [kernel] 0x60
>>>>>(2.4.22-1.2199.nptl_ssi_9smp)
>>>>>Apr 11 19:27:17 node2 kernel: eax: d8ad9000 ebx: d40e3c80 ecx:
>>>>>bf000000 edx: 00000000
>>>>>Apr 11 19:27:17 node2 kernel: esi: d5d40600 edi: 00000000 ebp:
>>>>>d4955ec4 esp: d4955ec4
>>>>>Apr 11 19:27:17 node2 kernel: ds: 0068 es: 0068 ss: 0068
>>>>>Apr 11 19:27:17 node2 kernel: Process httpd (pid: 132813, stackpage=d4955000)
>>>>>Apr 11 19:27:17 node2 kernel: Call Trace:
>>>>>Apr 11 19:27:17 node2 kernel: [<c011f8a9>] copy_mm [kernel] 0x389 (0xd4955ec8)
>>>>>Apr 11 19:27:17 node2 kernel: [<c01201f9>] __copy_process [kernel]
>>>>>0x399 (0xd4955f04)
>>>>>Apr 11 19:27:17 node2 kernel: [<c0120a22>] __do_fork [kernel] 0x52 (0xd4955f4c)
>>>>>Apr 11 19:27:17 node2 kernel: [<c0107e65>] sys_clone [kernel] 0x45 (0xd4955f9c)
>>>>>Apr 11 19:27:17 node2 kernel: [<c010bad7>] system_call [kernel] 0x33
>>>>>(0xd4955fc0)
>>>>>Apr 11 19:27:17 node2 kernel:
>>>>>Apr 11 19:27:17 node2 kernel: Code: 0f 0b a9 00 ea db 38 c0 eb d2 8d
>>>>>b6 00 00 00 00 a1 c4 0f 5f
>>>>>
>>>>>_______________________________________________
>>>>>ssic-linux-devel mailing list
>>>>>ssi...@li...
>>>>>https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
>>>>>
>>>>
>>>>Index: ipc/shm.c
>>>>===================================================================
>>>>RCS file: /cvsroot/ssic-linux/openssi/kernel/ipc/shm.c,v
>>>>retrieving revision 1.2.2.25
>>>>diff -u -p -r1.2.2.25 shm.c
>>>>--- ipc/shm.c 17 Dec 2004 22:21:13 -0000 1.2.2.25
>>>>+++ ipc/shm.c 12 Apr 2005 00:28:05 -0000
>>>>@@ -1235,12 +1235,14 @@ ipc_shm_nodedown(clusternode_t node)
>>>>                 }
>>>>             }
>>>>             else {
>>>>+                int id = shp->id;
>>>>+                ipc_get_locks(id, &shm_ids, 1);
>>>>                 shp->shm_flags |= SHM_DEST;
>>>>                 if (shp->shm_nattch == 0) {
>>>>-                    ipc_get_locks(shp->id, &shm_ids, 1);
>>>>                     ssi_local_destroy(shp);
>>>>-                    up(&shm_ids.sem);
>>>>+                    id = 0;
>>>>                 }
>>>>+                ipc_drop_locks(id, &shm_ids, 1);
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>
>
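
For anyone reading the patch above without the OpenSSI tree at hand, here is a rough user-space sketch of the ordering change it makes in ipc_shm_nodedown(): the lock is taken before SHM_DEST is set and before shm_nattch is tested, instead of only around the destroy call. Everything in it is a hypothetical stand-in (struct seg, SEG_DEST, seg_init, seg_nodedown and seg_detach are not OpenSSI names, and a pthread mutex plays the part of whatever ipc_get_locks()/ipc_drop_locks() actually take), so read it as an illustration of the pattern only, not as the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define SEG_DEST 0x1u /* hypothetical stand-in for SHM_DEST */

/* Hypothetical, simplified stand-in for a shared memory segment descriptor. */
struct seg {
    pthread_mutex_t lock; /* stands in for the locks taken by ipc_get_locks() */
    unsigned flags;
    int nattch;           /* number of current attaches */
    bool destroyed;
};

void seg_init(struct seg *s, int nattch)
{
    pthread_mutex_init(&s->lock, NULL);
    s->flags = 0;
    s->nattch = nattch;
    s->destroyed = false;
}

/*
 * Patched ordering: the flag update and the attach-count test happen
 * under the same lock as the destroy, so a concurrent detach cannot
 * slip in between them. Returns true if this call destroyed the segment.
 */
bool seg_nodedown(struct seg *s)
{
    bool did_destroy = false;

    pthread_mutex_lock(&s->lock);
    s->flags |= SEG_DEST;
    if (s->nattch == 0) {
        s->destroyed = true;
        did_destroy = true;
    }
    pthread_mutex_unlock(&s->lock);
    return did_destroy;
}

/*
 * Detach path, loosely modelled on what shm_close does: drop one attach,
 * and destroy the segment if it was the last one and the segment is
 * already marked for destruction.
 */
bool seg_detach(struct seg *s)
{
    bool did_destroy = false;

    pthread_mutex_lock(&s->lock);
    assert(s->nattch > 0); /* detaching an unattached segment is a bug */
    s->nattch--;
    if (s->nattch == 0 && (s->flags & SEG_DEST) && !s->destroyed) {
        s->destroyed = true;
        did_destroy = true;
    }
    pthread_mutex_unlock(&s->lock);
    return did_destroy;
}
```

With the pre-patch ordering, the flag write and the shm_nattch test ran outside the lock, so in principle a detach on another CPU could interleave with them. Whether that window is what triggers the BUG at shm.c:232 is not established in this thread, and Roger reports the oops still occurs with the patch applied.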