Re: [SSI-devel] SSI-1.2.2-FC shm oops
From: Roger T. <rog...@gm...> - 2005-04-13 20:42:47
kdb> md cfs_shm_node_mnts
0xc06121a0 00000000 0000000 d7bf4300 00000000
0xc06121b0 00000000 0000000 00000000 00000000
0xc06121c0 00000000 0000000 00000000 00000000
0xc06121d0 00000000 0000000 00000000 00000000
0xc06121e0 00000000 0000000 00000000 00000000
0xc06121f0 00000000 0000000 00000000 00000000
0xc0612200 00000000 0000000 00000000 00000000
0xc0612210 00000000 0000000 00000000 00000000

On 4/13/05, Roger Tsang <rog...@gm...> wrote:
> I'm still getting the shm oops with the patch.
>
> kernel oops invalid operand
> process httpd
> shm_close 0xb0 [kernel]
> exit_mmap 0x166
> mmput 0x4c
> do_exit 0xe9
> do_group_exit 0x32
> get_signal_to_deliver 0x2b4
> restore_i387_fxsave 0xaf
> do_signal 0x4f
> restore_sigcontext 0x458
> sys_sigreturn 0x106
> signal_return 0x14
>
>
> On 4/12/05, Laura Ramirez <lau...@hp...> wrote:
> > Hi Roger,
> >
> > I'll check in the fix then. This fix only deals with the shm structures
> > so it shouldn't be related to the unixnm.c panic at all.
> >
> > laura
> >
> > Roger Tsang wrote:
> > > Laura,
> > >
> > > Thanks. I incorporated your patch into my kernel recompile last
> > > night and the cluster seems to be running fine so far after two
> > > failovers, once on each initnode. I hope this is not related to the
> > > unixnm.c oops because I assumed that was due to the kernel networking
> > > option "packet socket".
> > >
> > > -Roger
> > >
> > > On Apr 11, 2005 8:54 PM, Laura Ramirez <lau...@hp...> wrote:
> > >
> > >>Hi Roger,
> > >>
> > >>Looking at the shm nodedown code, I saw some locking that didn't look
> > >>right. I have attached a patch file with a fix. I don't know if this
> > >>will fix your shm panic, but if you want to give it a try,
> > >>please let me know how it goes. (use -p0 to apply patch)
> > >>
> > >>Also, is it possible to get a netdump image if it does panic again?
> > >>Or, if you get a kdb prompt, dump the following:
> > >>kdb> bt
> > >>
> > >>kdb> md shm_ids
> > >>
> > >>kdb> md cfs_shm_node_mnts
> > >>
> > >>thanks
> > >>
> > >>laura
> > >>
> > >>Roger Tsang wrote:
> > >>
> > >>>Hi, I'm using SSI-1.2.2-FC2 9smp with the Lustre-1.2.4 patch. node2 is a
> > >>>failover node, so I failed over to this node once and after a few
> > >>>hours running as the only node in the cluster I got the following.
> > >>>
> > >>>-Roger
> > >>>
> > >>>Apr 11 19:25:42 node2 kernel: ------------[ cut here ]------------
> > >>>Apr 11 19:25:42 node2 kernel: kernel BUG at shm.c:232!
> > >>>Apr 11 19:25:42 node2 kernel: invalid operand: 0000
> > >>>Apr 11 19:25:42 node2 kernel: ipt_REJECT ipt_multiport ipt_state ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32 sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
> > >>>Apr 11 19:25:42 node2 kernel: CPU: 0
> > >>>Apr 11 19:25:42 node2 kernel: EIP: 0060:[<c01c9a90>] Not tainted
> > >>>Apr 11 19:25:42 node2 kernel: EFLAGS: 00010246
> > >>>Apr 11 19:25:42 node2 kernel:
> > >>>Apr 11 19:25:42 node2 kernel: EIP is at shm_close [kernel] 0xb0 (2.4.22-1.2199.nptl_ssi_9smp)
> > >>>Apr 11 19:25:42 node2 kernel: eax: d8ad9000 ebx: c05f0fb0 ecx: c05f0fb0 edx: 00000000
> > >>>Apr 11 19:25:42 node2 kernel: esi: 02000000 edi: bd311000 ebp: d3fcde64 esp: d3fcde5c
> > >>>Apr 11 19:25:42 node2 kernel: ds: 0068 es: 0068 ss: 0068
> > >>>Apr 11 19:25:42 node2 kernel: Process httpd (pid: 132848, stackpage=d3fcd000)
> > >>>Apr 11 19:25:42 node2 kernel: Call Trace:
> > >>>Apr 11 19:25:42 node2 kernel: [<c0139c86>] exit_mmap [kernel] 0x166 (0xd3fcde68)
> > >>>Apr 11 19:25:42 node2 kernel: [<c011f3cc>] mmput [kernel] 0x4c (0xd3fcde90)
> > >>>Apr 11 19:25:42 node2 kernel: [<c0125319>] do_exit [kernel] 0xe9 (0xd3fcdea4)
> > >>>Apr 11 19:25:42 node2 kernel: [<c01256e2>] do_group_exit [kernel] 0x32 (0xd3fcdec4)
> > >>>Apr 11 19:25:42 node2 kernel: [<c012ecf4>] get_signal_to_deliver [kernel] 0x2b4 (0xd3fcded8)
> > >>>Apr 11 19:25:42 node2 kernel: [<c01142ff>] restore_i387_fxsave [kernel] 0xaf (0xd3fcdee8)
> > >>>Apr 11 19:25:42 node2 kernel: [<c010b91f>] do_signal [kernel] 0x4f (0xd3fcdf1c)
> > >>>Apr 11 19:25:42 node2 kernel: [<c0109a68>] restore_sigcontext [kernel] 0x458 (0xd3fcdf28)
> > >>>Apr 11 19:25:42 node2 kernel: [<c0109ed6>] sys_sigreturn [kernel] 0x106 (0xd3fcdf90)
> > >>>Apr 11 19:25:42 node2 kernel: [<c010bb10>] signal_return [kernel] 0x14 (0xd3fcdfc0)
> > >>>Apr 11 19:25:42 node2 kernel:
> > >>>Apr 11 19:25:42 node2 kernel: Code: 0f 0b e8 00 ea db 38 c0 eb a5 8d b6 00 00 00 00 a1 c4 0f 5f
> > >>>Apr 11 19:27:17 node2 kernel: ------------[ cut here ]------------
> > >>>Apr 11 19:27:17 node2 kernel: kernel BUG at shm.c:169!
> > >>>Apr 11 19:27:17 node2 kernel: invalid operand: 0000
> > >>>Apr 11 19:27:17 node2 kernel: ipt_REJECT ipt_multiport ipt_state ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32 sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
> > >>>Apr 11 19:27:17 node2 kernel: CPU: 0
> > >>>Apr 11 19:27:17 node2 kernel: EIP: 0060:[<c01c9930>] Not tainted
> > >>>Apr 11 19:27:17 node2 kernel: EFLAGS: 00010246
> > >>>Apr 11 19:27:17 node2 kernel:
> > >>>Apr 11 19:27:17 node2 kernel: EIP is at shm_open [kernel] 0x60 (2.4.22-1.2199.nptl_ssi_9smp)
> > >>>Apr 11 19:27:17 node2 kernel: eax: d8ad9000 ebx: d40e3c80 ecx: bf000000 edx: 00000000
> > >>>Apr 11 19:27:17 node2 kernel: esi: d5d40600 edi: 00000000 ebp: d4955ec4 esp: d4955ec4
> > >>>Apr 11 19:27:17 node2 kernel: ds: 0068 es: 0068 ss: 0068
> > >>>Apr 11 19:27:17 node2 kernel: Process httpd (pid: 132813, stackpage=d4955000)
> > >>>Apr 11 19:27:17 node2 kernel: Call Trace:
> > >>>Apr 11 19:27:17 node2 kernel: [<c011f8a9>] copy_mm [kernel] 0x389 (0xd4955ec8)
> > >>>Apr 11 19:27:17 node2 kernel: [<c01201f9>] __copy_process [kernel] 0x399 (0xd4955f04)
> > >>>Apr 11 19:27:17 node2 kernel: [<c0120a22>] __do_fork [kernel] 0x52 (0xd4955f4c)
> > >>>Apr 11 19:27:17 node2 kernel: [<c0107e65>] sys_clone [kernel] 0x45 (0xd4955f9c)
> > >>>Apr 11 19:27:17 node2 kernel: [<c010bad7>] system_call [kernel] 0x33 (0xd4955fc0)
> > >>>Apr 11 19:27:17 node2 kernel:
> > >>>Apr 11 19:27:17 node2 kernel: Code: 0f 0b a9 00 ea db 38 c0 eb d2 8d b6 00 00 00 00 a1 c4 0f 5f
> > >>>
> > >>>_______________________________________________
> > >>>ssic-linux-devel mailing list
> > >>>ssi...@li...
> > >>>https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
> > >>>
> > >>
> > >>Index: ipc/shm.c
> > >>===================================================================
> > >>RCS file: /cvsroot/ssic-linux/openssi/kernel/ipc/shm.c,v
> > >>retrieving revision 1.2.2.25
> > >>diff -u -p -r1.2.2.25 shm.c
> > >>--- ipc/shm.c 17 Dec 2004 22:21:13 -0000 1.2.2.25
> > >>+++ ipc/shm.c 12 Apr 2005 00:28:05 -0000
> > >>@@ -1235,12 +1235,14 @@ ipc_shm_nodedown(clusternode_t node)
> > >>                 }
> > >>             }
> > >>             else {
> > >>+                int id = shp->id;
> > >>+                ipc_get_locks(id, &shm_ids, 1);
> > >>                 shp->shm_flags |= SHM_DEST;
> > >>                 if (shp->shm_nattch == 0) {
> > >>-                    ipc_get_locks(shp->id, &shm_ids, 1);
> > >>                     ssi_local_destroy(shp);
> > >>-                    up(&shm_ids.sem);
> > >>+                    id = 0;
> > >>                 }
> > >>+                ipc_drop_locks(id, &shm_ids, 1);
> > >>             }
> > >>         }
> > >>     }