Re: [SSI-devel] SSI-1.2.2-FC shm oops
From: Roger T. <rog...@gm...> - 2005-04-13 20:50:25
kdb bt with function args... I am waiting for my serial cable, so if
you need more just ask. I haven't rebooted that node yet.

shm_close+0xb0 (0xd3730480, 0xd3730400, 0xe000, 0x0, 0xd3730500)
exit_mmap+0x166 (0xd376c180, 0xd376c180, 0xd376c180)
mmput+0x4c (0xd376c180, 0xd376f400, 0x7, 0x7, 0xd373e000)
....

On 4/13/05, Roger Tsang <rog...@gm...> wrote:
> I'm still getting the shm oops with the patch.
>
> kernel oops invalid operand
> process httpd
> shm_close 0xb0 [kernel]
> exit_mmap 0x166
> mmput 0x4c
> do_exit 0xe9
> do_group_exit 0x32
> get_signal_to_deliver 0x2b4
> restore_i387_fxsave 0xaf
> do_signal 0x4f
> restore_sigcontext 0x458
> sys_sigreturn 0x106
> signal_return 0x14
>
>
> On 4/12/05, Laura Ramirez <lau...@hp...> wrote:
> > Hi Roger,
> >
> > I'll check in the fix then. This fix only deals with the shm structures
> > so it shouldn't be related to the unixnm.c panic at all.
> >
> > laura
> >
> > Roger Tsang wrote:
> > > Laura,
> > >
> > > Thanks. I've incorporated your patch into my kernel recompile last
> > > night and the cluster seems to be running fine so far after two
> > > failovers, once on each initnode. I hope this is not related to the
> > > unixnm.c oops because I assumed that was due to kernel networking
> > > options packet socket.
> > >
> > > -Roger
> > >
> > > On Apr 11, 2005 8:54 PM, Laura Ramirez <lau...@hp...> wrote:
> > >
> > >>Hi Roger,
> > >>
> > >>Looking at the shm nodedown code, I saw some locking that didn't look
> > >>right. I have attached a patch file with a fix. I don't know if this
> > >>will fix your shm panic, but if you want to give it a try,
> > >>please let me know how it goes.
> > >>(use -p0 to apply patch)
> > >>
> > >>Also, is it possible to get a netdump image, if it does panic again,
> > >>or if you get a kdb prompt dump the following:
> > >>kdb> bt
> > >>
> > >>kdb> md shm_ids
> > >>
> > >>kdb> md cfs_shm_node_mnts
> > >>
> > >>thanks
> > >>
> > >>laura
> > >>
> > >>Roger Tsang wrote:
> > >>
> > >>>Hi I'm using SSI-1.2.2-FC2 9smp with Lustre-1.2.4 patch. node2 is a
> > >>>failover node, so I failed over to this node once and after a few
> > >>>hours running as the only node in the cluster I got the following.
> > >>>
> > >>>-Roger
> > >>>
> > >>>Apr 11 19:25:42 node2 kernel: ------------[ cut here ]------------
> > >>>Apr 11 19:25:42 node2 kernel: kernel BUG at shm.c:232!
> > >>>Apr 11 19:25:42 node2 kernel: invalid operand: 0000
> > >>>Apr 11 19:25:42 node2 kernel: ipt_REJECT ipt_multiport ipt_state ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32 sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
> > >>>Apr 11 19:25:42 node2 kernel: CPU: 0
> > >>>Apr 11 19:25:42 node2 kernel: EIP: 0060:[<c01c9a90>] Not tainted
> > >>>Apr 11 19:25:42 node2 kernel: EFLAGS: 00010246
> > >>>Apr 11 19:25:42 node2 kernel:
> > >>>Apr 11 19:25:42 node2 kernel: EIP is at shm_close [kernel] 0xb0 (2.4.22-1.2199.nptl_ssi_9smp)
> > >>>Apr 11 19:25:42 node2 kernel: eax: d8ad9000 ebx: c05f0fb0 ecx: c05f0fb0 edx: 00000000
> > >>>Apr 11 19:25:42 node2 kernel: esi: 02000000 edi: bd311000 ebp: d3fcde64 esp: d3fcde5c
> > >>>Apr 11 19:25:42 node2 kernel: ds: 0068 es: 0068 ss: 0068
> > >>>Apr 11 19:25:42 node2 kernel: Process httpd (pid: 132848, stackpage=d3fcd000)
> > >>>Apr 11 19:25:42 node2 kernel: Call Trace:
> > >>>Apr 11 19:25:42 node2 kernel: [<c0139c86>] exit_mmap [kernel] 0x166 (0xd3fcde68)
> > >>>Apr 11 19:25:42 node2 kernel: [<c011f3cc>] mmput [kernel] 0x4c (0xd3fcde90)
> > >>>Apr 11 19:25:42 node2 kernel: [<c0125319>] do_exit [kernel] 0xe9 (0xd3fcdea4)
> > >>>Apr 11 19:25:42 node2 kernel: [<c01256e2>] do_group_exit [kernel] 0x32 (0xd3fcdec4)
> > >>>Apr 11 19:25:42 node2 kernel: [<c012ecf4>] get_signal_to_deliver [kernel] 0x2b4 (0xd3fcded8)
> > >>>Apr 11 19:25:42 node2 kernel: [<c01142ff>] restore_i387_fxsave [kernel] 0xaf (0xd3fcdee8)
> > >>>Apr 11 19:25:42 node2 kernel: [<c010b91f>] do_signal [kernel] 0x4f (0xd3fcdf1c)
> > >>>Apr 11 19:25:42 node2 kernel: [<c0109a68>] restore_sigcontext [kernel] 0x458 (0xd3fcdf28)
> > >>>Apr 11 19:25:42 node2 kernel: [<c0109ed6>] sys_sigreturn [kernel] 0x106 (0xd3fcdf90)
> > >>>Apr 11 19:25:42 node2 kernel: [<c010bb10>] signal_return [kernel] 0x14 (0xd3fcdfc0)
> > >>>Apr 11 19:25:42 node2 kernel:
> > >>>Apr 11 19:25:42 node2 kernel: Code: 0f 0b e8 00 ea db 38 c0 eb a5 8d b6 00 00 00 00 a1 c4 0f 5f
> > >>>Apr 11 19:27:17 node2 kernel: ------------[ cut here ]------------
> > >>>Apr 11 19:27:17 node2 kernel: kernel BUG at shm.c:169!
> > >>>Apr 11 19:27:17 node2 kernel: invalid operand: 0000
> > >>>Apr 11 19:27:17 node2 kernel: ipt_REJECT ipt_multiport ipt_state ip_conntrack ipt_TCPMSS iptable_filter ip_tables loop nfsd cls_u32 sch_sfq sch_htb tun microcode ide-cd sr_mod cdrom floppy
> > >>>Apr 11 19:27:17 node2 kernel: CPU: 0
> > >>>Apr 11 19:27:17 node2 kernel: EIP: 0060:[<c01c9930>] Not tainted
> > >>>Apr 11 19:27:17 node2 kernel: EFLAGS: 00010246
> > >>>Apr 11 19:27:17 node2 kernel:
> > >>>Apr 11 19:27:17 node2 kernel: EIP is at shm_open [kernel] 0x60 (2.4.22-1.2199.nptl_ssi_9smp)
> > >>>Apr 11 19:27:17 node2 kernel: eax: d8ad9000 ebx: d40e3c80 ecx: bf000000 edx: 00000000
> > >>>Apr 11 19:27:17 node2 kernel: esi: d5d40600 edi: 00000000 ebp: d4955ec4 esp: d4955ec4
> > >>>Apr 11 19:27:17 node2 kernel: ds: 0068 es: 0068 ss: 0068
> > >>>Apr 11 19:27:17 node2 kernel: Process httpd (pid: 132813, stackpage=d4955000)
> > >>>Apr 11 19:27:17 node2 kernel: Call Trace:
> > >>>Apr 11 19:27:17 node2 kernel: [<c011f8a9>] copy_mm [kernel] 0x389 (0xd4955ec8)
> > >>>Apr 11 19:27:17 node2 kernel: [<c01201f9>] __copy_process [kernel] 0x399 (0xd4955f04)
> > >>>Apr 11 19:27:17 node2 kernel: [<c0120a22>] __do_fork [kernel] 0x52 (0xd4955f4c)
> > >>>Apr 11 19:27:17 node2 kernel: [<c0107e65>] sys_clone [kernel] 0x45 (0xd4955f9c)
> > >>>Apr 11 19:27:17 node2 kernel: [<c010bad7>] system_call [kernel] 0x33 (0xd4955fc0)
> > >>>Apr 11 19:27:17 node2 kernel:
> > >>>Apr 11 19:27:17 node2 kernel: Code: 0f 0b a9 00 ea db 38 c0 eb d2 8d b6 00 00 00 00 a1 c4 0f 5f
> > >>>
> > >>>
> > >>>_______________________________________________
> > >>>ssic-linux-devel mailing list
> > >>>ssi...@li...
> > >>>https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
> > >>>
> > >>>
> > >>
> > >>
> > >>Index: ipc/shm.c
> > >>===================================================================
> > >>RCS file: /cvsroot/ssic-linux/openssi/kernel/ipc/shm.c,v
> > >>retrieving revision 1.2.2.25
> > >>diff -u -p -r1.2.2.25 shm.c
> > >>--- ipc/shm.c 17 Dec 2004 22:21:13 -0000 1.2.2.25
> > >>+++ ipc/shm.c 12 Apr 2005 00:28:05 -0000
> > >>@@ -1235,12 +1235,14 @@ ipc_shm_nodedown(clusternode_t node)
> > >>                 }
> > >>             }
> > >>             else {
> > >>+                int id = shp->id;
> > >>+                ipc_get_locks(id, &shm_ids, 1);
> > >>                 shp->shm_flags |= SHM_DEST;
> > >>                 if (shp->shm_nattch == 0) {
> > >>-                    ipc_get_locks(shp->id, &shm_ids, 1);
> > >>                     ssi_local_destroy(shp);
> > >>-                    up(&shm_ids.sem);
> > >>+                    id = 0;
> > >>                 }
> > >>+                ipc_drop_locks(id, &shm_ids, 1);
> > >>             }
> > >>         }
> > >>     }
> > >>
> > >>
> > >>
> > >
> > >
> > >
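
For readers following the patch quoted above: the change moves the lock
acquisition ahead of the SHM_DEST update and the shm_nattch test, so both
happen while the lock is held. Below is a minimal user-space sketch of that
ordering only, not the OpenSSI code. It assumes a pthread mutex in place of
ipc_get_locks()/ipc_drop_locks(); struct seg, nodedown_mark() and the
destroyed field are hypothetical names made up for illustration, and
SHM_DEST is defined locally for the sketch.

```c
#include <pthread.h>

/* Local constant for this sketch; named after the flag used in the patch. */
#define SHM_DEST 01000

/* Hypothetical stand-in for the kernel's per-segment structure. */
struct seg {
    pthread_mutex_t lock;   /* stands in for the locks taken by ipc_get_locks() */
    int flags;
    int nattch;             /* number of current attaches */
    int destroyed;          /* set where the patch calls ssi_local_destroy() */
};

/*
 * Same ordering as the patched branch: take the lock first, then set the
 * flag and test the attach count, then drop the lock on every path.
 * Returns 1 if the segment was torn down, 0 if it is only marked.
 */
static int nodedown_mark(struct seg *s)
{
    int torn_down = 0;

    pthread_mutex_lock(&s->lock);
    s->flags |= SHM_DEST;
    if (s->nattch == 0) {
        s->destroyed = 1;
        torn_down = 1;
    }
    pthread_mutex_unlock(&s->lock);

    return torn_down;
}
```

Calling nodedown_mark() on a segment with no attaches tears it down at once;
a segment that is still attached is only flagged, and in both cases the flag
update and the count test are done under the lock.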