Re: [SSI] Kernel oops when 2nd node boots
From: Sven-Olof K. <sve...@hp...> - 2002-08-25 12:14:14
Hi all,

I have tried a few things, but I still have the problem that node-2 and node-3 get a kernel oops during boot if an additional filesystem is mounted.

Here is what I have tried so far:

- I used the config.la configuration file. I only had to add two drivers for the ethernet cards I needed (CONFIG_NE2K_PCI=m and CONFIG_SIS900=m). This is the only difference.
- I'm no longer using devpts.
- I'm no longer using "mount --bind".
- I built a kernel without the patch (bind.patch) David Zafman sent, just in case there was a problem with the patch.
- I have upgraded to the latest RPM packages for RedHat 7.2 (gcc was among them) and recompiled the kernel.
- I have mounted the additional filesystem with cfs_mount instead of mount. As Aneesh pointed out, I'm using ext2 filesystems so this should not matter, but I tried it anyway.

None of this has made any difference; I still see the problem.

Here are some details about the oops.

If I boot the master node, then node-2 and node-3, with only the root filesystem mounted, I don't see the problem. With all nodes up I can mount the filesystem on the master node (mount /dev/hda3 /mnt) and it's visible and accessible on all nodes. The kernel oops only happens when node-2 or node-3 boots (I have a 3-node cluster) and /mnt had previously been mounted on the master node.

The kernel oopses during the second run of /usr/sbin/cmount in /etc/rc.d/rc.sysinit.nodeup. cmount is called three times in this script. The first time cmount runs there is no oops. The oops always occurs during the second cmount. (It did oops on the first run of cmount when I used "mount --bind" and did not have David Zafman's patch. I don't use "mount --bind" any longer.)

The oops always happens at cfs_get_uniqueid+0x2a and the traceback looks the same. This is how it looks on the console and in kdb (I have added a few echo statements to /etc/rc.d/rc.sysinit.nodeup).

...
/etc/rc.d/nodeup 3 running
DEBUG: now running /etc/rc.d/rc.sysinit.nodeup
Welcome to Red Hat Linux
DEBUG: about to run /usr/sbin/cmount 1st time
Unmounting initrd: [ OK ]
Configuring kernel parameters: [ OK ]
Setting clock (utc): Sun Aug 25 00:55:04 CEST 2002 [ OK ]
Activating swap partitions: [ OK ]
Setting hostname host12.net1.home: [ OK ]
DEBUG: about to run /usr/sbin/cmount 2nd time
Unable to handle kernel paging request at virtual address cf150454
*pde = 00000000

Entering kdb (current=0xc4784000, pid 196726) on processor 0 Oops: Oops
due to oops @ 0xc01d7a4a
eax = 0xcf1503cc ebx = 0xc114ede0 ecx = 0xc114ede0 edx = 0x00000003
esi = 0xc03bc71c edi = 0xc475a000 esp = 0xc4785e3c eip = 0xc01d7a4a
ebp = 0xc4785e3c xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010246
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc4785e08

[0]kdb> id 0xc01d7a4a
0xc01d7a4a cfs_get_uniqueid+0x2a: mov 0x88(%eax),%eax
0xc01d7a50 cfs_get_uniqueid+0x30: pop %ebp
0xc01d7a51 cfs_get_uniqueid+0x31: ret
0xc01d7a52 cfs_get_uniqueid+0x32: lea 0x0(%esi,1),%esi
0xc01d7a59 cfs_get_uniqueid+0x39: lea 0x0(%edi,1),%edi
....

[0]kdb> bt
EBP        EIP        Function(args)
0xc4785e3c 0xc01d7a4a cfs_get_uniqueid+0x2a (0xc114ede0, 0xc3945000, 0xc4785eb4)
                      kernel .text 0xc0100000 0xc01d7a20 0xc01d7a60
0xc4785e58 0xc0144887 do_kern_mount+0x167 (0xc3d60000, 0x40000000, 0xc3cf4000, 0xc3945000, 0xc3cf4000)
                      kernel .text 0xc0100000 0xc0144720 0xc01448c0
0xc4785e8c 0xc0156b78 do_add_mount+0x48 (0xc4785eb4, 0xc3d60000, 0x40000000, 0x0, 0xc3cf4000)
                      kernel .text 0xc0100000 0xc0156b30 0xc0156c40
0xc4785ee0 0xc0156e38 do_mount+0x138
                      kernel .text 0xc0100000 0xc0156d00 0xc0156e60
0xc4785f5c 0xc020fdc7 ssisys_discover_mounts+0x1c7
                      kernel .text 0xc0100000 0xc020fc00 0xc020fe40

[0]kdb> ps
Task Addr  Pid      Parent   [*] cpu State Thread     Command
...
0xc3b78000 00066367 00000001  0  000 stop  0xc3b78370 rc.nodeup
0xc3ae6000 00196675 00066367  0  000 stop  0xc3ae6370 rc.sysinit.node
0xc3986000 00196695 00000001  0  000 stop  0xc3986370 minilogd
0xc4784000 00196726 00196675  1  000 run   0xc4784370*cmount
[0]kdb>

Any ideas on how I can proceed to find the cause?

Aneesh suggested trying the latest code from CVS. I have not done that yet, but I think I will give it a try.

Thanks,
Sven-Olof
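
A note on reading the dump above: the numbers in it are consistent with each other, and a quick check makes that explicit. The small Python sketch below only redoes arithmetic on values copied from the oops message, the register dump, and the bt output; the variable names are mine and purely illustrative. It shows that the faulting address equals eax plus the 0x88 displacement in the instruction at the faulting eip, and that the eip lies 0x2a bytes into cfs_get_uniqueid. It does not say which structure eax was supposed to point to.

```python
# Values copied from the console and kdb output in this message.
fault_address = 0xcf150454   # "Unable to handle kernel paging request at virtual address cf150454"
eax = 0xcf1503cc             # eax in the register dump
displacement = 0x88          # from the faulting instruction: mov 0x88(%eax),%eax
eip = 0xc01d7a4a             # eip in the register dump
func_start = 0xc01d7a20      # start of cfs_get_uniqueid, from the bt output

# Address the faulting instruction tried to read.
effective_address = eax + displacement

# Offset of the faulting instruction inside cfs_get_uniqueid.
offset_in_function = eip - func_start

print("effective address:", hex(effective_address))
print("matches fault address:", effective_address == fault_address)
print("offset in function:", hex(offset_in_function))
```

Since *pde is 00000000 for that address, my reading is that the pointer loaded into eax is not a valid mapped kernel address at that point, so cfs_get_uniqueid is probably dereferencing a bad or uninitialised pointer at offset 0x88. That is an interpretation of the dump, not something I have confirmed in the source.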