Thread: [SSI] Kernel oops when 2nd node boots
From: Sven-Olof K. <sve...@hp...> - 2002-08-21 18:01:58
|
Hi,

I have installed SSI 0.7.0 on a PC and created a 3-node cluster. I'm using CFS
and the root filesystem is on an IDE disk. The master node boots without
problems, but the second and third nodes get an oops during boot when
cmount runs. Both node-2 and node-3 panic the same way. The oops happens
at cfs_get_uniqueid+0x2a.

I found that if I do "umount /etc/sysconfig/network-scripts-1" on the master
node before attempting to boot the other nodes, I don't get the oops when node-2
and node-3 boot. Instead they boot and join the cluster.

Here are some details about the oops.

Console messages during boot and kdb output:

....
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 1552k freed
VFS: Mounted root (ext2 filesystem).
Freeing unused kernel memory: 208k freed
Note: unable to open serial console.
Mounting /proc
Loading 8390 module
Loading ne2k-pci module
ne2k-pci.c:v1.02 10/19/2000 D. Becker/P. Gortmaker
http://www.scyld.com/network/ne2k-pci.html
eth0: RealTek RTL-8029 found at 0xfce0, IRQ 9, 00:00:E8:88:52:E5.
Gathering cluster info
Configuring cluster
Running pre-root cluster initialization
RTNL: assertion failed at devinet.c(805):inetdev_event
RTNL: assertion failed at devinet.c(805):inetdev_event
spawn_daemon_thread: Truncated daemon name ics_llunack_daemon
spawn_daemon_thread: Truncated daemon name ics_probe_clms_daemon
Searching for an existing root node...
Found node 1 as the root node.
spawn_daemon_proc: Truncated daemon name nm cli nd daemon
spawn_daemon_proc: Truncated daemon name nm cli send daemon
ipcnameserver ready completed

This is a NonStop Clusters kernel.
This Cluster Node: 3
Cluster Master Node(s): 1:192.168.1.2

Name server registered with clms
ipcname_read completed
spawn_daemon_proc: Truncated daemon name CFS delrel daemon
Mounting root in linuxrc
Unmounting /proc
Attempting pivot_root
Running post-root cluster initialization
Starting init
/etc/rc.d/nodeup 3 running
Welcome to Red Hat Linux
Unable to handle kernel paging request at virtual address ce17ea54
*pde = 00000000

Entering kdb (current=0xc394a000, pid 196687) Oops: Oops
due to oops @ 0xc01abd5a
eax = 0xce17e9cc ebx = 0xc467a480 ecx = 0xc467a480 edx = 0x00000003
esi = 0xc033349c edi = 0xc11d8000 esp = 0xc394be3c eip = 0xc01abd5a
ebp = 0xc394be3c xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010246
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc394be08
kdb> bt
EBP        EIP        Function(args)
0xc394be3c 0xc01abd5a cfs_get_uniqueid+0x2a (0xc467a480, 0xc3906000, 0xc394beb4)
                      kernel .text 0xc0100000 0xc01abd30 0xc01abd70
0xc394be58 0xc0138f96 do_kern_mount+0x166 (0xc3905000, 0x40000000, 0xc3904000, 0xc3906000, 0xc3904000)
                      kernel .text 0xc0100000 0xc0138e30 0xc0138fd0
0xc394be8c 0xc01499e8 do_add_mount+0x48 (0xc394beb4, 0xc3905000, 0x40000000, 0x0, 0xc3904000)
                      kernel .text 0xc0100000 0xc01499a0 0xc0149ab0
0xc394bee0 0xc0149ca8 do_mount+0x138
                      kernel .text 0xc0100000 0xc0149b70 0xc0149cd0
0xc394bf5c 0xc01dcd67 ssisys_discover_mounts+0x1c7
                      kernel .text 0xc0100000 0xc01dcba0 0xc01dcde0
kdb> ps
Task Addr  Pid      Parent   [*] cpu State Thread     Command
....
0xc3ac4000 00066323 00000001  1  000 stop  0xc3ac4280 rc.nodeup
0xc3a58000 00196674 00066323  1  000 stop  0xc3a58280 rc.sysinit.node
0xc394a000 00196687 00196674  1  000 run   0xc394a280*cmount
kdb> id 0xc01abd5a
0xc01abd5a cfs_get_uniqueid+0x2a: mov 0x88(%eax),%eax
0xc01abd60 cfs_get_uniqueid+0x30: pop %ebp
0xc01abd61 cfs_get_uniqueid+0x31: ret
0xc01abd62 cfs_get_uniqueid+0x32: lea 0x0(%esi,1),%esi
0xc01abd69 cfs_get_uniqueid+0x39: lea 0x0(%edi,1),%edi
...

Extract from System.map

c01abd30 T cfs_get_uniqueid
c01abd70 t cfs_statfs

Output from "objdump --source -d cluster/ssi/cfs/inode.o"
(this kernel was compiled with "-g")

...
00000500 <cfs_get_uniqueid>:

unsigned long
cfs_get_uniqueid(struct vfsmount *mnt, void *raw_data)
{
 500: 55                    push %ebp
 501: 89 e5                 mov %esp,%ebp
 503: 8b 45 0c              mov 0xc(%ebp),%eax
 506: 8b 4d 08              mov 0x8(%ebp),%ecx
        struct cfs_mount_data *data = (struct cfs_mount_data *)raw_data;
        struct cfsmountargs *argp;

        if (data == NULL)
 509: 85 c0                 test %eax,%eax
 50b: 74 15                 je 522 <cfs_get_uniqueid+0x22>
                return mnt->mnt_uniqueid;

        if (data->magic != CFS_MOUNT_MAGIC)
 50d: 81 38 cf cf cf cf     cmpl $0xcfcfcfcf,(%eax)
 513: 75 0d                 jne 522 <cfs_get_uniqueid+0x22>
                return mnt->mnt_uniqueid;

        if (data->mode != CFS_NOTIFY && data->mode != CFS_DISCOVER)
 515: 8b 50 04              mov 0x4(%eax),%edx
 518: 83 fa 01              cmp $0x1,%edx
 51b: 74 0a                 je 527 <cfs_get_uniqueid+0x27>
 51d: 83 fa 03              cmp $0x3,%edx
 520: 74 05                 je 527 <cfs_get_uniqueid+0x27>
                return mnt->mnt_uniqueid;
 522: 8b 41 3c              mov 0x3c(%ecx),%eax
 525: eb 09                 jmp 530 <cfs_get_uniqueid+0x30>

        argp = (struct cfsmountargs *) data->payload;
 527: 8b 40 08              mov 0x8(%eax),%eax
        return argp->uniqueid;
 52a: 8b 80 88 00 00 00     mov 0x88(%eax),%eax
}
 530: 5d                    pop %ebp
 531: c3                    ret
 532: 8d b4 26 00 00 00 00  lea 0x0(%esi,1),%esi
 539: 8d bc 27 00 00 00 00  lea 0x0(%edi,1),%edi
...

Any idea what's going wrong here?

Thanks,
Sven-Olof
From: Bruce W. <br...@ka...> - 2002-08-21 21:57:13
|
Dave Zafman added mount discovery (so all processes on all nodes see
the same mount tree without each node having to do the same mount) in
this latest release and apparently didn't handle "bind" mounts
correctly. He is sick today but will respond soon with a trivial
fix (for now). In the short term, I'd like to suggest we don't
SSI the bind mounts (so they can act sort of like CDSLs). Eventually
we will do either CDSLs or context-dependent "bind" mounts so
we can have node-specific files. At that point, regular "bind" mounts
will automatically be visible clusterwide.

bruce
From: David B. Z. <dav...@hp...> - 2002-08-22 02:07:12
Attachments:
bind.patch
|
Sven,

For some reason I couldn't reproduce your exact problem, but I found other
related issues. I believe that if you apply the attached patch to your
kernel, your problem will go away too. Can you try this and report back?

Also, I recommend adding -n to the --bind mounts of
/etc/sysconfig/network-scripts, so that /etc/mtab doesn't get cluttered.
This is a minor issue.

--
David B. Zafman | Hewlett-Packard Company
mailto:dav...@hp... | http://www.hp.com
"Thus spake the master programmer: When you have learned to snatch
the error code from the trap frame, it will be time for you to leave."
From: Sven-Olof K. <sve...@hp...> - 2002-08-22 18:26:41
|
David,

David B. Zafman wrote:
> For some reason I couldn't reproduce your exact problem, but found other related issues. I believe that if
> you apply the attached patch to your kernel, your problem will go away too. Can you try this and report
> back?

I have applied the patch and it has improved things. I can now perform these steps:

- Boot node-1
  /etc/sysconfig/network-scripts-1 is mounted on /etc/sysconfig/network-scripts
- Boot node-2
- Boot node-3

This works fine. I don't see any oops. Without the patch both node-2 and node-3
would have had an oops.

But I get an oops if I do this:

- Boot node-1
- Boot node-2
- On node-1 I mount another filesystem (mount /dev/hda3 /mnt)
  No problems so far. I can create files under /mnt on both node-1 and node-2
- Boot node-3
  Node-3 gets the same oops, but a little bit later during the boot.
- On node-1 umount /mnt
- Boot node-3 again
  Now node-3 boots without problems.

Here is the console output from node-3 when the oops occurs:

...
This is a NonStop Clusters kernel.
This Cluster Node: 3
Cluster Master Node(s): 1:192.168.1.2

Name server registered with clms
ipcname_read completed
spawn_daemon_proc: Truncated daemon name CFS delrel daemon
Mounting root in linuxrc
Unmounting /proc
Attempting pivot_root
Running post-root cluster initialization
Starting init
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
spawn_daemon_thread: Truncated daemon name ics_accept_connection
/etc/rc.d/nodeup 3 running
Welcome to Red Hat Linux
Unmounting initrd: [ OK ]
Configuring kernel parameters: [ OK ]
Setting clock (utc): Thu Aug 22 20:46:54 CEST 2002 [ OK ]
Activating swap partitions: [ OK ]
Setting hostname host12.net1.home: [ OK ]
Unable to handle kernel paging request at virtual address cda60ef4
*pde = 00000000

Entering kdb (current=0xc3950000, pid 196737) Oops: Oops
due to oops @ 0xc01abdba
eax = 0xcda60e6c ebx = 0xc467a420 ecx = 0xc467a420 edx = 0x00000003
esi = 0xc033355c edi = 0xc3924000 esp = 0xc3951e3c eip = 0xc01abdba
ebp = 0xc3951e3c xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010246
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc3951e08
kdb> bt
EBP        EIP        Function(args)
0xc3951e3c 0xc01abdba cfs_get_uniqueid+0x2a (0xc467a420, 0xc3c31000, 0xc3951eb4)
                      kernel .text 0xc0100000 0xc01abd90 0xc01abdd0
0xc3951e58 0xc0138f96 do_kern_mount+0x166 (0xc3c75000, 0x40000000, 0xc398d000, )
                      kernel .text 0xc0100000 0xc0138e30 0xc0138fd0
0xc3951e8c 0xc01499f8 do_add_mount+0x48 (0xc3951eb4, 0xc3c75000, 0x40000000, 0x)
                      kernel .text 0xc0100000 0xc01499b0 0xc0149ac0
0xc3951ee0 0xc0149cb8 do_mount+0x138
                      kernel .text 0xc0100000 0xc0149b80 0xc0149ce0
0xc3951f5c 0xc01dce07 ssisys_discover_mounts+0x1c7
                      kernel .text 0xc0100000 0xc01dcc40 0xc01dce80
kdb> id 0xc01abdba
0xc01abdba cfs_get_uniqueid+0x2a: mov 0x88(%eax),%eax
0xc01abdc0 cfs_get_uniqueid+0x30: pop %ebp
0xc01abdc1 cfs_get_uniqueid+0x31: ret
0xc01abdc2 cfs_get_uniqueid+0x32: lea 0x0(%esi,1),%esi
0xc01abdc9 cfs_get_uniqueid+0x39: lea 0x0(%edi,1),%edi
kdb> ps
Task Addr  Pid      Parent   [*] cpu State Thread     Command
...
0xc3b2c000 00066377 00000001  1  000 stop  0xc3b2c280 rc.nodeup
0xc3a96000 00196686 00066377  1  000 stop  0xc3a96280 rc.sysinit.node
0xc38de000 00196706 00000001  1  000 stop  0xc38de280 minilogd
0xc3950000 00196737 00196686  1  000 run   0xc3950280*cmount

Thanks,
Sven-Olof
From: Sven-Olof K. <sve...@hp...> - 2002-08-22 19:18:38
|
David,

In addition to my previous post I performed one more test using the kernel
with your patch applied.

Bruce Walker wrote:
> Dave Zafman added mount discovery (so all processes on all nodes see
> the same mount tree without each node having to do the same mount) in
> this latest release and apparently didn't handle "bind" mounts
> correctly. He is sick today but will respond soon with a trivial
> fix (for now). In the short term, I'd like to suggest we don't
> SSI the bind mounts (so they can act sort of like CDSLs). Eventually
> we will do either CDSLs or context dependent "bind" mounts so
> we can have node-specific files. At that point, regular "bind" mounts
> will be automatically clusterwide visible.

I edited /etc/rc.d/rc.sysinit and removed "mount --bind", then repeated the
test where I had mounted an additional filesystem on the master node before
booting node-3. I still get the oops in exactly the same way. Not using
mount --bind does not prevent the oops.

On the master node these filesystems are mounted:
# mount
/dev/hda2 on / type ext2 (rw)
none on /proc type proc (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /devfs type devfs (rw)
/dev/hda3 on /mnt type ext2 (rw)

BTW, it would be nice if CDSLs were implemented in SSI. IMHO CDSLs are a
better alternative than bind mounts.

Thanks,
Sven-Olof
From: David B. Z. <dav...@hp...> - 2002-08-22 20:42:11
|
We don't support devpts in SSI at this point. Please comment out the
mount from /etc/fstab or build a new kernel with CONFIG_DEVPTS_FS
turned off in your .config file.

Tell me if this helps at all.

Sven-Olof Klasson wrote:

<<<<CUT>>>>>

> On the master node these filesystems are mounted:
> # mount
> /dev/hda2 on / type ext2 (rw)
> none on /proc type proc (rw)
> none on /dev/pts type devpts (rw,gid=5,mode=620)
> none on /devfs type devfs (rw)
> /dev/hda3 on /mnt type ext2 (rw)

--
David B. Zafman | Hewlett-Packard Company
Linux Kernel Developer | Open SSI Clustering Project
mailto:dav...@hp... | http://www.hp.com
From: Sven-Olof K. <sve...@hp...> - 2002-08-23 15:39:03
|
David B. Zafman wrote:
> We don't support devpts in SSI at this point. Please comment out the
> mount from the /etc/fstab or build a new kernel with CONFIG_DEVPTS_FS
> turn off in your .config file.
>
> Tell me if this helps at all.

I removed devpts from /etc/fstab, but I still see the problem that node-2
and node-3 get a kernel oops if an additional filesystem is mounted (/mnt).

Here are the mounted filesystems on the master node:
# mount
/dev/hda2 on / type ext2 (rw)
none on /proc type proc (rw)
none on /devfs type devfs (rw)
/dev/hda3 on /mnt type ext2 (rw)

Is there a sample kernel .config file where this works? I could try to
configure my kernel as similarly as possible (given the HW differences) in
case the problem depends on the kernel configuration.

Thanks,
Sven-Olof
From: John B. <joh...@hp...> - 2002-08-23 17:07:46
|
In the ssi CVS repository, the file ssic-linux/ssi-kernel/config.la is
the configuration we use for our local testing.

Sven-Olof Klasson wrote:
> Is there a sample kernel .config file where this works? I could try to
> configure my kernel as similarly as possible (given the HW differences) in
> case the problem depends on the kernel configuration.
From: Brian J. W. <Bri...@hp...> - 2002-08-23 18:37:23
|
John Byrne wrote:
>
> In the ssi CVS repository, the file ssic-linux/ssi-kernel/config.la is
> the configuration we use for our local testing.
>

It's also in the 0.7.0 release.

--
Brian Watson                | "Now I don't know, but I been told it's
Software Developer          |  hard to run with the weight of gold,
Open SSI Clustering Project |  Other hand I heard it said, it's
Hewlett-Packard Company     |  just as hard with the weight of lead."
                            |     -Robert Hunter, 1970

mailto:Bri...@hp...
http://opensource.compaq.com/
From: Aneesh K. K.V <ane...@di...> - 2002-08-24 04:44:59
|
Hi,

Did you mount /mnt with cfs_mount? (If it is ext2 it will already be
taken care of inside the kernel. If not, try using cfs_mount.)

My guess is that discovery of mount points while nodes are joining the
cluster causes problems if /mnt is not mounted with the MS_CFS flag. (I
haven't looked into the code to confirm this; it is a wild guess.)

-aneesh

On Fri, 2002-08-23 at 21:08, Sven-Olof Klasson wrote:
> I removed devpts from /etc/fstab, but I still see the problem that node-2
> and node-3 get a kernel oops if an additional filesystem is mounted (/mnt).
>
> Here are the mounted filesystems on the master node:
> # mount
> /dev/hda2 on / type ext2 (rw)
> none on /proc type proc (rw)
> none on /devfs type devfs (rw)
> /dev/hda3 on /mnt type ext2 (rw)
From: Aneesh K. K.V <ane...@di...> - 2002-08-24 09:41:23
|
Hi,

I tried some configurations. It looks like if we don't specify the MS_CFS
flag, that is, if the partition is mounted by the normal mount, it doesn't
take part in mount discovery at all when a node joins. So what I said in
the previous mail is not correct: the oops cannot be due to a missing MS_CFS
flag. Also, the 'mount' output in Sven's mail shows that /mnt is ext2, so
it would already have used the MS_CFS flag.

What trace is kdb giving now?

-aneesh
From: Sven-Olof K. <sve...@hp...> - 2002-08-25 12:14:14
|
Hi all, I have tried few things, but I still have the problem that node-2 and node-3 gets a kernel oops during boot if an additional filesystem is mounted. Here is what I have tried so far. I used the config.la configration file. I only had to add two drivers for ethernet cards I needed (CONFIG_NE2K_PCI=m and CONFIG_SIS900=m). This is the only diffrence. I'm no longer using devpts I'm no longer using "mount --bind" I built a kernel without the patch (bind.patch) David Zafman sent just in case if there could be a problem with the patch. I have upgraded to the latest RPM-packages for RedHat 7.2 (amongst the packages was gcc) and recompiled the kernel. I have mounted the additional filesystem with cfs_mount instead of mount. As Aneesh pointed out I'm using ext2 filesystems so this should not matter, but I tried it anyway. This has made any diffrence, I still see the problem. Hare are some details about the oops. If I boot the master node, then node-2 and node-3 with only the root filesystem mounted I don't see the problem. With all nodes up I can mount (mount /dev/hda3 /mnt) the filesystem on the master node and it's visible and accessable on all nodes. The kernel oops only happens when node-2 or node-3 boots (I have a 3-node cluster) and /mnt had previously been mounted on the master node. The kernel oops during the second time /usr/sbin/cmount runs in /etc/rc.d/rc.sysinit.nodeup. cmount is called three times in this script. The first time cmount runs there is no oops. The oops allways occur during the second cmount. (It did oops on the first run of cmount when I used "mount --bind" and I did not have David Zafmans patch. I don't use "mount --bind" anylonger). The oops allways happens at cfs_get_uniqueid+0x2a and the traceback looks the same. This is how it looks like on the console and in kdb (I have added a few echo statements to /etc/rc.d/rc.sysinit.nodeup). ... 
/etc/rc.d/nodeup 3 running DEBUG: now running /etc/rc.d/rc.sysinit.nodeup Welcome to Red Hat Linux DEBUG: about to run /usr/sbin/cmount 1st time Unmounting initrd: [ OK ] Configuring kernel parameters: [ OK ] Setting clock (utc): Sun Aug 25 00:55:04 CEST 2002 [ OK ] Activating swap partitions: [ OK ] Setting hostname host12.net1.home: [ OK ] DEBUG: about to run /usr/sbin/cmount 2nd time Unable to handle kernel paging request at virtual address cf150454 *pde = 00000000 Entering kdb (current=0xc4784000, pid 196726) on processor 0 Oops: Oops due to oops @ 0xc01d7a4a eax = 0xcf1503cc ebx = 0xc114ede0 ecx = 0xc114ede0 edx = 0x00000003 esi = 0xc03bc71c edi = 0xc475a000 esp = 0xc4785e3c eip = 0xc01d7a4a ebp = 0xc4785e3c xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010246 xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff ®s = 0xc4785e08 [0]kdb> id 0xc01d7a4a 0xc01d7a4a cfs_get_uniqueid+0x2a: mov 0x88(%eax),%eax 0xc01d7a50 cfs_get_uniqueid+0x30: pop %ebp 0xc01d7a51 cfs_get_uniqueid+0x31: ret 0xc01d7a52 cfs_get_uniqueid+0x32: lea 0x0(%esi,1),%esi 0xc01d7a59 cfs_get_uniqueid+0x39: lea 0x0(%edi,1),%edi .... [0]kdb> bt EBP EIP Function(args) 0xc4785e3c 0xc01d7a4a cfs_get_uniqueid+0x2a (0xc114ede0, 0xc3945000, 0xc4785eb4) kernel .text 0xc0100000 0xc01d7a20 0xc01d7a60 0xc4785e58 0xc0144887 do_kern_mount+0x167 (0xc3d60000, 0x40000000, 0xc3cf4000, 0xc3945000, 0xc3cf4000) kernel .text 0xc0100000 0xc0144720 0xc01448c0 0xc4785e8c 0xc0156b78 do_add_mount+0x48 (0xc4785eb4, 0xc3d60000, 0x40000000, 0x0, 0xc3cf4000) kernel .text 0xc0100000 0xc0156b30 0xc0156c40 0xc4785ee0 0xc0156e38 do_mount+0x138 kernel .text 0xc0100000 0xc0156d00 0xc0156e60 0xc4785f5c 0xc020fdc7 ssisys_discover_mounts+0x1c7 kernel .text 0xc0100000 0xc020fc00 0xc020fe40 [0]kdb> ps Task Addr Pid Parent [*] cpu State Thread Command ... 
0xc3b78000 00066367 00000001 0 000 stop 0xc3b78370 rc.nodeup
0xc3ae6000 00196675 00066367 0 000 stop 0xc3ae6370 rc.sysinit.node
0xc3986000 00196695 00000001 0 000 stop 0xc3986370 minilogd
0xc4784000 00196726 00196675 1 000 run 0xc4784370*cmount
[0]kdb>

Any ideas on how I can proceed to find the cause?

Aneesh suggested trying the latest code from CVS. I have not done that yet, but I think I will give it a try.

Thanks,
Sven-Olof
|
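A note on reading the kdb output in this report: the faulting instruction is `mov 0x88(%eax),%eax`, so the address the CPU tried to read should be the value in eax plus 0x88. The short Python sketch below checks that arithmetic against the register dumps from both oops reports in this thread. The helper name `effective_address` is made up for illustration and is not part of kdb or the SSI code.

```python
def effective_address(base: int, displacement: int) -> int:
    """Address read by `mov disp(%reg),...` on 32-bit x86 (wraps at 2**32)."""
    return (base + displacement) & 0xFFFFFFFF

# (eax, faulting virtual address) as printed in the two oops reports
oopses = [
    (0xCE17E9CC, 0xCE17EA54),  # first report in the thread
    (0xCF1503CC, 0xCF150454),  # second report (this message)
]
DISPLACEMENT = 0x88  # from `mov 0x88(%eax),%eax`

# True where eax + 0x88 equals the address the kernel complained about
matches = [effective_address(eax, DISPLACEMENT) == fault for eax, fault in oopses]
print(matches)
```

In both cases the sum equals the reported faulting address, which suggests the value in eax itself is the bad pointer and the fault comes from reading a field at offset 0x88 from it.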
From: David B. Z. <dav...@hp...> - 2002-08-26 17:58:22
Attachments:
discover.patch
|
Thanks for the very thorough testing information. It helped me to realize what the problem is. My cluster had the exact same problem and a similar configuration, but possibly because I have 4 of exactly the same machines, the bogus pointer wasn't an invalid address, so I didn't notice the problem. Is your node-1 quite different hardware/RAM from node-2 and node-3? Just curious.

Attached is the fix for the problem. You should be able to have /mnt mounted at all times. With my --bind patch you should be able to do the bind mounts again.

--
David B. Zafman | Hewlett-Packard Company
Linux Kernel Developer | Open SSI Clustering Project
mailto:dav...@hp... | http://www.hp.com
"Thus spake the master programmer: When you have learned to snatch the error code from the trap frame, it will be time for you to leave."
|
From: Sven-Olof K. <sve...@hp...> - 2002-08-26 20:06:16
|
David,

David B. Zafman wrote:
> Thanks for very thorough testing information. It helped me to realize
> what the problem is. My cluster had the exact same problem and a
> similiar configuration but possibly because I have 4 of exactly the same
> machines the bogus pointer wasn't an invalid address, so I didn't notice
> the problem. Is your node-1 quite different hardware/RAM from node-2
> and node-3? Just curious.

You're right. My machines are quite different:

Node-1 (master node)  Intel Celeron 366MHz  256MB RAM
Node-2  Intel Celeron 566MHz  64MB RAM
Node-3  Intel "classic" Pentium 75MHz  72MB RAM (a 7 year old machine)

> Attached is the fix for the problem. You should be able to have /mnt
> mounted at all times. With my --bind patch you should be able to do the
> bind mounts again.

I have applied the patch and I can confirm that it fixes the problem!

Thanks!!!
Sven-Olof
|
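One possible reading of why the RAM sizes matter, offered as a guess rather than something the thread confirms: on a 32-bit x86 kernel with the usual 3G/1G split, physical RAM is direct-mapped starting at 0xC0000000, so a node with more RAM has a larger range of valid kernel addresses. The Python sketch below tests the bad eax value from the second oops (0xcf1503cc) against each node's range. The `PAGE_OFFSET` value and the simple model (installed RAM only, ignoring vmalloc and highmem) are assumptions, and `in_direct_map` is a made-up helper.

```python
PAGE_OFFSET = 0xC0000000  # assumed: default 3G/1G split on i386
MIB = 1024 * 1024

def in_direct_map(addr: int, ram_mib: int) -> bool:
    """True if addr lies in the kernel's direct mapping of ram_mib MiB of RAM."""
    return PAGE_OFFSET <= addr < PAGE_OFFSET + ram_mib * MIB

# RAM sizes as reported in this message
nodes = {"node-1": 256, "node-2": 64, "node-3": 72}
bogus_eax = 0xCF1503CC  # eax from the second oops report

# Which nodes could dereference this pointer without faulting?
result = {name: in_direct_map(bogus_eax, ram) for name, ram in nodes.items()}
print(result)
```

Under these assumptions the pointer falls inside the 256MB master's direct map but outside that of the 64MB and 72MB nodes. That would fit David's remark that on identical machines the bogus pointer "wasn't an invalid address" and so did not fault.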