Re: [SSI-users] DRBD failover issue, kernal panic no init found
From: Roger T. <rog...@gm...> - 2006-07-12 22:33:15
Sounds like you still have that mount option in your initrd.

Roger

On 7/12/06, Brown, Larry <Lar...@ni...> wrote:
> Is that it? Does anyone working on the cfs_root_failover code follow
> this list? I am at the end of my abilities to troubleshoot this
> problem. It is reproducible between the two clusters we created and
> the error is one that apparently no one else has seen. Should we give
> up the ghost on this? Is there some other test we can use to identify
> where this goes off the tracks?
>
> Larry Brown
> Network Engineer
>
> -----Original Message-----
> From: ssi...@li... [mailto:ssi...@li...] On Behalf Of Brown, Larry
> Sent: Tuesday, July 11, 2006 4:28 PM
> To: Swami, Vijay; Roger Tsang
> Cc: ssi...@li...
> Subject: Re: [SSI-users] DRBD failover issue, kernal panic no init found
>
> I took this box, removed errors=remount-ro from fstab and rebooted the
> cluster. The cluster came back up, I verified from the output of mount
> that the errors option was not present, and I killed the master. The
> slave gave the same error message:
>
> EXT3-fs: the errors option needs an argument
> cfs_root_failover: error -22
> cfs_root_failover error -22
>
> I've checked the source code for cfs_root_failover and it is responding
> to the EXT3-fs failure message. I also verified that error 22 is a bad
> option code.
>
> Larry Brown
> Network Engineer
>
> -----Original Message-----
> From: ssi...@li... [mailto:ssi...@li...] On Behalf Of Vijay Swami
> Sent: Sunday, July 09, 2006 4:39 PM
> To: Roger Tsang
> Cc: ssi...@li...
> Subject: Re: [SSI-users] DRBD failover issue, kernal panic no init found
>
> Ok, did a re-install on another box, and seems like I'm getting closer.
> Errors are attached in the JPEG.
>
> Seems to indicate there is no errors=remount-ro in /etc/fstab, but it's
> there. I don't understand why EXT3-fs is complaining there are no
> options to errors, when they are clearly there, and mounted properly.
>
> # mount
> /dev/1/drbd/0 on / type ext3 (rw,chard,errors=remount-ro)
>
> # grep drbd /etc/fstab
> /dev/drbd/0 / ext3 errors=remount-ro,chard,defaults,node=1:2 0 1
>
> Any ideas?
>
> Thanks.
>
> /vijay
>
> On Sat, 2006-07-08 at 11:53 -0400, Roger Tsang wrote:
> > Have you checked whether you get the DRBD register message on your
> > *slave* machine?
> >
> > Roger
> >
> > On 7/8/06, Vijay Swami <vij...@ni...> wrote:
> > > I am running drbd-ssi. I checked and I do get the DRBD register
> > > message, and the other two when the machine boots. I was kind of
> > > hoping that was the problem, as it makes sense. :)
> > >
> > > Here is the interesting thing... when I fail the *slave*, I do see
> > > the proper DRBD messages on the master, i.e. the DRBD nodedown.
> > > However, that doesn't happen when I fail the *master*.
> > >
> > > Also, this is RHEL3, using drbd-ssi for FC2. Although one is based
> > > off the other, I'm wondering if there is something weird going on
> > > there. For instance, the drbd-ssi mkinitrd flat out does not work
> > > for me out of the box. If I do mkinitrd --cfs --drbd the initrd it
> > > generates will not boot. It doesn't even seem to be executing
> > > linuxrc properly. I.e., I put echo 'Test' as the first line of
> > > linuxrc, and it prints "Kernel Panic: No init found" before even
> > > printing that line.
> > >
> > > Very odd.
> > >
> > > I'm about to try Debian Sarge on the machine, and see how that goes
> > > with DRBD. Or perhaps hack FC2 kernel to work with my hardware
> > > (mainly SCSI driver is needed).
> > >
> > > I suspect while FC2, OpenSSI 1.2.2, drbd-ssi-1.2.2-20050712 are a
> > > known working combo, there might be something off with RHEL3.
> > >
> > > /vijay
> > >
> > > On Sat, 2006-07-08 at 00:48 -0400, Roger Tsang wrote:
> > > > Hi,
> > > >
> > > > I see your screenshot. I don't see any drbd0 nodedown messages
> > > > there, so there are a few possibilities:
> > > >
> > > > 1. You are running vanilla drbd.
> > > > 2. You are running drbd-ssi, but drbd-ssi never registered your
> > > >    device as a CLMS service. Check that drbd registered with
> > > >    OpenSSI by looking for the following console messages when
> > > >    you first start drbd.
> > > >
> > > > drbd0: drbdsetup [197114]: cstate Unconfigured --> StandAlone
> > > > drbd0: drbdsetup [197117]: cstate StandAlone --> Unconnected
> > > > drbd0: drbd0_receiver [197118]: cstate Unconnected --> WFConnection
> > > > drbd0: Registering drbd0 with CLMS subsystem
> > > >
> > > > Roger
> > > >
> > > > On 7/7/06, Vijay Swami <vij...@ni...> wrote:
> > > > > I don't think it's an SSIfailover, or rc.sysrecover issue. It
> > > > > seems that it tries to re-spawn INIT, without failing over DRBD
> > > > > first. If I try cfs_setroot ext3 /dev/drbd/0 on a node, nothing
> > > > > seems to happen. Is that correct behavior?
> > > > >
> > > > > I'm attaching a JPEG screen-shot of the error. Hopefully it goes
> > > > > through.
> > > > >
> > > > > /vijay
> > > > >
> > > > > On Fri, 2006-07-07 at 17:56 -0400, Roger Tsang wrote:
> > > > > > I already migrated to SSI-1.9.x with the latest DRBD which is
> > > > > > based on drbd-0.7.20 but haven't checked in the update yet.
> > > > > > The latest one in CVS is based on drbd-0.7.19 and works fine
> > > > > > unless you changed your drbd devices' al-extents parameter.
> > > > > > In that case you would want drbd-ssi based on drbd-0.7.20.
> > > > > >
> > > > > > It seems to me, without much info, your SSIfailover and
> > > > > > rc.sysrecover system scripts weren't set up properly. You have
> > > > > > to modify them for drbd for root filesystem failover. The
> > > > > > drbd-ssi-1.2.2-20050712 tarball should include a sample of
> > > > > > these two system scripts.
> > > > > >
> > > > > > Roger
> > > > > >
> > > > > > On 7/7/06, Vijay Swami <vij...@ni...> wrote:
> > > > > > > Roger,
> > > > > > >
> > > > > > > It's drbd-ssi-1.2.2-20050712.
> > > > > > > It seems like as soon as the master goes down, the slave
> > > > > > > does not resort to using its own copy. I don't see any
> > > > > > > EXT3 FSCK messages like I've seen by searching the list for
> > > > > > > successful fail over messages. The error is most certainly
> > > > > > > that it can't find /sbin/init, probably because once the
> > > > > > > connection to the master dies, it's not properly using its
> > > > > > > 'own' copy.
> > > > > > >
> > > > > > > I'm wondering if this is related to the Debian /dev/drbd0 vs
> > > > > > > /dev/drbd/0 issue that I've read about. I noticed you said
> > > > > > > you ran into this problem on FC2, which is what RHEL3 is
> > > > > > > based on. Were you using OpenSSI 1.2.x or 1.9.x and which
> > > > > > > version of DRBD?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > /vijay
> > > > > > >
> > > > > > > On Thu, 2006-07-06 at 22:50 -0400, Roger Tsang wrote:
> > > > > > > > Are you using drbd-ssi (openSSI modified drbd) and have
> > > > > > > > enabled SSI-failover service?
> > > > > > > >
> > > > > > > > If you like to try the latest drbd-ssi I'll be making a
> > > > > > > > drbd-ssi tarball based on drbd-0.7.20, hopefully sometime
> > > > > > > > this week.
> > > > > > > >
> > > > > > > > Roger
> > > > > > > >
> > > > > > > > On 7/6/06, Vijay Swami <vij...@ni...> wrote:
> > > > > > > > > Pertinent details:
> > > > > > > > > * 2 node cluster
> > > > > > > > > * RHEL3
> > > > > > > > > * Kernel/OpenSSI ver: 2.4.21-27.0.2.EL_ssi_2 (install from RPM)
> > > > > > > > > * drbd version 0.7.11 (api:77)
> > > > > > > > > * SunFire X4100 2 CPU AMD Opteron hardware
> > > > > > > > >
> > > > > > > > > drbd.conf snippet:
> > > > > > > > >
> > > > > > > > > on host1 {
> > > > > > > > >     device /dev/drbd/0;
> > > > > > > > >     disk /dev/sda2;
> > > > > > > > >     nodenum 1;
> > > > > > > > >     address 192.168.1.1:7788;
> > > > > > > > >     meta-disk /dev/sda3[0];
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > on host2 {
> > > > > > > > >     device /dev/drbd/0;
> > > > > > > > >     disk /dev/sda2;
> > > > > > > > >     nodenum 2;
> > > > > > > > >     address 192.168.1.2:7788;
> > > > > > > > >     meta-disk /dev/sda3[0];
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > /etc/fstab:
> > > > > > > > >
> > > > > > > > > /dev/drbd/0  /         ext3    defaults,chard,errors=remount-ro,node=1:2  0 1
> > > > > > > > > /dev/sda1    /boot     ext3    defaults,node=1:2                          1 2
> > > > > > > > > none         /dev/pts  devpts  gid=5,mode=620,node=*                      0 0
> > > > > > > > > none         /proc     proc    defaults,node=*                            0 0
> > > > > > > > > #none        /dev/shm  tmpfs   defaults                                   0 0
> > > > > > > > >
> > > > > > > > > df output:
> > > > > > > > >
> > > > > > > > > Filesystem     1K-blocks     Used Available Use% Mounted on
> > > > > > > > > /dev/1/drbd/0   10080520  2329744   7238708  25% /
> > > > > > > > > /dev/1/sda1       248895    61612    174433  27% /boot
> > > > > > > > >
> > > > > > > > > The root file system is configured with DRBD, and
> > > > > > > > > working great. Each node can act as an init node when
> > > > > > > > > booted up individually. The other node joins the
> > > > > > > > > cluster, and things work great.
> > > > > > > > >
> > > > > > > > > However, I'm having trouble with the failover.
> > > > > > > > >
> > > > > > > > > If I 'unplug' the master node, the slave recognizes the
> > > > > > > > > master has gone down, and attempts to take over. It
> > > > > > > > > prints DRBD timeout messages to the console (as
> > > > > > > > > expected), then:
> > > > > > > > >
> > > > > > > > > ipcnameserver ready
> > > > > > > > > Kernel Panic: no init found. Cannot restart
> > > > > > > > >
> > > > > > > > > .. and then it reboots.
> > > > > > > > >
> > > > > > > > > I have a feeling it's something fairly simple here, as
> > > > > > > > > DRBD itself works fine when the nodes are booted. I'm
> > > > > > > > > assuming it's looking for /sbin/init to start on the
> > > > > > > > > slave node but can't find it? I don't know why, since
> > > > > > > > > DRBD itself works on either node acting as an init node
> > > > > > > > > when they are booted, regardless of sequence.
> > > > > > > > >
> > > > > > > > > Any ideas?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > /vijay
> > > > > > > > >
> > > > > > > > > Using Tomcat but need to do more? Need to support web
> > > > > > > > > services, security? Get stuff done quickly with
> > > > > > > > > pre-integrated technology to make your job easier.
> > > > > > > > > Download IBM WebSphere Application Server v.1.0.1 based
> > > > > > > > > on Apache Geronimo
> > > > > > > > > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> > > > > > > > > _______________________________________________
> > > > > > > > > Ssic-linux-users mailing list
> > > > > > > > > Ssi...@li...
> > > > > > > > > https://lists.sourceforge.net/lists/listinfo/ssic-linux-users
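
For anyone hitting the same message: the EXT3-fs complaint suggests that an errors token without a value reached the ext3 option parser, and -22 is -EINVAL ("Invalid argument") on Linux. What follows is a minimal Python sketch, not taken from this thread or from the OpenSSI sources, that checks a comma-separated mount option string for a bare or invalid errors option. The helper name check_errors_option and the third sample string are hypothetical; the first two strings are the fstab entry and the mount output quoted above. It assumes you can get hold of the option string that is actually used at failover time (for example, the copy kept in the initrd); it does not unpack or inspect an initrd itself.

```python
import errno
from typing import List

# Values ext3 accepts for the errors= mount option.
EXT3_ERRORS_VALUES = {"continue", "remount-ro", "panic"}


def check_errors_option(options: str) -> List[str]:
    """Return a list of problems found with 'errors' in a mount option string.

    Hypothetical helper for illustration only; an empty list means the
    errors option is either absent or well formed.
    """
    problems = []
    for opt in options.split(","):
        name, sep, value = opt.strip().partition("=")
        if name != "errors":
            continue
        if not sep or not value:
            # A bare "errors" or "errors=" has nothing for ext3 to act on.
            problems.append("errors option has no argument")
        elif value not in EXT3_ERRORS_VALUES:
            problems.append(f"errors={value} is not a valid ext3 value")
    return problems


if __name__ == "__main__":
    samples = (
        "errors=remount-ro,chard,defaults,node=1:2",  # fstab entry from the thread
        "rw,chard,errors=remount-ro",                 # mount output from the thread
        "rw,chard,errors",                            # hypothetical malformed string
    )
    for opts in samples:
        print(opts, "->", check_errors_option(opts) or "ok")

    # Symbolic name for the -22 reported by cfs_root_failover.
    print(errno.errorcode[22])
```

If the copy of the root mount options that ends up in the initrd differs from the one in /etc/fstab, running both strings through a check like this may show which one is malformed.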