Thread: RE: [SSI-users] Xen Cluster & DRBD
From: Owen C. <ow...@em...> - 2006-04-25 10:35:23
|
Once the master node has gone, I can no longer ping the CVIP address - even if I bring that node back up again as secondary. The only way to get it back is to restart the entire cluster.

I thought the 'fsck.ext3: No such file or directory while trying to open /dev/drbd0.........' looked serious enough to be the cause. However, I've now tried editing some test files and crashing nodes and the drbd setup seems to be working ok. Thanks for the wake-up call!!

My cvip.conf has <director_node> and <real_server_node> sections for the second node already.

Any ideas as to where else I should look for the problem?

Owen

-----Original Message-----
From: Roger Tsang [mailto:rog...@gm...]
Sent: 25 April 2006 03:03
To: Owen Campbell
Cc: ssi...@li...
Subject: Re: [SSI-users] Xen Cluster & DRBD

What kinda problem are you having? nodedown completed.

Roger

On 4/24/06, Owen Campbell <ow...@em...> wrote:
> Can anyone help to get my cluster of Xen virtual machines to failover on
> failure of the root node?......
>
> This is a debian sarge based system (both the dom0 and domU's).
>
> The initrd was created with devices labeled as /dev/drbd/0 in drbd.conf and
> fstab. drbd.conf was then put back to using /dev/drbd0. I've tried both
> formats in fstab, but with no difference to the results.
>
> I've also tried editing the initrd to remove all trace of /dev/drbd/0, but
> it also made no difference.
>
> Everything works fine, except failover when the root node goes down. Then I
> get:
>
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [131278]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [131271]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: worker terminated
> drbd0: drbd0_receiver [131271]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [131271]: cstate Unconnected --> WFConnection
> Taking over master from node 1.
> Node 1 has gone down!!!
> passed the first scan in ipcname_pull_data
> num_objects[MSG] = 0
> num_objects[SEM] = 0
> num_objects[SHM] = 0
> ipcnameserver ready completed
> drbd0: drbd_nodedown: Signaling receiver thread.
> drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in drivers/block/drbd/drbd_fs.c:702
> drbd0: Secondary/Unknown --> Primary/Unknown
> drbd0: Doing CLMS nodedown callback for service 9
> EXT3-fs: INFO: recovery required on readonly filesystem.
> EXT3-fs: write access will be enabled during recovery.
> write handler down off 470000 len 10000
> kjournald starting. Commit interval 5 seconds
> EXT3-fs: recovery complete.
> EXT3-fs: mounted filesystem with ordered data mode.
> fsck 1.35 (28-Feb-2004)
> ERROR: Couldn't open /dev/null (No such file or directory)
> e2fsck 1.35 (28-Feb-2004)
> fsck.ext3: No such file or directory while trying to open /dev/drbd0
> The superblock could not be read or does not describe a correct ext2
> filesystem. If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
> e2fsck -b 8193 <device>
> EXT3 FS on drbd0, internal journal
> /etc/init.d/rc.sysrecover running
> ssi-ntpsetrefclk: ntpd is not running; not setting refclk
> INIT: version 2.86-SSI reloading
> INIT: cannot execute "/sbin/getty"
> INIT: Sending processes the TERM signal
> INIT: Sending processes the KILL signal
> INIT: Pid 131747 [id siR] seems to hang
> /etc/init.d/rc.nodedown 1 running
> fsck 1.35 (28-Feb-2004)
> INIT: +++ nodedown completed on node 1
>
> Any help, much appreciated!!!!!
>
> Owen
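For readers following the failover log quoted above, the DRBD connection-state changes can be pulled out mechanically. This is a minimal Python sketch, assuming the kernel messages keep the `cstate A --> B` form seen in that log; the function name `cstate_transitions` is made up for illustration and is not part of DRBD or OpenSSI.

```python
import re

# Matches lines such as:
#   drbd0: drbd0_receiver [131271]: cstate BrokenPipe --> Unconnected
# (format assumed from the log quoted in this thread)
CSTATE_RE = re.compile(r"cstate\s+(\w+)\s+-->\s+(\w+)")


def cstate_transitions(log_text):
    """Return (old_state, new_state) pairs in the order they appear."""
    return CSTATE_RE.findall(log_text)


sample = """\
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [131278]: cstate Connected --> NetworkFailure
drbd0: drbd0_receiver [131271]: cstate NetworkFailure --> BrokenPipe
drbd0: drbd0_receiver [131271]: cstate BrokenPipe --> Unconnected
drbd0: drbd0_receiver [131271]: cstate Unconnected --> WFConnection
"""

for old, new in cstate_transitions(sample):
    print(f"{old} -> {new}")
```

In the quoted log the chain ends in WFConnection and is followed by "Secondary/Unknown --> Primary/Unknown", which fits Owen's later observation that the DRBD side of the failover seems to work.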
From: Owen C. <ow...@em...> - 2006-04-26 08:37:49
|
OK, after a little more investigation, I think I have the source of the problem....

When the second node boots, it doesn't get a 'default' entry in the routing table. If I restart /etc/init.d/networking manually, everything is setup properly and the failover works ok.

I'm guessing the initial network setup is controlled by the root node? Can anybody shed any light on how I get the routing table set up properly without the manual intervention?

Thanks,
Owen
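One way to confirm the missing 'default' entry on a freshly booted node, without reading `route -n` by eye, is to parse the kernel routing table text. This is a minimal Python sketch, assuming the standard Linux `/proc/net/route` layout (whitespace-separated columns, hex little-endian addresses); the function name `default_routes` and the sample table are illustrative only.

```python
import socket
import struct


def hex_to_ip(hex_addr):
    """Convert a little-endian hex address from /proc/net/route to dotted quad."""
    return socket.inet_ntoa(struct.pack("<L", int(hex_addr, 16)))


def default_routes(route_text):
    """Return (interface, gateway) for every default route in the table text."""
    found = []
    for line in route_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, dest, gateway, mask = fields[0], fields[1], fields[2], fields[7]
        # A default route has destination 0.0.0.0 and netmask 0.0.0.0.
        if dest == "00000000" and mask == "00000000":
            found.append((iface, hex_to_ip(gateway)))
    return found


# Made-up sample in the /proc/net/route column layout.
SAMPLE = """\
Iface Destination Gateway  Flags RefCnt Use Metric Mask     MTU Window IRTT
eth0  0001A8C0    00000000 0001  0      0   0      00FFFFFF 0   0      0
eth0  00000000    0101A8C0 0003  0      0   0      00000000 0   0      0
"""

print(default_routes(SAMPLE))
```

On a live node the same function can be fed the contents of `/proc/net/route`; an empty list right after boot would match the symptom described above.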
From: stefan <st...@fu...> - 2006-04-27 09:55:56
|
hello Owen,

> OK, after a little more investigation, I think I have the source of the
> problem....
>
> When the second node boots, it doesn't get a 'default' entry in the
> routing table. If I restart /etc/init.d/networking manually, everything
> is setup properly and the failover works ok.

did you solve the problem? I have the same issue here with vmware.

tia
stefan
From: E. S. V. <ven...@ms...> - 2006-04-27 16:52:14
|
>When the second node boots, it doesn't get a 'default' entry in the
>routing table. If I restart /etc/init.d/networking manually, everything
>is setup properly and the failover works ok.
>
>I'm guessing the initial network setup is controlled by the root node?
>Can anybody shed any light on how I get the routing table set up
>properly without the manual intervention?

Could it be that the file "/cluster/node2/etc/network/interfaces" has only:

auto lo
iface lo inet loopback

but no entry for eth0? This was the case for my 1.9.1 cluster on Debian Sarge.

Venkat
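The check suggested here (does the node's interfaces file configure anything besides lo?) can be scripted. This is a minimal Python sketch, assuming the usual Debian `interfaces` syntax where each stanza starts with `iface <name> <family> <method>`; the function name `iface_stanzas` is hypothetical, and the sketch ignores `source` lines, mappings and per-stanza options.

```python
def iface_stanzas(text):
    """Map interface name -> (address family, method) for each iface stanza."""
    stanzas = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # blank lines and comment lines
            continue
        words = line.split()
        if len(words) >= 4 and words[0] == "iface":
            stanzas[words[1]] = (words[2], words[3])
    return stanzas


# The loopback-only file described in the message above.
LO_ONLY = """\
auto lo
iface lo inet loopback
"""

configured = iface_stanzas(LO_ONLY)
for name in ("lo", "eth0"):
    print(name, configured.get(name, "no stanza"))
```

Pointing it at a per-node file such as the "/cluster/node2/etc/network/interfaces" path mentioned above would show whether eth0 has a stanza at all; it says nothing about whether the stanza is actually applied at boot.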
From: stefan <st...@fu...> - 2006-04-28 14:29:35
|
On Thursday, 27 April 2006 18:51, E. S. Venkatraman wrote:
> >When the second node boots, it doesn't get a 'default' entry in the
> >routing table. If I restart /etc/init.d/networking manually, everything
> >is setup properly and the failover works ok.
> >
> >I'm guessing the initial network setup is controlled by the root node?
> >Can anybody shed any light on how I get the routing table set up
> >properly without the manual intervention?
>
> Could it be that the file "/cluster/node2/etc/network/interfaces" has only:
>
> auto lo
> iface lo inet loopback
>
> but no entry for eth0. This was the case for my 1.9.1 cluster on Debian
> Sarge.

This was the first thing I tried, and it does not solve the problem. I use OpenSSI 1.2. The secondary node has no network unless I start networking manually. I don't know what is going wrong!

tia
stefan
From: stefan <st...@fu...> - 2006-04-28 17:34:52
|
okay, when I boot the secondary node I see that networking is loaded successfully. But then LVS_HA starts and gives this error:

LVS_HA failed to open /proc/sys/net/ipv4/conf/lo/hidden

Maybe this is the point where network/routing is broken? Is that message normal, or do I have to do something?

tia
stefan
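The error names a specific file, so a first step is simply to see whether that entry exists on the node. This is a minimal Python sketch under stated assumptions: the helper name `has_hidden_flag` is made up, and the `conf_root` parameter exists only so the check can be pointed at a different directory. Whether a missing `hidden` entry has anything to do with the routing problem is not established anywhere in this thread.

```python
import os


def has_hidden_flag(iface, conf_root="/proc/sys/net/ipv4/conf"):
    """Return True if <conf_root>/<iface>/hidden exists as a file."""
    return os.path.isfile(os.path.join(conf_root, iface, "hidden"))


for name in ("lo", "eth0"):
    state = "present" if has_hidden_flag(name) else "missing"
    print(f"{name}: hidden flag {state}")
```

If the entry is missing for lo, that at least explains why LVS_HA cannot open it; it does not by itself show that LVS_HA is what breaks networking.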
From: Owen C. <ow...@em...> - 2006-04-27 12:36:46
|
I installed openssi-webview and this fixed it!!!

I can't really claim to have solved it, since I don't really know how the fix worked. My guess is that it's something to do with the fact that openssi-webview installs the dhcp server.

Owen
From: Kilian C. <kil...@li...> - 2006-04-27 16:25:29
|
On Thursday 27 April 2006 14:36, Owen Campbell wrote:
> I installed openssi-webview and this fixed it!!!

Duh? Weirdest thing of the day. :)

> I can't really claim to have solved it, since I don't really know how
> the fix worked. My guess is that it's something to do with the fact that
> openssi-webview installs the dhcp server.

No chance.

# dpkg -I openssi-webview_0.2-3_all.deb
[..]
Depends: openssi, debconf (>= 0.5), php4 (>= 4.1.0) | libapache2-mod-php4 (>= 4.1.0), rrdtool, apache | apache-ssl | apache-perl | apache2, cron | anacron, ucf
Recommends: sudo
Suggests: php4-rrdtool
[...]

I can't see why openssi-webview would have installed a DHCP server. By the way, dhcp3-server is a dependency of the openssi package, so it should have been already installed.

--
Kilian CAVALOTTI
Administrateur réseaux et systèmes
UPMC / CNRS - LIP6 (C870)
8, rue du Capitaine Scott, 75015 Paris - France
Tel. : 01 44 27 88 54
Fax. : 01 44 27 70 00
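The point about the Depends line can be checked mechanically rather than by eye. This is a minimal Python sketch, assuming the usual Debian control-field syntax (comma-separated clauses, `|` for alternatives, version constraints in parentheses); the function name `parse_depends` is hypothetical and the sketch looks only at direct dependencies, not at what those packages pull in themselves.

```python
import re


def parse_depends(value):
    """Split a Depends field into a list of alternative groups of package names."""
    groups = []
    for clause in value.split(","):
        # Drop version constraints such as "(>= 4.1.0)" and surrounding spaces.
        alternatives = [re.sub(r"\s*\(.*?\)", "", alt).strip()
                        for alt in clause.split("|")]
        alternatives = [alt for alt in alternatives if alt]
        if alternatives:
            groups.append(alternatives)
    return groups


# Depends field of openssi-webview 0.2-3 as quoted in the message above.
DEPENDS = ("openssi, debconf (>= 0.5), php4 (>= 4.1.0) | libapache2-mod-php4 "
           "(>= 4.1.0), rrdtool, apache | apache-ssl | apache-perl | apache2, "
           "cron | anacron, ucf")

names = {pkg for group in parse_depends(DEPENDS) for pkg in group}
print("dhcp3-server" in names)
```

A transitive dependency, for example through the openssi package named in the message, would still need to be followed separately.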
From: Owen C. <ow...@em...> - 2006-04-27 16:36:11
|
My cluster is made up of Xen virtual machines so I'm using openssi-xen - and that doesn't seem to have dhcp3-server in its dependencies.
From: Owen C. <ow...@em...> - 2006-04-28 10:28:56
|
No - it was set up correctly. The complication is that I use dhcp with reservations on the dhcp server rather than static addresses, so my interfaces file looks like:

auto eth0
iface eth0 inet dhcp

Without openssi-webview, the secondary node came up with the correct address, but no default route. If I then ran /etc/init.d/networking manually once the node had booted, the route was added and everything worked fine.

Installing openssi-webview (which also installed dhcp3-server) meant that the secondary node also got its default route correctly on boot and the manual step was no longer needed.

A little bizarre, but once I got it working, I stopped looking into it!!