From: Florent C. <Flo...@un...> - 2003-01-22 22:05:28
|
Hi,

I gave Clustermatic a try today on a test cluster made of 8 nodes with dual Pentium 3 processors, ServerWorks chipset with integrated eepro100, plus one eepro on the master, and Myrinet 2000 boards and switch.

With either Fast Ethernet or Myrinet, the nodes hang on the boot.img fetch after they get an IP from RARP. I tried the mcastbcast hack, crossover cables instead of our Cisco switch, and booting over Myrinet (with different MACs): all the same, nothing happens, and tcpdump does not show anything after the RARP resolution.

I chose addresses in the 192.168.33 range in order not to interfere with the campus network here.

Did I forget something in the Red Hat 8.0 security settings? I set no firewall and started all services...

Please help.

Thanks in advance

--
Florent Calvayrac | Tel : 02 43 83 26 26
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UMR-CNRS 6087 | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences | 72085 Le Mans Cedex 9
|
From: <lo...@tr...> - 2003-01-22 23:05:31
|
I am copying below two messages which made it work, though I am not totally sure it's your problem.

Lothar

I just saw something like this on a cluster here and I fixed it. If you're having the same problem, the following patch should fix it for you. This only modifies the "beoserv" binary (which runs on the master and does things like serve boot images and RARP responses) so you won't have to change the slave boot images or anything.

Index: mcsend.c
===================================================================
RCS file: /users/hendriks/repository/beoboot/mcsend.c,v
retrieving revision 1.25
retrieving revision 1.26
diff -u -r1.25 -r1.26
--- mcsend.c 27 Aug 2002 16:25:13 -0000 1.25
+++ mcsend.c 17 Dec 2002 21:49:13 -0000 1.26
@@ -28,7 +28,7 @@
  * negligence or otherwise) arising in any way out of the use of this
  * software, even if advised of the possibility of such damage.
  *
- * $Id: mcsend.c,v 1.25 2002/08/27 16:25:13 hendriks Exp $
+ * $Id: mcsend.c,v 1.26 2002/12/17 21:49:13 hendriks Exp $
  *--------------------------------------------------------------------*/
 #include <sys/time.h>
 #include <sys/types.h>
@@ -1029,6 +1029,10 @@
             }
             break;
         case SND_TIME_WAIT:
+            if (ifc->sendok > 0 &&
+                !FD_ISSET(ifc->fd, wset) && sender_ready(s))
+                FD_SET(ifc->fd, wset);
+
             timeleft = SENDER_TIMEOUT - (now.tv_sec - s->lastuse);
             if (timeleft <= 0) {
                 sender_discard(s);

You'll have to rebuild beoboot. Grab the source RPM (included in Clustermatic 3) and apply this patch to it.

You'll have to do something like:

    rpm -i beoboot-....src.rpm
    rpmbuild -bp /usr/src/redhat/SPECS/beoboot.spec
    cd /usr/src/redhat/BUILD/beoboot-....
    patch -p1 < patchfile
    make beoserv

Then replace the beoserv in /usr/sbin with the one built there. See local Linux guru for more help on building stuff. :)

- Erik

> Erik A. Hendriks wrote:
>
> >On Mon, Dec 16, 2002 at 10:05:11AM -0800, lo...@tr... wrote:
> >
> >>Well that's how it goes. Looks to me as if the problem is
> >>on the master side....but no idea what.

Lothar
|
From: Florent C. <Flo...@un...> - 2003-01-24 17:50:50
|
Thank you very much, Lothar. I do not think your patch helped, because this is how I managed to make Clustermatic work:

Since we are building a diskless cluster at the same time, and I was destroying floppy disks in my attempts to make a working one, and moreover I got tired of waiting 2 minutes at each boot attempt, I decided to give PXELINUX a try.

After some problems (the switch brought up the Ethernet ports to the nodes more slowly than PXE/DHCP takes to give up, thanks to the very boot speed I was looking for; I had to solve it by setting spanning tree to fast and channel to off, as suggested by Cisco), I managed to load a phase 2 image directly by PXE.

So everything seems to work...

Thanks anyway

--
Florent Calvayrac | Tel : 02 43 83 26 26
Laboratoire de Physique de l'Etat Condense | Fax : 02 43 83 35 18
UMR-CNRS 6087 | http://www.univ-lemans.fr/~fcalvay
Universite du Maine-Faculte des Sciences | 72085 Le Mans Cedex 9
|