From: Thomas E. <eck...@gm...> - 2004-07-20 13:20:16
Hi Luke,

this thread seems to have slipped off the bproc-list -- most likely I
replied to the wrong message -- so here is a forward of my reply :(

I'm interested in the bproc3<->lam-7.0.x results: have you tried bproc3
with the latest stable lam (7.0.x) and found it did not work, or are you
focusing on bproc4 now anyway for other reasons (want to use 2.6-kernels,
...)?

Thomas

---------- Forwarded message ----------
Date: Thu, 24 Jun 2004 11:56:09 +0200 (MEST)
From: Thomas Eckert <eck...@gm...>
To: Luke Palmer <lo...@du...>
Subject: Re: [BProc] Re: More LAM / BProc changes

Luke,

On Wed, 23 Jun 2004, Luke Palmer wrote:

> Thanks for the email.  Did you build the kernel, or all of bproc as
> well, using the 2.6?

as the system evolved over time that's a tricky question.
bproc-4.0.0_pre5 (kmods, libs, clients, ...) was completely built against
2.6 for sure. from the logs of the package-manager I see that beoboot,
cmtools, beonss, ... were built earlier (Gentoo is a "from-source"
distribution), most likely against 2.4.22 with bproc-4.0.0_pre3.

> It's easy for me to upgrade to Fedora Core 2,
> which would have the 2.6 kernels, but next to impossible to back out.  I
> want to be good and sure before I do this...

the execve hook is not implemented in bproc4 yet -- so if you rely on it
(i.e. your batch system uses shell-scripts) you'll have a problem with
that.

you have tried LAM with bproc3 without luck, right? (at least according to
the lam v7.0.5-docs bproc>=3.2.5 _should_ work)
if time permits I'll do a quick test with bproc-3.2.6 ...

Cheers,
  Thomas

> On Mon, 2004-06-21 at 02:56, Thomas Eckert wrote:
> > Luke,
> >
> > I can only offer a kind of "me too" statement with a little addon:
> > bproc-4.0.0_pre{4,5} (I'm not sure for _pre3) did not build with
> > 2.4-kernels (result as you describe it)
> > BUT both do with 2.6-kernels (vanilla-kernels with only bproc-patch applied).
> > All testing was done with a fairly current Gentoo.
> >
> > My _guess_ would be that testing concentrates on 2.6-kernels at the moment as
> > this is the interesting new stuff (Erik?).
> >
> > For the LAM-side:
> > I've not tested LAM with bproc3 up to now but with the nightly snapshots of
> > LAM ("lam-7.1a1r9708.tar.gz" in this case) and the bproc-4.0.0_pre5-patch
> > suggested by Kevin Russell to clients/bproc.c a few days ago on bproc-users it
> > compiles and is capable of "mpirun"ning jobs on x86.
> >
> > Hope this helps a bit,
> >
> > Thomas
> >
> > On Sun, 20 Jun 2004, Luke wrote:
> >
> > > Daniel,
> > >
> > > Thanks for the reply.  I'm cc-ing the bproc list on this just in case
> > > anyone else has ideas.
> > >
> > > Unless you know of a download location that I don't, all the
> > > Clustermatic stuff is bproc4.0.0pre3.  I wasn't able to mix versions --
> > > how are you using pre4?  I don't have an immediate need to use a newer
> > > version of bproc, but I do have an immediate need for LAM.  I'm going to
> > > ask Brian if I can help with development.
> > >
> > > Here's a rundown of my problems.  At the very start of the pre4 build, I
> > > see:
> > >
> > > make -C vmadump vmadump.ko kver
> > > make[1]: Entering directory `/usr/src/bproc-4.0.0pre4/vmadump'
> > > gcc -D__KERNEL__
> > > -I/lib/modules/2.4.25-bproc/build//usr/src/linux-2.4.25-bproc/include
> > > -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing
> > > -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2
> > > -march=i686 -DMODULE -DPACKAGE_VERSION='"4.0.0pre4"' -I. -c
> > > vmadump_common.c
> > >
> > > The include line, which is automatically generated, is obviously wrong.
> > > If I correct it, the build makes sense a bit.  The client programs build
> > > fine.  When building vmadump and the kernel modules, I see errors like
> > > these, repeated many times.
> > > Except the second error, they all have to
> > > do with the variable "current", but I'm afraid I've been unable to
> > > figure out what it is (my knowledge of kernel-ish stuff is quite
> > > little).  There are many, many warnings as well, that I have left out.
> > >
> > > vmadump_common.c:679: error: structure has no member named `sighand'
> > > vmadump_common.c:681: error: too few arguments to function
> > > `recalc_sigpending'
> > > vmadump_common.c:707: error: structure has no member named `clear_child_tid'
> > > ghost.c:432: error: structure has no member named `utime'
> > > ghost.c:433: error: structure has no member named `stime'
> > > ghost.c:434: error: structure has no member named `cutime'
> > > ghost.c:435: error: structure has no member named `cstime'
> > >
> > > Any ideas?
> > >
> > > Thanks
> > > -Luke
> > >
> > > Daniel Gruner wrote:
> > >
> > > >Hi Luke,
> > > >
> > > >I am using the stuff from clustermatic.  At least the kernels.
> > > >Some of the other packages too, as far as I recall.
> > > >
> > > >Now, I am not working with Fedora on any of my clusters yet, and that
> > > >may be why you are seeing problems.  Can you tell me what the build
> > > >problems are?  I have built bproc on many different systems, such
> > > >as RH7.2 on alpha, RH7.3 on i386, RH9 on athlon, etc.
> > > >
> > > >Also, in my experience, Fedora 1 is missing a bunch of stuff, such as
> > > >libraries missing from packages, and whatever else.  Let me know more
> > > >details, and perhaps I can help.
> > > >
> > > >Now as for success with LAM...  that is another story.  Brian was still
> > > >working on it, but it seems that some functions in the BProc api are
> > > >either broken or not properly documented.  The note you quote below is old,
> > > >and has been revised since to "not working".
> > > >
> > > >Regards,
> > > >Daniel
> > > >
> > > >On Sun, Jun 20, 2004 at 01:36:17AM -0500, Luke Palmer wrote:
> > > >
> > > >>Hey Daniel,
> > > >>
> > > >>I have a couple of questions for you.  I cannot replicate your success
> > > >>with LAM on bproc.  I am hoping the difference is that I am using
> > > >>bproc4.0.0pre3 (from clustermatic).
> > > >>
> > > >>You say you are using bproc4.0.0pre4.  I was wondering if there is any
> > > >>trick to getting it to build?  I am on Fedora Core 1, and the build is
> > > >>quite badly broken in my environment.
> > > >>
> > > >>Please let me know if you can help!
> > > >>
> > > >>Thanks
> > > >>-Luke
> > > >>
> > > >>On Sun, 2004-06-13 at 18:42, Daniel Gruner wrote:
> > > >>
> > > >>>Brian,
> > > >>>
> > > >>>Success!!!  (well, at least apparently).  The thing lamboots fine, starts
> > > >>>the lamd on all the nodes, and seems to run mpi jobs, so until further
> > > >>>testing it looks like we are in business!
> > > >>>
> > > >>>I agree with you about BProc's lack of documentation, but I still like
> > > >>>the system.  It is the only cluster software that makes the cluster
> > > >>>seem like a single system image.  I have been running different versions
> > > >>>of it for over 2 years, and I insist on it for all my clusters.  It may
> > > >>>be time to hack on the BProc guys for more documentation (Erik...).
> > > >>>
> > > >>>Anyway, thanks for your work on LAM, and let me know if I can be of more
> > > >>>help.
> > > >>>
> > > >>>Regards,
> > > >>>Daniel
> > > >>>
> > > >>>On Sun, Jun 13, 2004 at 04:01:11PM -0700, Brian Barrett wrote:
> > > >>>
> > > >>>>On Jun 13, 2004, at 3:32 PM, Daniel Gruner wrote:
> > > >>>>
> > > >>>>>Ok, here it goes again.  The master node is still somehow screwed up,
> > > >>>>>according to lamboot...
> > > >>>>>Here is the output:
> > > >>>>>
> > > >>>><snip>
> > > >>>>
> > > >>>>>n-1<8638> ssi:boot:bproc: n-1 nodestatus failed (-1)
> > > >>>>>n-1<8638> ssi:boot:bproc: n-1 node status down, failure
> > > >>>>>n-1<8638> ssi:boot:bproc: n0 node status: up
> > > >>>>>
> > > >>>>Well, on the good side, we detect the master node correctly now :).
> > > >>>>
> > > >>>>You know, life would be so much better if BProc documented things like
> > > >>>>that.  I added some code to make sure that NODE_MASTER is never used
> > > >>>>for the parameter to bproc_nodestatus or the like.  Hopefully, that
> > > >>>>will make life better all around.  If you could svn up and let me know
> > > >>>>how it goes, I'd appreciate it.  Only file changed is
> > > >>>>share/ssi/boot/bproc/src/ssi_boot_bproc.c, so you should be able to svn
> > > >>>>up and run make (without the autogen or configure stuff).
> > > >>>>
> > > >>>>Thanks!
> > > >>>>
> > > >>>>Brian
> > > >>>>
> > > >>>>--
> > > >>>>  Brian Barrett
> > > >>>>  LAM/MPI developer and all around nice guy
> > > >>>>  Have a LAM/MPI day: http://www.lam-mpi.org/

--
Sometimes I think the surest sign that intelligent life exists elsewhere
in the universe is that none of it has tried to contact us. -- Calvin
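
For readers following the oldest message in the thread: Brian's fix there
(never passing NODE_MASTER to bproc_nodestatus) can be sketched roughly as
below. This is only an illustration of the guard logic in Python, not the
code in ssi_boot_bproc.c. The master id of -1 is inferred from the "n-1"
lines in the lamboot output; collect_node_status, query_status and
fake_status are hypothetical names, and treating the master as "up" without
querying it is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable

# Assumption: the master (front-end) node has id -1, as the "n-1" lines in
# the quoted lamboot output suggest. The constant here is illustrative only.
NODE_MASTER = -1


def collect_node_status(
    node_ids: Iterable[int],
    query_status: Callable[[int], str],
) -> Dict[int, str]:
    """Return {node_id: status} without ever querying the master node.

    `query_status` is a hypothetical stand-in for a real status call such
    as bproc_nodestatus; it is not the libbproc API.
    """
    statuses: Dict[int, str] = {}
    for node in node_ids:
        if node == NODE_MASTER:
            # Assumption of this sketch: the master is the node running the
            # boot code, so record it as up instead of asking about it.
            statuses[node] = "up"
            continue
        statuses[node] = query_status(node)
    return statuses


def fake_status(node: int) -> str:
    """Toy status source that fails for negative ids, like the quoted log."""
    if node < 0:
        raise ValueError(f"nodestatus failed for n{node}")
    return "up" if node in (0, 1) else "down"


if __name__ == "__main__":
    print(collect_node_status([NODE_MASTER, 0, 1, 2], fake_status))
```

Without the guard, the toy fake_status would raise for the master id, which
mirrors the "n-1 nodestatus failed (-1)" line quoted above.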