From: Vipul D. <vip...@ya...> - 2004-08-13 19:24:32
|
Hi folks,

I went ahead and installed ClusterMatic 4 over a clean RedHat 9.0 on the master of my 5-node Intel P4-based cluster. The beoboot part works fine to bring the slave nodes up and running connected to port 2223 of master, but the node_up script seems to fail at the end, so node gets marked as "error" state. What could the reasons be? Any help will be appreciated.

ON MASTER (node_up script fails):

    $# tail -15 /var/log/clustermatic/node.0
    vmadlib : loaded /lib/ld-2.3.2.so (size=103044;id=0,0;mode=100755)
    vmadlib : loaded /lib/libc-2.3.2.so (size=1549556;id=0,0;mode=100755)
    vmadlib : loaded /lib/librt-2.3.2.so (size=37552;id=0,0;mode=100755)
    vmadlib : loaded /lib/libpthread-0.10.so (size=103104;id=0,0;mode=100755)
    vmadlib : loaded /lib/libm-2.3.2.so (size=211876;id=0,0;mode=100755)
    vmadlib : loaded /lib/libnss_bproc.so.2 (size=25043;id=0,0;mode=100755)
    vmadlib : loaded /usr/lib/libbproc.so.4.0.0 (size=21388;id=0,0;mode=100755)
    nodeup : Plugin vmadlib returned status 0 (ok)
    nodeup : No premove function for nodeinfo
    nodeup : No premove function for kmod
    nodeup : Starting 1 child processes.
    nodeup : Finished creating child processes.
    nodeup : I/O error talking to child
    nodeup : Child process for node 0 died with signal 4
    nodeup : Node setup returned status 1

ON SLAVE (things look fine):

    boot: Server IP Address : 10.0.0.1
    boot: My IP Address : 10.0.0.100
    boot: starting bpslave : bpslave -d -i -v 10.0.0.1 2223
    bpslave: connecting to 10.0.0.1:2223
    bpslave: IO Daemon started; pid 15
    bpslave connection to 10.0.0.1:2223 up and running
    bpslave: Setting node number to 0

Now, if I force the master to mark status of slave node to be "up", bpsh fails like this. I tried "vmadlib -l" and it does show /lib/ld-2.3.2.so, but execing slave cannot seem to find it. I also looked at the strace of (strace -f) bpmaster, and one odd thing is it tries to close a whole host of socket descriptors (4096 instances) that are not open. Please - any help will be appreciated.

ON MASTER:

    $# bpctl -S 0 -s up
    $# bpstat
    Node(s)  Status  Mode        User  Group
    1-3      down    ----------  root  root
    0        up      ---x------  root  root
    $# bpsh 0 sleep 1
    0: No such file or directory

ON SLAVE:

    vmadump: mmap failed: /lib/ld-2.3.2-so

Thanks.
Vipul
|
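One detail in the nodeup log above can be decoded mechanically: the child "died with signal 4". The sketch below only translates the number into a name using the shell's built-in `kill -l`. On Linux, signal 4 is SIGILL (illegal instruction). What caused it on this cluster is not known from the log; a binary or library built for a different CPU than the node has is one common reason, but that is an assumption, not a diagnosis.

```shell
# Translate the signal number reported by nodeup into a signal name.
sig=4
name=$(kill -l "$sig")
echo "signal $sig is SIG$name"
```

The same one-liner works for any other number that shows up in a "died with signal N" message.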
From: Michal J. <mi...@ha...> - 2004-08-12 23:44:04
|
I think that it speaks for itself. :-)

--- beoboot-cm1.9/node_up/nodeinfo.c~ 2003-11-05 11:52:07.000000000 -0700
+++ beoboot-cm1.9/node_up/nodeinfo.c 2004-08-12 15:34:57.354997882 -0600
@@ -91,7 +91,7 @@ int nodeup_postmove(int argc, char *argv
     {"cpus active : %Ld", &values[0], 1},
     {"cycle frequency [Hz] : %Ld", &values[1], 1},
 #endif
-#if defined(__i386__)
+#if defined(__i386__) || defined(__x86_64__)
     {"cpu MHz : %Ld", &values[1], 1000000},
     {"processor\t:", &values[0], 0},
 #endif

Not much bad will happen without it. If it is missing, you will simply see something like cpus=1; hz=0; mem=0 in the clustermatic log files on x86_64. I am afraid I do not know what formats should be used for Sparc and PowerPC.

Michal
|
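To make the effect of the two table entries concrete, here is a rough shell sketch of the same idea: count the "processor" lines and scale the first "cpu MHz" value by 1000000 to get Hz. The function name `cpu_summary` is made up for illustration, and this is not what nodeinfo.c does byte for byte (the C code scans an integer with `%Ld`, while awk keeps the fractional MHz). The reading that the `processor` entry counts CPUs is also an assumption based on the patch context.

```shell
# Hypothetical helper: read /proc/cpuinfo-style text on stdin and print
# "cpus=<n> hz=<n>", roughly the two values the patched table fills in.
cpu_summary() {
    awk -F: '
        /^processor/ { cpus++ }                 # one "processor" line per CPU
        /^cpu MHz/ && !hz { hz = $2 * 1000000 } # first "cpu MHz" value, in Hz
        END { printf "cpus=%d hz=%.0f\n", cpus, hz }
    '
}

# Sample input in the x86 /proc/cpuinfo layout (values are made up).
printf 'processor\t: 0\ncpu MHz\t\t: 2400.000\nprocessor\t: 1\ncpu MHz\t\t: 2400.000\n' | cpu_summary
```

On a real node the input would come from `/proc/cpuinfo`. On architectures whose cpuinfo uses other field names, the sketch prints zeros, which matches the symptom described above.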
From: YhLu <Yh...@ty...> - 2004-08-12 19:52:59
|
BProc doesn't need LinuxBIOS. Beoboot will produce a kernel and an initrd, and you can use those for PXE.

As for LinuxBIOS, you may use Etherboot in your ROM; then you need mkelfimage. At the same time, you can still use Etherboot to produce a .zpxe as your PXE load, so:

    Normal BIOS + PXE + tg3.pxe (Etherboot) + elf (from mkelfimage: kernel + initrd)
  = LinuxBIOS (tg3.zelf = Etherboot in ROM) + elf (kernel + initrd)

Regards
YH

_____
From: Vipul Deokar [mailto:vip...@ya...]
Sent: Thursday, August 12, 2004 11:30 AM
To: bpr...@li...
Subject: [BProc] Newbie questions

Hi folks,

I am trying to build a small 5-node cluster using RedHat 9.x base installation, and the bproc and beoboot available at sourceforge. I was able to successfully build and run the bproc-patched kernel and bproc (bpmaster, bpstat, etc.). However, I have problems building the beoboot. I have included what I have done below (section E); and included the steps (step 28, steps 22, 27) I am having trouble with. Can any of you please help either by telling me where I am screwing up or pointing out a good doc (I have already looked at the doc at bproc home page; and a Bproc reference manual by Eric)? Any help will be appreciated. Thanks.

Another option I am considering is using ClusterMatic 4.0, but it seems to require LinuxBIOS to boot. Is LinuxBIOS REQUIRED on all (compute) nodes for Clustermatic to work? Can I still use PXE/beoboot (with beoboot -1 image on floppy on slave nodes)?

Thanks.
Vipul

DETAILED ACTIONS TAKEN TO BUILD BPROC UTILITIES FROM SOURCEFORGE:

<snip>
|
From: Vipul D. <vip...@ya...> - 2004-08-12 18:30:49
|
Hi folks,

I am trying to build a small 5-node cluster using RedHat 9.x base installation, and the bproc and beoboot available at sourceforge. I was able to successfully build and run the bproc-patched kernel and bproc (bpmaster, bpstat, etc.). However, I have problems building the beoboot. I have included what I have done below (section E); and included the steps (step 28, steps 22, 27) I am having trouble with. Can any of you please help either by telling me where I am screwing up or pointing out a good doc (I have already looked at the doc at bproc home page; and a Bproc reference manual by Eric)? Any help will be appreciated. Thanks.

Another option I am considering is using ClusterMatic 4.0, but it seems to require LinuxBIOS to boot. Is LinuxBIOS REQUIRED on all (compute) nodes for Clustermatic to work? Can I still use PXE/beoboot (with beoboot -1 image on floppy on slave nodes)?

Thanks.
Vipul

DETAILED ACTIONS TAKEN TO BUILD BPROC UTILITIES FROM SOURCEFORGE:

A. DOWNLOAD KERNEL AND BPROC PATCH AND BPROC SOURCES.
1. Download linux-2.4.21.tar.gz from www.kernel.org to /home/vdeokar/kernel_sources
2. Download bproc-3.2.6.tar.gz from sourceforge.net to /home/vdeokar/kernel_sources (BPROC sources and patch to kernel)

B. PATCH, CONFIGURE & BUILD KERNEL
3. Apply bproc patch to kernel sources unzipped at /home/vdeokar/kernel_sources/linux-2.4.21 (gzip -dc ../bproc-3.2.6.tar.gz | patch -p1). Failed in the end with the following msg, but that's the last line of the patch file, and it really seems to have succeeded.
   "patching file arch/ppc/kernel/misc.S
   patch unexpectedly ends in middle of line"
4. Copy /boot/config-2.4.20-31.9 (existing kernel config file) to .config in linux-2.4.21
5. make oldconfig (set CONFIG_BPROC=y)
6. Build kernel (make dep; make bzImage)
7. SKIP - Build Bproc (as in BProc manual with LINUX=/home/vdeokar/kernel_sources/linux-2.4.21)
8. make modules
9. Become root
10. make modules_install
11. cp arch/i386/boot/bzImage /boot/vmlinuz-2.4.21-bproc3.2.6 and System.map and config file.
12. mkinitrd /boot/initrd-2.4.21.bproc326.img 2.4.21
13. Edit GRUB file.
14. Reboot with new kernel and login as root

C. BUILD AND INSTALL BPROC
15. Do step 7 (build bproc here). make LINUX=/home/vdeokar/kernel_sources/linux-2.4.21
16. Install (per PDF doc by Eric) from bproc dir. make LINUX=/home/vdeokar/kernel_sources/linux-2.4.21 install.
17. Reboot and login as root
18. modprobe bproc (Install BProc)
19. Run bpmaster (check /etc/beowulf/config). Ready for global process sharing.
20. Before bpstat I needed to create symlink (ln -s libbproc.so.2.5.0 libbproc.so.2) in /usr/lib

D. DOWNLOAD, BUILD, INSTALL CMTOOLS
21. Download cmtools1.2 from sourceforge bproc.
??? 22. Should I patch kernel? I did not since I do not have package mpi or mpich installed....looked like all MPI library patches; and patch --dry-run failed many HUNKS.
23. Build cmtools; had to introduce $CC environment variable to gcc. make LINUX=/home/vdeokar/kernel_sources/linux-2.4.21
24. (Install cmtools) make LINUX=/home/vdeokar/kernel_sources/linux-2.4.21 install
25. Run ldconfig

E. DOWNLOAD, BUILD, INSTALL BEOBOOT
26. Download beoboot-cm1.5 from sourceforge bproc.
??? 27. Did not patch as most seemed to be for 2.4.17 kernels.
??? 28. Tried to build but needs libmodutils.a, libmodutilobj.a, libmodutilutil.a, libz.a which are not on this machine. Do I need to get a patch from somewhere else?
28a. Modutils-2.4.28 rpm already exists (installed) on the systems. (No need to download, make and make install modutils from www.kernel.org)
STUCK HERE. TRIED FOLLOWING...
28b. Need to do this before 28c - a force install of modutils: Download from ftp.gnu.org bison, flex. configure, make and make install.
28c. Download modutils-2.4.21.tar.gz.
28d. If I change the Makefiles and proceed to force-build and install beoboot, beoboot fails later when I am trying to create -1 and/or -2 images with undefined references to insmod_main, rmmod_main, etc.: I modified the 2 Makefiles to make it build to remove linkage from these modutil* libraries. (make LINUX=/home/vdeokar/kernel_sources/linux-2.4.21)
29. Install (make LINUX=/home/vdeokar/kernel_sources/linux-2.4.21 install 2>&1 | tee makebeobootinstall.log)

F. START USING
30. Reboot and login as root.
31. Run "service beowulf start"
|
From: <er...@he...> - 2004-08-12 17:14:29
|
On Wed, Aug 11, 2004 at 04:51:17PM -0600, Michal Jaegermann wrote:
> I am forgetting to ask. When a node comes up with bproc-4.0.0pre6
> then the following is printed:
>
> bproc: WARNING: ..../kernel/slave.c setsid in
>
> (with a full path to a location when that happened to be compiled).
>
> As far as I can tell this does not really carry any information on
> which I can act and there is also no way to do something which
> would shut that off. It looks to me like some development leftover.
> Right?

Right. It's a testing message that I forgot to remove. It's just one of those messages that lets me know that I'm actually running the version that I think I'm running. It should be ignored.

- Erik
|
From: Michal J. <mi...@ha...> - 2004-08-11 22:51:31
|
I am forgetting to ask. When a node comes up with bproc-4.0.0pre6 then the following is printed:

    bproc: WARNING: ..../kernel/slave.c setsid in

(with a full path to the location where that happened to be compiled).

As far as I can tell this does not really carry any information on which I can act, and there is also no way to shut it off. It looks to me like some development leftover. Right?

Michal
|
From: Michal J. <mi...@ha...> - 2004-08-11 22:42:50
|
On Wed, Aug 11, 2004 at 08:08:05AM -0600, er...@he... wrote:
> On Tue, Aug 10, 2004 at 03:26:37PM -0600, Michal Jaegermann wrote:
....
> >
> > which could be simply rewritten as
....
> > making all this recursion unnecessary (and skirting the bug in the
> > process).
>
> Sounds good to me. I'll take the simple road.

In that case, after some thought, I think that the following will be safe and will work not only with the current kernel:

--- ./beoboot.orig 2004-07-15 22:49:45.000000000 -0600
+++ ./beoboot 2004-08-11 16:21:46.015934024 -0600
@@ -169,15 +169,8 @@
 # module.
 #
 module_dep() {
-    local MODDIR="$1"
-    local MOD="$2"
-    local DEPS=
-    local dep
-    DEPS=`module_expand $MODDIR \`$BEOLIB/bin/modhelper -d "$MOD"\``
-    for dep in $DEPS; do
-        echo "$dep"
-        module_dep "$MODDIR" "$dep"
-    done
+    depmod --quick $KVER
+    sed -n 's!'"$2"': *!!p' "$1/modules.dep"
 }

> I'm glad they got rid of those @#!$ stupid newline escapes in
> modules.dep.

Oh, in such a situation

module_dep () {
    local line
    depmod --quick $KVER
    while read line ; do
        case $line in
            "$2: "*) echo "${line#*: }" ; break ;;
        esac
    done < "$1/modules.dep"
}

will do just fine, as 'read' above will handle escapes by itself. And not even sed is required. :-) Maybe this is even better in some sense than that version with sed?

Well, yes, there is a quiet assumption here that modules are already installed. I do not think that this is too much to ask in practice.

Michal
|
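The claim that `read` copes with the old backslash-newline continuations in modules.dep can be checked in isolation. The sketch below is the read-based lookup with the `depmod` call left out so that it runs against any file in modules.dep format. The name `moddep_lookup` and the sample paths are made up for illustration.

```shell
# Sketch of the read-based lookup, without the depmod call so it can run
# against any file in modules.dep format. "moddep_lookup" is a made-up name.
moddep_lookup() {
    # $1 = directory holding modules.dep, $2 = module path to look up
    local line
    while read line; do          # no -r: backslash-newline joins lines
        case $line in
            "$2: "*) echo "${line#*: }"; break ;;
        esac
    done < "$1/modules.dep"
}

# Build a throwaway modules.dep with an old-style continued line.
tmp=$(mktemp -d)
printf '%s\n' \
    '/lib/modules/x/a.ko: /lib/modules/x/b.ko \' \
    '    /lib/modules/x/c.ko' \
    '/lib/modules/x/b.ko:' > "$tmp/modules.dep"

# Prints both dependencies of a.ko on a single line.
moddep_lookup "$tmp" /lib/modules/x/a.ko
rm -rf "$tmp"
```

One difference from the sed version is worth noting: the `case` pattern compares the quoted module path literally and only at the start of the line, whereas the sed expression treats `$2` as a regular expression and is not anchored.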
From: <er...@he...> - 2004-08-11 15:46:15
|
On Tue, Aug 10, 2004 at 03:26:37PM -0600, Michal Jaegermann wrote:
>
> There is a bug in a function show_deps(), modhelper.c, from
> beoboot-cm1.9. The code goes like that:
>
>     while (p && *p) {
>         np = strchr(p, ',');
>         printf("%.*s", np-p, p)
>
> If there is no comma in a string p then np is set to NULL and
> np-p may overflow the second argument to printf which happens to
> be int. When this happens, which is pretty likely on 64-bit platforms,
> then printf errors out, returns a negative value and nothing is printed.
> If you wonder I got hit by that. :-) On x86 you may usually get
> away with this.
>
> Here is a fix:
>
> --- beoboot-cm1.9/modhelper.c~ 2004-04-19 11:07:31.000000000 -0600
> +++ beoboot-cm1.9/modhelper.c 2004-08-10 15:00:13.377034440 -0600
> @@ -236,11 +236,13 @@
>      p = deps;
>      while (p && *p) {
>          np = strchr(p, ',');
> -        printf("%.*s", np-p, p);
> -        if (np) {
> -            printf(" ");
> +        if (np != NULL) {
> +            printf("%.*s ", (int)(np-p), p);
>              np++;
>          }
> +        else {
> +            printf("%s", p);
> +        }
>          p = np;
>      }
>      printf("\n");
>
> OTOH 'modhelper -d' is used in /usr/bin/beoboot script only in
> this function:
>
> module_dep() {
>     local MODDIR="$1"
>     local MOD="$2"
>     local DEPS=
>     local dep
>     DEPS=`module_expand $MODDIR \`$BEOLIB/bin/modhelper -d "$MOD"\``
>     for dep in $DEPS; do
>         echo "$dep"
>         module_dep "$MODDIR" "$dep"
>     done
> }
>
> which could be simply rewritten as
>
> module_dep() {
>     [ -r "$1/modules.dep" ] || depmod
>     sed -n 's!'"$2"': *!!p' "$1/modules.dep"
> }
>
> making all this recursion unnecessary (and skirting the bug in the
> process).

Sounds good to me. I'll take the simple road.

I think modhelper started life as a prototype for the module loading code in the boot program. I'm glad they got rid of those @#!$ stupid newline escapes in modules.dep.

- Erik
|
From: Michal J. <mi...@ha...> - 2004-08-11 05:02:16
|
On Tue, Aug 10, 2004 at 08:43:28AM -0600, er...@he... wrote:
>
> You're probably seeing an issue with vmadump and its page table
> walk. I have this fixed and I hope to get it released today.

The first test was indeed successful. We will see if this record will continue. :-)

Thanks a bunch,
Michal

p.s. Erik did not announce that explicitly on this list but a new version of bproc can be had from http://bproc.sourceforge.net/
|
From: Michal J. <mi...@ha...> - 2004-08-10 21:26:50
|
There is a bug in a function show_deps(), modhelper.c, from beoboot-cm1.9. The code goes like that:

    while (p && *p) {
        np = strchr(p, ',');
        printf("%.*s", np-p, p);

If there is no comma in a string p then np is set to NULL and np-p may overflow the second argument to printf, which happens to be int. When this happens, which is pretty likely on 64-bit platforms, then printf errors out, returns a negative value and nothing is printed. If you wonder, I got hit by that. :-) On x86 you may usually get away with this.

Here is a fix:

--- beoboot-cm1.9/modhelper.c~ 2004-04-19 11:07:31.000000000 -0600
+++ beoboot-cm1.9/modhelper.c 2004-08-10 15:00:13.377034440 -0600
@@ -236,11 +236,13 @@
     p = deps;
     while (p && *p) {
         np = strchr(p, ',');
-        printf("%.*s", np-p, p);
-        if (np) {
-            printf(" ");
+        if (np != NULL) {
+            printf("%.*s ", (int)(np-p), p);
             np++;
         }
+        else {
+            printf("%s", p);
+        }
         p = np;
     }
     printf("\n");

OTOH 'modhelper -d' is used in the /usr/bin/beoboot script only in this function:

module_dep() {
    local MODDIR="$1"
    local MOD="$2"
    local DEPS=
    local dep
    DEPS=`module_expand $MODDIR \`$BEOLIB/bin/modhelper -d "$MOD"\``
    for dep in $DEPS; do
        echo "$dep"
        module_dep "$MODDIR" "$dep"
    done
}

which could be simply rewritten as

module_dep() {
    [ -r "$1/modules.dep" ] || depmod
    sed -n 's!'"$2"': *!!p' "$1/modules.dep"
}

making all this recursion unnecessary (and skirting the bug in the process).

Michal
|
From: <er...@he...> - 2004-08-10 16:21:26
|
On Fri, Aug 06, 2004 at 06:41:01PM -0600, Michal Jaegermann wrote:
> I am trying to get bproc going with kernels later than 2.6.6
> but I run into a trouble which so far does not want to budge.
> To reduce that to "simplest" terms I have now kernel.org
> 2.6.6 and 2.6.7 kernels for i686 SMP and I used a kernel patch
> from bproc-4.0.0pre5; with minor adjustments in the second
> case. Configuration for both kernels is practically the same.
>
> For both I can start the service, boot my test node and
> bpstat and bpctl work and behave in an expected manner. So far
> so good. But if I will do something like
>
> bpsh 1 ls
>
> then with 2.6.6 kernel I am getting an expected listing but
> with 2.6.7:
>
> bpsh: Child process on node 1 exited abnormally
>
> The same with any other process on a node I am trying to run
> via bpsh. As for now my efforts to get to the bottom of this
> matter were not very successful. Anybody with ideas what I am
> missing here?

You're probably seeing an issue with vmadump and its page table walk. I have this fixed and I hope to get it released today.

[ The lab is finally back to a point where I can work on this stuff so I'm trying to get through all the email and patches in the queue. ]

- Erik
|
From: <er...@he...> - 2004-08-10 16:15:20
|
On Sat, Jul 24, 2004 at 04:46:10PM -0500, Luke Palmer wrote:
> Hi all,
>
> I find myself continually adding to my bproc libraries list, which is by
> now huge. I was considering just making all my libraries available to
> nodes via an NFS mount. Can anyone think of a reason why that would be
> bad? The stuff we do here is very long running, so the increased
> library loading time would be negligible.

NFS isn't unreasonable if your cluster is small enough. I don't try to avoid it as much as I avoid depending on it.

Process startup (especially MPI) creates a situation where potentially every node in the system will hit the NFS server at the same time. This can go badly in very large configurations. We've also seen coherency problems with network file systems (NFS and others) in very large systems.

On the other hand, if you're not seeing any problems like that then I can't see any reason not to use it.

- Erik
|
From: Michal J. <mi...@ha...> - 2004-08-07 00:41:16
|
I am trying to get bproc going with kernels later than 2.6.6 but I have run into trouble which so far does not want to budge. To reduce that to "simplest" terms, I now have kernel.org 2.6.6 and 2.6.7 kernels for i686 SMP and I used a kernel patch from bproc-4.0.0pre5, with minor adjustments in the second case. Configuration for both kernels is practically the same.

For both I can start the service and boot my test node, and bpstat and bpctl work and behave in an expected manner. So far so good. But if I do something like

    bpsh 1 ls

then with the 2.6.6 kernel I get the expected listing but with 2.6.7:

    bpsh: Child process on node 1 exited abnormally

The same happens with any other process I try to run on a node via bpsh. So far my efforts to get to the bottom of this matter have not been very successful. Anybody with ideas what I am missing here?

TIA,
Michal
|
From: Luke P. <lop...@wi...> - 2004-07-29 19:22:27
|
Michal,

You were dead on. I just added that export after the function in ldt.c, and all my errors went away.

Erik, want to add that to the bproc patch?

-Luke

On Wed, 2004-07-28 at 19:30, Michal Jaegermann wrote:
> On Wed, Jul 28, 2004 at 05:37:13PM -0500, Luke Palmer wrote:
> >
> > The tail of dmesg is this:
> >
> > bproc: Unknown symbol do_vmadump
> > bproc: Unknown symbol load_LDT_nolock
> > bproc: Unknown symbol vmadump_thaw_proc
> > bproc: Unknown symbol vmadump_freeze_proc
> > vmadump: 4.0.0pre5 Erik Hendriks <er...@he...>
> > bproc: Unknown symbol load_LDT_nolock
>
> Various "vmadump" symbols are supplied by vmadump.ko module
> so if that one will not load for some reasons then you will
> have these unresolved.
>
> 'load_LDT_nolock' is hiding in arch/i386/kernel/ldt.c (I think,
> although what I am hacking right now is patched quite a bit but you
> always have 'grep') and you are getting that as a result of an
> expansion of 'activate_mm' macro in which is used in kernel/ghost.c.
> Apparently in earlier kernel versions this was done in some other
> way and that symbol was not required. Just add
>
> EXPORT_SYMBOL(load_LDT_nolock);
>
> in ldt.c and be done with it. Similar with other things you
> need exported. Kernel developers are trying to avoid exporting
> what is not necessary but if you need it then you need it.
>
> Michal
|
From: Michal J. <mi...@ha...> - 2004-07-29 00:30:36
|
On Wed, Jul 28, 2004 at 05:37:13PM -0500, Luke Palmer wrote:
>
> The tail of dmesg is this:
>
> bproc: Unknown symbol do_vmadump
> bproc: Unknown symbol load_LDT_nolock
> bproc: Unknown symbol vmadump_thaw_proc
> bproc: Unknown symbol vmadump_freeze_proc
> vmadump: 4.0.0pre5 Erik Hendriks <er...@he...>
> bproc: Unknown symbol load_LDT_nolock

The various "vmadump" symbols are supplied by the vmadump.ko module, so if that one does not load for some reason then you will have these unresolved.

'load_LDT_nolock' is hiding in arch/i386/kernel/ldt.c (I think, although what I am hacking right now is patched quite a bit, but you always have 'grep') and you are getting that as a result of an expansion of the 'activate_mm' macro which is used in kernel/ghost.c. Apparently in earlier kernel versions this was done in some other way and that symbol was not required. Just add

    EXPORT_SYMBOL(load_LDT_nolock);

in ldt.c and be done with it. The same goes for other things you need exported. Kernel developers are trying to avoid exporting what is not necessary, but if you need it then you need it.

Michal
|
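When several modules fail this way, it helps to reduce the kernel log to a short list of what is actually missing before reaching for grep. The following is a small sketch, not part of bproc: `unknown_symbols` is a made-up helper that assumes log lines in the `module: Unknown symbol name` form shown in this thread.

```shell
# Hypothetical helper: read kernel log text on stdin and list each
# unresolved symbol once, along with the module that wanted it.
unknown_symbols() {
    sed -n 's/^\(.*\): Unknown symbol \(.*\)$/\1 \2/p' | sort -u
}

# Sample input copied from the dmesg output quoted in this thread.
unknown_symbols <<'EOF'
bproc: Unknown symbol do_vmadump
bproc: Unknown symbol load_LDT_nolock
bproc: Unknown symbol vmadump_thaw_proc
bproc: Unknown symbol vmadump_freeze_proc
vmadump: 4.0.0pre5 Erik Hendriks
bproc: Unknown symbol load_LDT_nolock
EOF
```

On a live system the input would be `dmesg | unknown_symbols`. Each reported name can then be searched for in the kernel tree to find the file where an `EXPORT_SYMBOL` line would have to go.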
From: Luke P. <lop...@wi...> - 2004-07-28 22:37:24
|
Hi everyone,

I'm having trouble with the most recent version of bproc, using a kernel.org 2.6.6 kernel. Everything compiles and installs fine, but:

    # modprobe bproc
    FATAL: Error inserting bproc (/lib/modules/2.6.6-4g4g/extra/bproc.ko): Unknown symbol in module, or unknown parameter (see dmesg)

The tail of dmesg is this:

    bproc: Unknown symbol do_vmadump
    bproc: Unknown symbol load_LDT_nolock
    bproc: Unknown symbol vmadump_thaw_proc
    bproc: Unknown symbol vmadump_freeze_proc
    vmadump: 4.0.0pre5 Erik Hendriks <er...@he...>
    bproc: Unknown symbol load_LDT_nolock

Any ideas?

-Luke
|
From: YhLu <Yh...@ty...> - 2004-07-26 19:00:18
|
Where can I find information about installing bproc with Myrinet?

Regards
YH
|
From: Kimitoshi T. <kt...@cl...> - 2004-07-26 10:14:00
|
Hello all,

I'm trying to build the bproc tools on a newly installed Fedora Core 2 box. So far, I have had luck compiling the following software:

    bproc-4.0.0pre5
    beonss-cm1.1
    bjs-1.5
    cmtool-1.4
    mpich-1.2.5.2 (bproc-patched)
    linux-2.6.6 (bproc-patched)

However, I got stuck building beoboot-cm1.9. When I type make in the source directory, it stops with the following messages:

    $ make
    make -C monte LINUX=/lib/modules/2.6.6-bproc4.0.0pre5_01/build EXTRAKDEFS="" kmonte.ko
    make[1]: Entering directory `/home/ktaka/ezCluster-1.0.0/SRC/beoboot-cm1.9/monte'
    gcc -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=pentium4 -I/lib/modules/2.6.6-bproc4.0.0pre5_01/build/include/asm-i386/mach-default -O2 -fomit-frame-pointer -g -Wdeclaration-after-statement -DMODULE -DPACKAGE_VERSION='"cm1.9"' -I. -c -o kmonte.o kmonte.c
    kmonte.c:66:27: linux/pagemap.h: No such file or directory
    kmonte.c:67:24: linux/slab.h: No such file or directory
    kmonte.c:70:25: asm/uaccess.h: No such file or directory
    kmonte.c:72:25: asm/pgtable.h: No such file or directory
    <big snip>

There are no linux/pagemap.h, linux/slab.h, asm/uaccess.h or asm/pgtable.h files in /usr/include on my FC2 box. Those files exist on other boxes such as RH9 and RH8, but not on the new FC2 box. I'm wondering if anybody has had the same situation. I also want to know where to get those headers from. Simply copying them from the kernel source didn't get me through the compilation.

Thank you

Kimitoshi Takahashi
Cluster Computing Inc. Japan
|
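The compile line above passes only the `mach-default` directory of the kernel build tree with `-I`, so the compiler falls back to /usr/include for `linux/*.h` and `asm/*.h`. Kernel-internal headers normally come from the kernel build tree itself. A quick way to see whether a given tree could supply them is sketched below. The function name `check_kernel_headers` is made up, and the suggestion that the tree's `include` directory is what the build needs is an assumption, not something confirmed in this thread.

```shell
# Hypothetical check: does the kernel tree given as $1 provide the headers
# that the kmonte.c build complained about?
check_kernel_headers() {
    local tree="$1" missing=0 h
    for h in linux/pagemap.h linux/slab.h asm/uaccess.h asm/pgtable.h; do
        if [ ! -e "$tree/include/$h" ]; then
            echo "missing: $tree/include/$h"
            missing=1
        fi
    done
    return $missing
}

# Example: point it at the build tree of the running kernel.
check_kernel_headers "/lib/modules/$(uname -r)/build" || echo "headers incomplete"
```

In a 2.6 source tree the `include/asm` entry is a symlink created during kernel configuration, so a missing `asm/...` header in an otherwise complete tree may simply mean the tree has not been configured and prepared yet.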
From: Daniel G. <dg...@ti...> - 2004-07-25 04:25:22
|
On Sat, Jul 24, 2004 at 04:46:10PM -0500, Luke Palmer wrote:
> Hi all,
>
> I find myself continually adding to my bproc libraries list, which is by
> now huge. I was considering just making all my libraries available to
> nodes via an NFS mount. Can anyone think of a reason why that would be
> bad? The stuff we do here is very long running, so the increased
> library loading time would be negligible.

Hi Luke,

This is a common problem for us too. My usual suggestion is to statically link the executables, which reduces the need for all the library loading specifications. This is especially good with executables that are long running, so you don't need to load the huge static executables too often.

Daniel

--
Dr. Daniel Gruner            dg...@ti...
Dept. of Chemistry           dan...@ut...
University of Toronto        phone: (416)-978-8689
80 St. George Street         fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada  finger for PGP public key
|
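Whichever route is taken, it is useful to know exactly which shared objects a given executable pulls in, since those are the candidates for the libraries list (a statically linked binary would report none). The sketch below extracts the absolute paths from `ldd`-style output. `ldd_paths` is a made-up helper and it assumes the usual glibc `ldd` output layout.

```shell
# Hypothetical helper: read ldd-style output on stdin and print the
# absolute path of every shared object it names.
ldd_paths() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3; next }  # name => /path (addr)
        $1 ~ /^\//               { print $1 }        # /path (addr), e.g. ld.so
    '
}

# Sample input in the usual glibc ldd layout (addresses are made up).
ldd_paths <<'EOF'
	libm.so.6 => /lib/libm.so.6 (0x40020000)
	libc.so.6 => /lib/libc.so.6 (0x40042000)
	/lib/ld-linux.so.2 (0x40000000)
EOF
```

For a real program the input would be `ldd ./my_program | ldd_paths`, where `./my_program` stands for whatever executable is being run on the nodes.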
From: Luke P. <lop...@wi...> - 2004-07-24 21:46:14
|
Hi all,

I find myself continually adding to my bproc libraries list, which is by now huge. I was considering just making all my libraries available to nodes via an NFS mount. Can anyone think of a reason why that would be bad? The stuff we do here is very long running, so the increased library loading time would be negligible.

Thanks
-Luke
|
From: Daniel G. <dg...@ti...> - 2004-07-21 17:51:00
|
On Wed, Jul 21, 2004 at 10:58:39AM -0500, Brian Barrett wrote:
> On Jul 20, 2004, at 9:19 AM, Daniel Gruner wrote:
>
> > On Tue, Jul 20, 2004 at 09:03:01AM -0500, Brian Barrett wrote:
> >> On Jul 20, 2004, at 8:08 AM, Thomas Eckert wrote:
> >>
> >>> this thread seems to have slipped off the bproc-list -- most likely I
> >>> replied to the wrong message -- so here is a forward of my reply :(
> >>>
> >>> I'm interested in the bproc3<->lam-7.0.x results: have you tried
> >>> bproc3 with the latest stable lam (7.0.x) and it did not work or are
> >>> you focusing on bproc4 now anyway due to other reasons (want to use
> >>> 2.6-kernels, ...)?
> >>
> >> Luke was using BProc 4, which LAM 7.0.x does not support (LAM 7.1,
> >> which just went into beta, supports what is currently in the BProc 4
> >> API. Hopefully, that means it will support BProc 4 when it goes
> >> stable).
> >>
> >> If you have any problems using LAM 7.0.x with BProc 3, please let us
> >> (the LAM developers) know. There has been some fairly extensive
> >> testing, so I would be surprised if there were problems in that area.
> >
> > I have recently installed LAM 7.0.2 on a CM3 (BProc 3) cluster. It
> > mostly works, but there are a few disturbing glitches:
> >
> > - I cannot seem to run 2 MPI jobs as the same user simultaneously (on
> > different sets of nodes, of course), since when I do the second
> > invocation of mpiexec (or its equivalent lamboot/run/lamhalt) it kills
> > the first lamd on the master node. It does seem to work for different
> > users, though.
>
> This is expected behavior. The design of LAM is that you start the
> RTE (the daemons) on the nodes you will use for all MPI applications,
> then run your application (or applications) inside that universe. If
> you need two separate universes, you can use the LAM_MPI_SESSION_SUFFIX
> environment variable to keep the daemons from clobbering each other.
> See the lamboot(1) man page and the LAM/MPI User Document (available in
> pdf form on the web page) for more information.

I see the logic in this, but what happens when you are running some kind of batch queuing system, request a set of nodes/processors, and then run lamboot on them? I guess the LAM_MPI_SESSION_SUFFIX must then be used for all mpiexec jobs, right? I had better rtfm...

Also, why did it only happen for the same user, but not when different users do it? I guess you can only kill your own jobs, but do the LAM daemons keep separate universes in that case?

> > - Just doing lamboot followed by lamhalt (whether or not some mpi job
> > is run) produces a core dump (I guess it is by lamhalt). Always.
> > Running mpiexec does it too.
>
> Yeah, that's disturbing. Is there a core file left around? If so, can
> you gdb the core file and send me a stack trace?

I'll try when I see one.

Thanks.
Daniel

--
Dr. Daniel Gruner            dg...@ti...
Dept. of Chemistry           dan...@ut...
University of Toronto        phone: (416)-978-8689
80 St. George Street         fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada  finger for PGP public key
|
From: Brian B. <brb...@la...> - 2004-07-21 15:58:50
|
On Jul 20, 2004, at 9:19 AM, Daniel Gruner wrote:

> On Tue, Jul 20, 2004 at 09:03:01AM -0500, Brian Barrett wrote:
>> On Jul 20, 2004, at 8:08 AM, Thomas Eckert wrote:
>>
>>> this thread seems to have slipped off the bproc-list -- most likely I
>>> replied to the wrong message -- so here is a forward of my reply :(
>>>
>>> I'm interested in the bproc3<->lam-7.0.x results: have you tried
>>> bproc3 with the latest stable lam (7.0.x) and it did not work or are
>>> you focusing on bproc4 now anyway due to other reasons (want to use
>>> 2.6-kernels, ...)?
>>
>> Luke was using BProc 4, which LAM 7.0.x does not support (LAM 7.1,
>> which just went into beta, supports what is currently in the BProc 4
>> API. Hopefully, that means it will support BProc 4 when it goes
>> stable).
>>
>> If you have any problems using LAM 7.0.x with BProc 3, please let us
>> (the LAM developers) know. There has been some fairly extensive
>> testing, so I would be surprised if there were problems in that area.
>
> I have recently installed LAM 7.0.2 on a CM3 (BProc 3) cluster. It
> mostly works, but there are a few disturbing glitches:
>
> - I cannot seem to run 2 MPI jobs as the same user simultaneously (on
>   different sets of nodes, of course), since when I do the second
>   invocation of mpiexec (or its equivalent lamboot/run/lamhalt) it
>   kills the first lamd on the master node. It does seem to work for
>   different users, though.

This is expected behavior. The design of LAM is that you start the
RTE (the daemons) on the nodes you will use for all MPI applications,
then run your application (or applications) inside that universe. If
you need two separate universes, you can use the LAM_MPI_SESSION_SUFFIX
environment variable to keep the daemons from clobbering each other.
See the lamboot(1) man page and the LAM/MPI User Document (available in
pdf form on the web page) for more information.

> - Just doing lamboot followed by lamhalt (whether or not some mpi job
>   is run) produces a core dump (I guess it is by lamhalt). Always.
>   Running mpiexec does it too.

Yeah, that's disturbing. Is there a core file left around? If so, can
you gdb the core file and send me a stack trace?

Thanks!

Brian
|
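For the stack-trace request above, a minimal Python sketch of pulling a backtrace out of a core file non-interactively. It assumes a gdb that accepts `--batch` and `-ex` (older releases may need a command file passed with `-x` instead). The paths in the example are placeholders, and the helper names are hypothetical.

```python
import subprocess


def gdb_backtrace_cmd(executable, core_file):
    """Build a gdb command line that prints a backtrace and exits."""
    return ["gdb", "--batch", "-ex", "bt", executable, core_file]


def backtrace(executable, core_file):
    """Run gdb on the core file and return its combined output as text."""
    result = subprocess.run(
        gdb_backtrace_cmd(executable, core_file),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths: point these at the binary that dumped core
    # (lamhalt is only a guess in the thread) and at the core file itself.
    print(" ".join(gdb_backtrace_cmd("/path/to/lamhalt", "core")))
```

The backtrace is most useful when the binary was built with debugging symbols; without them gdb will still print frames, but mostly as raw addresses.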
From: Steven J. <py...@li...> - 2004-07-20 20:55:08
|
Greetings,

It is probably worth cat /proc/cpuinfo. I found that some dual xeons with
some kernel versions don't actually disable hyperthreading even if that is
set in BIOS unless ACPI is enabled.

Otherwise, I'm out of ideas.

G'day,
sjames

-------------------------steven james, director of research, linux labs
... ........ ..... ....             230 peachtree st nw ste 2701
the original linux labs             atlanta.ga.us 30303
-since 1995                         http://www.linuxlabs.com
                                    office & fax 866.545.6306
-----------------------------------------------------------------------

On Tue, 20 Jul 2004, Luke Palmer wrote:

> Nope, I've got it off in the BIOS for everything except the master.
> I've heard the linux scheduler can't tell the fake CPUs from the real
> ones, and can get two processes stuck on the same physical CPU. That
> would have explained it though...
>
> Any other ideas?
> -Luke
>
> On Tue, 2004-07-20 at 14:09, Steven James wrote:
> > Greetings,
> >
> > Any chance hyperthreading is turned on?
> >
> > G'day,
> > sjames
> >
> > -------------------------steven james, director of research, linux labs
> > ... ........ ..... ....             230 peachtree st nw ste 2701
> > the original linux labs             atlanta.ga.us 30303
> > -since 1995                         http://www.linuxlabs.com
> >                                     office & fax 866.545.6306
> > -----------------------------------------------------------------------
> >
> > On Tue, 20 Jul 2004, Luke Palmer wrote:
> >
> > > Hi everyone,
> > >
> > > My cluster is made up of dual Xeon nodes with 2GB memory. We formerly
> > > ran openMosix, which actually worked quite well other than the one hour
> > > uptimes... Anyway, we observe that when placing two processes on a
> > > node, they run about half as fast as they would were a single process
> > > placed on a node. If I look at ps or top, both processes stay at 99.9%
> > > or so, and the load averages stay just shy of 2, but the wall clock
> > > doesn't lie...
> > >
> > > I'm running the most recent clustermatic stuff, so nodes and master have
> > > 2.4.22-cm36smp kernels over Fedora Core 1. This didn't used to happen
> > > with openMosix for the exact same executables. Any ideas of things to
> > > look at?
> > >
> > > Thanks
> > > -Luke
> > >
> > > -------------------------------------------------------
> > > This SF.Net email is sponsored by BEA Weblogic Workshop
> > > FREE Java Enterprise J2EE developer tools!
> > > Get your free copy of BEA WebLogic Workshop 8.1 today.
> > > http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> > > _______________________________________________
> > > BProc-users mailing list
> > > BPr...@li...
> > > https://lists.sourceforge.net/lists/listinfo/bproc-users
|
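The /proc/cpuinfo check suggested above can be scripted. The sketch below is a hedged illustration, not a definitive test: it assumes the kernel reports `processor`, `physical id` and `siblings` fields, which varies with kernel version and architecture. The function name `summarize_cpuinfo` is hypothetical.

```python
import os


def summarize_cpuinfo(text):
    """Count logical CPUs and distinct physical packages in cpuinfo text."""
    logical = 0
    physical_ids = set()
    siblings = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_ids.add(value)
        elif key == "siblings" and value.isdigit():
            siblings = int(value)
    return {"logical": logical, "packages": len(physical_ids), "siblings": siblings}


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as handle:
            print(summarize_cpuinfo(handle.read()))
```

On a dual-socket node of the kind described in the thread, more logical CPUs than packages would suggest hyperthreading is still active despite the BIOS setting. If `packages` comes back as 0, the kernel is not reporting `physical id` and the comparison cannot be made this way.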
From: Luke P. <lop...@wi...> - 2004-07-20 19:32:51
|
Nope, I've got it off in the BIOS for everything except the master.
I've heard the linux scheduler can't tell the fake CPUs from the real
ones, and can get two processes stuck on the same physical CPU. That
would have explained it though...

Any other ideas?
-Luke

On Tue, 2004-07-20 at 14:09, Steven James wrote:
> Greetings,
>
> Any chance hyperthreading is turned on?
>
> G'day,
> sjames
>
> -------------------------steven james, director of research, linux labs
> ... ........ ..... ....             230 peachtree st nw ste 2701
> the original linux labs             atlanta.ga.us 30303
> -since 1995                         http://www.linuxlabs.com
>                                     office & fax 866.545.6306
> -----------------------------------------------------------------------
>
> On Tue, 20 Jul 2004, Luke Palmer wrote:
>
> > Hi everyone,
> >
> > My cluster is made up of dual Xeon nodes with 2GB memory. We formerly
> > ran openMosix, which actually worked quite well other than the one hour
> > uptimes... Anyway, we observe that when placing two processes on a
> > node, they run about half as fast as they would were a single process
> > placed on a node. If I look at ps or top, both processes stay at 99.9%
> > or so, and the load averages stay just shy of 2, but the wall clock
> > doesn't lie...
> >
> > I'm running the most recent clustermatic stuff, so nodes and master have
> > 2.4.22-cm36smp kernels over Fedora Core 1. This didn't used to happen
> > with openMosix for the exact same executables. Any ideas of things to
> > look at?
> >
> > Thanks
> > -Luke
|
From: Steven J. <py...@li...> - 2004-07-20 19:09:47
|
Greetings,

Any chance hyperthreading is turned on?

G'day,
sjames

-------------------------steven james, director of research, linux labs
... ........ ..... ....             230 peachtree st nw ste 2701
the original linux labs             atlanta.ga.us 30303
-since 1995                         http://www.linuxlabs.com
                                    office & fax 866.545.6306
-----------------------------------------------------------------------

On Tue, 20 Jul 2004, Luke Palmer wrote:

> Hi everyone,
>
> My cluster is made up of dual Xeon nodes with 2GB memory. We formerly
> ran openMosix, which actually worked quite well other than the one hour
> uptimes... Anyway, we observe that when placing two processes on a
> node, they run about half as fast as they would were a single process
> placed on a node. If I look at ps or top, both processes stay at 99.9%
> or so, and the load averages stay just shy of 2, but the wall clock
> doesn't lie...
>
> I'm running the most recent clustermatic stuff, so nodes and master have
> 2.4.22-cm36smp kernels over Fedora Core 1. This didn't used to happen
> with openMosix for the exact same executables. Any ideas of things to
> look at?
>
> Thanks
> -Luke
|
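The symptom in the original report (two processes on one node each taking about twice as long by the wall clock) can be measured directly. The following Python sketch is an assumption-laden illustration: it presumes a Python interpreter is available on the node being tested, and the names `wall_time` and `slowdown` are hypothetical. It times one CPU-bound child, then two running concurrently, and reports the ratio.

```python
import subprocess
import sys
import time

# Source of a small CPU-bound child program; argv[1] is the loop count.
BUSY_LOOP = (
    "import sys\n"
    "n = int(sys.argv[1])\n"
    "x = 0\n"
    "for i in range(n):\n"
    "    x += i * i\n"
)


def wall_time(n_procs, iterations):
    """Run n_procs identical busy loops at once; return elapsed seconds."""
    start = time.perf_counter()
    children = [
        subprocess.Popen([sys.executable, "-c", BUSY_LOOP, str(iterations)])
        for _ in range(n_procs)
    ]
    for child in children:
        child.wait()
    return time.perf_counter() - start


def slowdown(t_single, t_pair):
    """Ratio of the two-process wall time to the one-process wall time."""
    return t_pair / t_single


if __name__ == "__main__":
    one = wall_time(1, 2000000)
    two = wall_time(2, 2000000)
    print("1 proc: %.2fs  2 procs: %.2fs  ratio: %.2f" % (one, two, slowdown(one, two)))
```

A ratio close to 1 would mean both processes ran on separate CPUs; a ratio close to 2 would match the behaviour described in the report, where both processes appear to share one physical CPU.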