From: Daniel G. <dg...@ti...> - 2004-06-09 22:16:17
On Wed, Jun 09, 2004 at 12:11:28AM -0700, Brian W. Barrett wrote:
> On Jun 8, 2004, at 7:10 PM, Brian W. Barrett wrote:
>
> > On Jun 8, 2004, at 6:06 PM, Daniel Gruner wrote:
> >
> >> A couple of months ago I had a similar problem, and Brian Barrett, who
> >> is a/the maintainer of lam-mpi said he would hack at it. I offered
> >> to test it on a bproc-4 cluster. However I never heard from him
> >> again.
> >> The exchange is in the bproc list archive (I assume there is an
> >> archive :-).
> >
> > Brian's still trying to find some time to work on it :).
> >
> > I'm just a little behind fixing some other bugs in LAM 7.1. And since
> > I'm doing most of this by guesswork (we don't have access to a bproc-4
> > cluster and I'm not even sure the bproc-3 cluster I had access to
> > exists any more), it's going slow.
> >
> > If you want to take a look, I think the bproc code in
> > share/etc/lam_*.c can be #if 0'ed out with bproc-4 and beonss. The
> > code in share/ssi/boot/bproc/src/ is going to take a bit more work,
> > but I didn't think it would be that bad. Of course, I haven't had
> > time to look in the last couple weeks.
>
> Amazing what a kick in the pants and boredom will do. I hacked up some
> support for bproc-4 based on previous e-mails from Erik and Greg.
> Testing is at the "it compiled without warnings" stage, so this might
> take a couple of tries to get right. Let me know how it goes.
>
> I've attached a patch against our current SVN tree. You will need to
> check out the trunk, apply the attached patch, run the autogen.sh
> script, then the usual configure ; make ; make install. More detailed
> instructions, including requirements for Autoconf, Automake, and
> Libtool can be found on our web page at:
>
> http://www.lam-mpi.org/svn/
>
> If anyone wants to try this out, but can't get subversion or the
> Autotools to work, let me know and I'll build a tarball and post it
> somewhere.
>
> Brian
>
> PS. In theory, Myrinet/GM and InfiniBand support should work just fine
> with the LAM trunk and BProc. If you have such a cluster, let me know
> how it goes - I don't think anyone has tried it before.

Brian,

I managed to build the latest version, including patches, on my bproc-4
cluster. It is made of Alpha machines, and I configured lam to use the
"fort" compiler (the Compaq Fortran compiler). It built without any
apparent issues, and installed fine.

Now, when I try to lamboot, it dumps core! Here is the output from
"lamboot -d bhost":

n-1<16164> ssi:boot:open: opening
n-1<16164> ssi:boot:open: opening boot module bproc
n-1<16164> ssi:boot:open: opened boot module bproc
n-1<16164> ssi:boot:open: opening boot module globus
n-1<16164> ssi:boot:open: opened boot module globus
n-1<16164> ssi:boot:open: opening boot module rsh
n-1<16164> ssi:boot:open: opened boot module rsh
n-1<16164> ssi:boot:open: opening boot module slurm
n-1<16164> ssi:boot:open: opened boot module slurm
n-1<16164> ssi:boot:select: initializing boot module slurm
n-1<16164> ssi:boot:slurm: not running under SLURM
n-1<16164> ssi:boot:select: boot module not available: slurm
n-1<16164> ssi:boot:select: initializing boot module rsh
n-1<16164> ssi:boot:rsh: module initializing
n-1<16164> ssi:boot:rsh:agent: rsh
n-1<16164> ssi:boot:rsh:username: <same>
n-1<16164> ssi:boot:rsh:verbose: 1000
n-1<16164> ssi:boot:rsh:algorithm: linear
n-1<16164> ssi:boot:rsh:no_n: 0
n-1<16164> ssi:boot:rsh:no_profile: 0
n-1<16164> ssi:boot:rsh:fast: 0
n-1<16164> ssi:boot:rsh:ignore_stderr: 0
n-1<16164> ssi:boot:rsh:priority: 10
n-1<16164> ssi:boot:select: boot module available: rsh, priority: 10
n-1<16164> ssi:boot:select: initializing boot module globus
n-1<16164> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<16164> ssi:boot:select: boot module not available: globus
n-1<16164> ssi:boot:select: initializing boot module bproc
n-1<16164> ssi:boot:bproc: module initializing
n-1<16164> ssi:boot:bproc:verbose: 1000
n-1<16164> ssi:boot:bproc:priority: 50
n-1<16164> ssi:boot:select: boot module available: bproc, priority: 50
n-1<16164> ssi:boot:select: finalizing boot module slurm
n-1<16164> ssi:boot:slurm: finalizing
n-1<16164> ssi:boot:select: closing boot module slurm
n-1<16164> ssi:boot:select: finalizing boot module rsh
n-1<16164> ssi:boot:rsh: finalizing
n-1<16164> ssi:boot:select: closing boot module rsh
n-1<16164> ssi:boot:select: finalizing boot module globus
n-1<16164> ssi:boot:globus: finalizing
n-1<16164> ssi:boot:select: closing boot module globus
n-1<16164> ssi:boot:select: selected boot module bproc

LAM 7.1a1svn/MPI 2 C++/ROMIO/bproc - Indiana University

n-1<16164> ssi:boot:base: looking for boot schema in following directories:
n-1<16164> ssi:boot:base: <current directory>
n-1<16164> ssi:boot:base: $TROLLIUSHOME/etc
n-1<16164> ssi:boot:base: $LAMHOME/etc
n-1<16164> ssi:boot:base: /usr/local/etc
n-1<16164> ssi:boot:base: looking for boot schema file:
n-1<16164> ssi:boot:base: bhost
n-1<16164> ssi:boot:base: found boot schema: bhost
n-1<16164> ssi:boot:bproc: found the following hosts:
n-1<16164> ssi:boot:bproc: n0 192.168.101.1 (cpu=1)
n-1<16164> ssi:boot:bproc: n1 n0 (cpu=1)
n-1<16164> ssi:boot:bproc: n2 n1 (cpu=1)
n-1<16164> ssi:boot:bproc: n3 n2 (cpu=1)
n-1<16164> ssi:boot:bproc: n4 n3 (cpu=1)
n-1<16164> ssi:boot:bproc: n5 n4 (cpu=1)
n-1<16164> ssi:boot:bproc: n6 n5 (cpu=1)
n-1<16164> ssi:boot:bproc: resolved hosts:
n-1<16164> ssi:boot:bproc: n0 192.168.101.1 --> 192.168.101.1 (origin)
n-1<16164> ssi:boot:bproc: n1 n0 --> 192.168.101.100
n-1<16164> ssi:boot:bproc: n2 n1 --> 192.168.101.101
n-1<16164> ssi:boot:bproc: n3 n2 --> 192.168.101.102
n-1<16164> ssi:boot:bproc: n4 n3 --> 192.168.101.103
n-1<16164> ssi:boot:bproc: n5 n4 --> 192.168.101.104
n-1<16164> ssi:boot:bproc: n6 n5 --> 192.168.101.105
Segmentation fault (core dumped)

Any ideas? Any other things to try?

Daniel

--
Dr. Daniel Gruner                   dg...@ti...
Dept. of Chemistry                  dan...@ut...
University of Toronto               phone: (416)-978-8689
80 St. George Street                fax:   (416)-978-5325
Toronto, ON M5S 3H6, Canada         finger for PGP public key
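
For anyone reading the debug output above, here is a minimal sketch of how the "ssi:boot:select" lines can be summarized to see which boot module was picked before the crash. The LOG excerpt is copied from the output in this message; the summarize() helper and the regular expressions are hypothetical, written only for illustration, and are not part of LAM. The idea that the highest-priority available module wins is an assumption that happens to be consistent with this log (bproc at 50 over rsh at 10), not something the log itself states.

```python
import re

# Excerpt of the "lamboot -d bhost" output quoted in this message.
LOG = """\
n-1<16164> ssi:boot:select: boot module not available: slurm
n-1<16164> ssi:boot:select: boot module available: rsh, priority: 10
n-1<16164> ssi:boot:select: boot module not available: globus
n-1<16164> ssi:boot:select: boot module available: bproc, priority: 50
n-1<16164> ssi:boot:select: selected boot module bproc
"""

# Hypothetical patterns matching the two kinds of lines we care about.
AVAILABLE = re.compile(
    r"ssi:boot:select: boot module available: (\w+), priority: (\d+)"
)
SELECTED = re.compile(r"ssi:boot:select: selected boot module (\w+)")


def summarize(log):
    """Return ({module: priority}, name of the module reported as selected)."""
    available = {name: int(prio) for name, prio in AVAILABLE.findall(log)}
    match = SELECTED.search(log)
    return available, (match.group(1) if match else None)


available, selected = summarize(LOG)
# Assumption: the module with the highest priority is the one chosen.
highest = max(available, key=available.get)
print("available:", available)
print("selected:", selected)
print("highest priority:", highest)
```

On this excerpt the module reported as selected and the highest-priority available module agree (both bproc), so the segmentation fault occurs after module selection, somewhere past the "resolved hosts" step in the bproc boot module.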