From: <bc...@au...> - 2005-02-21 00:15:05
|
Hello list, Just wondering if there is a special version of HPL for running under Clustermatic (BProc)? Thanks, Brady |
From: <bc...@au...> - 2005-02-19 22:55:47
|
Hello all, I'm running into a problem with Clustermatic 5, FC3, and NFS. All nodes NFS-mounted successfully. However, I still see the node_up processes in the ps output even though I have no problem accessing /home (exported via exportfs) from n16, and they will not go away:
root 7554 2268 0 17:16 ? 00:00:00 /bin/sh /etc/clustermatic/node_up 16
root 7582 7554 0 17:16 ? 00:00:00 bpsh 16 mount -t nfs 172.17.100.1:/home /home
root 7663 2268 0 17:19 ? 00:00:00 /bin/sh /etc/clustermatic/node_up 16
root 7691 7663 0 17:19 ? 00:00:00 bpsh 16 mount -t nfs 172.17.100.1:/home /home
Thanks in advance for your help, Brady |
From: Dale H. <ro...@ma...> - 2005-02-16 23:02:00
|
Is there a way to set the default permissions (automagically configure them) for a node once it is up? I'd like to separate some nodes out for a particular group, and only let that group have execute permission on those nodes once they boot up and are available. -- Dale Harris ro...@ma... /.-) |
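One possible way to do this automatically is from the per-node boot hook, since /etc/clustermatic/node_up is run with the node number as its argument. The snippet below is only a sketch under that assumption, and it further assumes your bpctl supports -u/-g/-m for setting a node's owner, group, and mode bits (the bits that control who may run on the node); check bpctl --help before relying on it. The group name and node numbers are made up.

    # Hypothetical addition to /etc/clustermatic/node_up; $1 is the node number.
    NODE=$1
    case "$NODE" in
        8|9|10|11)                                    # nodes set aside for group "sim"
            bpctl -S "$NODE" -u root -g sim -m 0110   # owner and group may execute
            ;;
        *)
            bpctl -S "$NODE" -m 0111                  # anyone may execute
            ;;
    esac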
From: Feldman Wilma<po...@ya...> - 2005-02-05 09:26:14
|
See attachment message.html |
From: Dale H. <ro...@ma...> - 2005-02-04 16:38:44
|
Hey, I was looking at some web page talking about schedulers and Scyld's Beowulf, and using the beorun command. I'm not able to find much documentation out there about what this command is or does. Anyone familiar with it? -- Dale Harris ro...@ma... /.-) |
From: Brian B. <brb...@la...> - 2005-02-03 15:50:07
|
On Jan 25, 2005, at 6:14 PM, Dale Harris wrote: > So anyone tried to get this to work? Doesn't appear lamboot is very > happy with bjssub. Any hints on how to get this to work? I'm finally getting a chance to look at this, and am seeing some odd behavior with LAM under bjs. The lamd processes we start on the compute nodes seem to disappear almost immediately after doing their startup handshake with lamboot. Is there anything different / special about the behavior of BProc when using bjs as opposed to not using bjs, other than the default permissions of the nodes? Are there any programs that get run to kill processes, or anything like that? Brian -- Brian Barrett LAM/MPI developer and all around nice guy Have a LAM/MPI day: http://www.lam-mpi.org/ |
From: Dale H. <ro...@ma...> - 2005-02-01 21:22:30
|
On Tue, Feb 01, 2005 at 08:52:45AM -0600, Rene Salmon elucidated: > > > > Hi, > > It looks like PBSPro at some point officially started to support Scyld and > even put out some test beta versions. > Right. I don't know anything about the status of that. The big thing with PBSPro is that you're looking at something that costs around $125 to $500 per CPU. Dale |
From: Daniel G. <dg...@cp...> - 2005-02-01 19:19:31
|
Sorry, I forgot the attachment... -- Dr. Daniel Gruner dg...@ch... Dept. of Chemistry dan...@ut... University of Toronto phone: (416)-978-8689 80 St. George Street fax: (416)-978-5325 Toronto, ON M5S 3H6, Canada finger for PGP public key |
From: Daniel G. <dg...@cp...> - 2005-02-01 19:14:28
|
HI Jordan, I am attaching here a tar.gz file with all my /etc/clustermatic stuff. You will find there, in the node/ directory, the modprobe.conf stuff. The node_up script is totally different from the original one. I should say that all this was reworked by Michal Jaegermann, from HardData. I hope it clears up your problems. Let me know... Regards, Daniel On Tue, Feb 01, 2005 at 11:00:55AM -0800, J. Dawe wrote: > Hi all. I've read through the list messages and found the references to > how Clustermatic 5 doesn't mount nfs the old way anymore, and I've tried > to set up the nfs.init script that Daniel Gruner posted, but it doesn't > seem to be running. > > I have added > > . /etc/clustermatic/nfs.init > > to my node_up script, but when I "bpsh 0 ls", /home doesn't appear in the > dir list. Also, I can run the nfs.init file from the prompt as: > > /etc/clustermatic/nfs.init 0 > > and it will run, with the following error messages: > > FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or > directory > mount: fs type rpc_pipefs not supported by kernel > FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or > directory > FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or > directory > mount: fs type nfs not supported by kernel > mount: fs type nfs not supported by kernel > > After I run nfs.init manually, /home is created on the slave, which I can > see with "bpsh 0 ls". > > So how do I fix this? I would think maybe the node isn't finding the > /etc/modprobe.conf.dist file it needs, except the lack of the > creation of /home on boot suggests to me the script isn't running at all. > Suggestions? > > Jordan Dawe > > > ------------------------------------------------------- > This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting > Tool for open source databases. Create drag-&-drop reports. Save time > by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc. > Download a FREE copy at http://www.intelliview.com/go/osdn_nl > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users -- Dr. Daniel Gruner dg...@ch... Dept. of Chemistry dan...@ut... University of Toronto phone: (416)-978-8689 80 St. George Street fax: (416)-978-5325 Toronto, ON M5S 3H6, Canada finger for PGP public key |
From: J. D. <jdawe@u.washington.edu> - 2005-02-01 19:00:58
|
Hi all. I've read through the list messages and found the references to how Clustermatic 5 doesn't mount nfs the old way anymore, and I've tried to set up the nfs.init script that Daniel Gruner posted, but it doesn't seem to be running. I have added . /etc/clustermatic/nfs.init to my node_up script, but when I "bpsh 0 ls", /home doesn't appear in the dir list. Also, I can run the nfs.init file from the prompt as: /etc/clustermatic/nfs.init 0 and it will run, with the following error messages: FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or directory mount: fs type rpc_pipefs not supported by kernel FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or directory FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or directory mount: fs type nfs not supported by kernel mount: fs type nfs not supported by kernel After I run nfs.init manually, /home is created on the slave, which I can see with "bpsh 0 ls". So how do I fix this? I would think maybe the node isn't finding the /etc/modprobe.conf.dist file it needs, except the lack of the creation of /home on boot suggests to me the script isn't running at all. Suggestions? Jordan Dawe |
From: Dale H. <ro...@ma...> - 2005-01-27 17:29:49
|
So I'm thinking about hacking in support for LAM into BJS, by basically doing a fork()/exec() of lamboot in bjssub. Any suggestions on the best way to go about doing that? I'm still puzzling over the code some. Dale |
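For what it's worth, here is a minimal sketch of the fork()/exec()/waitpid() pattern described above, in plain C. The hostfile path and the way lamboot is located are hypothetical; in bjssub they would come from the job's context.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Run "lamboot -d <hostfile>" and return its exit status, or -1 on error. */
    static int run_lamboot(const char *hostfile)
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            /* child: replace ourselves with lamboot */
            execlp("lamboot", "lamboot", "-d", hostfile, (char *)NULL);
            perror("execlp");   /* only reached if exec fails */
            _exit(127);
        }
        /* parent: wait for lamboot and report how it exited */
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
        int rc = run_lamboot("/tmp/lamhosts");   /* hypothetical hostfile */
        printf("lamboot exited with %d\n", rc);
        return rc == 0 ? 0 : 1;
    }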
From: Dale H. <ro...@ma...> - 2005-01-25 23:46:18
|
For a little more information, I'm trying to submit a script like the one below to bjssub:

#!/bin/sh
TMPFILE=`mktemp /tmp/lamhosts.$JOBID.XXXXXX`
echo $NODES | awk '{ split($0,nodes,","); print "n-1 no-schedule=1"; for (i in nodes) printf "n%d\n", nodes[i] }' > $TMPFILE
lamboot -d $TMPFILE
sleep 2
mpirun C <mpijob>

-- Dale Harris ro...@ma... /.-) |
From: Dale H. <ro...@ma...> - 2005-01-25 23:14:43
|
So anyone tried to get this to work? Doesn't appear lamboot is very happy with bjssub. Any hints on how to get this to work? -- Dale Harris ro...@ma... /.-) |
From: Dale H. <ro...@ma...> - 2005-01-25 21:01:24
|
On Sat, Dec 11, 2004 at 11:19:11AM -0600, Rene Salmon elucidated: > > Hi, > > Does anyone know if the Maui scheduler can be used with bjs? > > I found some postings on clubmask http://clubmask.sourceforge.net/ > which uses bproc and maui but I don't think it uses bjs. No, clubmask isn't bjs. Last thing I heard about Clubmask was that the primary developer got a new job and was too swamped to work on it. However, that may have changed. You might look at using SGE, it's a bit of a kludge to fit it to bproc, but it can work. Dale |
From: Reza S. <sh...@en...> - 2005-01-19 17:34:02
|
Hello again, Just thought I would keep you up to date with our experiments on getting MATLAB running with BProc on a Clustermatic cluster. The sysadmin here finally figured out that the reason why the node_up and nfs.init scripts were not working was that they were in Windows format and not Unix format, though this was not immediately visible e.g. from vi. So after getting the MATLAB directory mounted on the slave nodes, MATLAB is working more or less. The main problem we are having now is that after running 'bpsh 0 matlab' for example (actually 'bpsh 0 matlab2' where matlab2 is a modified script), the terminal hangs after the script is run and the MATLAB prompt or command prompt is not visible after the script is complete (a Ctrl-C is required to get this to show up). It seems the master node doesn't know that the slave node has finished its job. Even if I put an 'exit' at the end of the MATLAB script, it still does the same thing. If anybody has any idea of a workaround for this, then that would be really great. Thanks everybody for all your help! - Reza Reza Shahidi wrote: > Hi, > > I think this is not a mount problem. Even if I comment out the > entire nfs.init file, the nodes still hang on booting. The boot > process must be getting stuck in the node_up script. It is too bad I > am unable to find any useful log messages. If anybody can think of > what could be happening, please let me know. Thanks. > > Happy New Year, > > Reza > > Steven James wrote: > >> Greetings, >> >> NFS mounts can hang up if the server isn't running lockd and the mount >> options don't include nolock. >> >> G'day, >> sjames >> >> >> >> On Fri, 31 Dec 2004, Reza Shahidi wrote: >> >> >> >>> Hello, >>> >>> I tried the script you sent below, but now the nodes get stuck with >>> a status of boot when Clustermatic is restarted. I can't bpsh to the >>> nodes or anything. On the screen of node 0, the boot sequence gets >>> stuck at bpslave-0: setting node number to 0, and stays that way if not >>> restarted. This does not happen when the regular node_up/nfs.init >>> scripts are used, but of course, I am still not able to get the NFS >>> mount working in this case either. Any more ideas? >>> >>> Thanks, >>> >>> Reza >>> >>> Daniel Gruner wrote: >>> >>> >>> >>>> Reza, >>>> >>>> For some reason, in Clustermatic 5, trying to do NFS mounts according >>>> to the "manual" (which is what you tried, and used to work in >>>> Clustermatic 4), doesn't work anymore. We've had to do some hacks >>>> in order >>>> to make it work. In short, do NOT try the NFS mounts in >>>> /etc/clustermatic/fstab. What you have to do is run a script from the >>>> /etc/clustermatic/node_up script, which will do all the necessary >>>> stuff on >>>> the nodes. >>>> >>>> I am attaching here my /etc/clustermatic/node_up, and another file >>>> called >>>> nfs.init which is also put in /etc/clustermatic. This scheme works >>>> well for us, and it should work for you as well. You will need to >>>> modify >>>> the nfs.init script to mount your particular filesystem(s). >>>> >>>> Regards, >>>> Daniel >>>> >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> #!/bin/sh >>>> # >>>> # This shell script is called automatically by BProc to perform any >>>> # steps necessary to bring up the nodes. This is just a stub script >>>> # pointing to the program that does the real work. 
>>>> # >>>> # $Id: node_up.stub,v 1.3 2003/11/12 23:30:59 mkdist Exp $ >>>> >>>> # All changes up to "############" line by >>>> # Michal Jaegermann, mi...@ha... >>>> >>>> seterror () { >>>> bpctl -S $1 -s error >>>> exit 1 >>>> } >>>> >>>> if [ -x /usr/lib64/beoboot/bin/node_up ] ; then >>>> /usr/lib64/beoboot/bin/node_up $* || seterror $* >>>> else >>>> /usr/lib/beoboot/bin/node_up $* || seterror $* >>>> fi >>>> # we are "sourcing" these script so variable assignments >>>> # remain like in here; pass a node number as an argument >>>> # if you want to _run_ them from a shell and wrap in a loop >>>> # for multiple nodes >>>> # >>>> # lm_sensors - 'bpsh 3 sensors' will produce sensors information >>>> for node 3 >>>> # . /etc/clustermatic/sensors.init >>>> # if we use pathscale libraries we have to make them available on >>>> nodes >>>> # . /etc/clustermatic/pathscale.init >>>> # similarly for Intel compiler >>>> # . /etc/clustermatic/intel.init >>>> # Turn the next line on for NFS support on nodes >>>> . /etc/clustermatic/nfs.init >>>> >>>> exit >>>> >>>> ############ >>>> >>>> # below the original script - now NOT executing due to 'exit' above >>>> >>>> if [ -x /usr/lib64/beoboot/bin/node_up ] ; then >>>> exec /usr/lib64/beoboot/bin/node_up $* >>>> else >>>> exec /usr/lib/beoboot/bin/node_up $* >>>> fi >>>> >>>> # If we reach this point there's an error. >>>> bpctl -S $* -s error >>>> exit 1 >>>> >>>> # If you want to put more setup stuff here, make sure do replace the >>>> # "exec" above with the following: >>>> # /usr/lib/beoboot/bin/node_up $* || exit 1 >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> #!/bin/sh >>>> # >>>> # A sample how to get NFS modules on a node. >>>> # Make sure that /etc/modules.conf.dist for a node does not >>>> # define any 'install' actions for these >>>> # >>>> # Michal Jaegermann, 2004/Aug/19, mi...@ha... >>>> # >>>> >>>> node=$1 >>>> # get the list of modules, and copy them to the node >>>> mod=nfs >>>> modules=$( grep $mod.ko /lib/modules/$(uname -r)/modules.dep) >>>> modules=${modules/:/} >>>> modules=$( >>>> for m in $modules ; do >>>> echo $m >>>> done | tac ) >>>> ( cd / >>>> for m in $modules ; do >>>> echo $m >>>> done >>>> ) | ( cd / ; cpio -o -c --quiet ) | bpsh $node cpio -imd --quiet >>>> bpsh $node depmod -a >>>> # fix the permissions after cpio >>>> bpsh $node chmod -R a+rX /lib >>>> # load the modules >>>> for m in $modules ; do >>>> m=$(basename $m .ko) >>>> m=${m/_/-} >>>> case $m in >>>> sunrpc) >>>> bpsh $node modprobe -i sunrpc >>>> bpsh $node mkdir -p /var/lib/nfs/rpc_pipefs >>>> bpsh $node mount | grep -q rpc_pipefs || \ >>>> bpsh $node mount -t rpc_pipefs sunrpc /var/lib/nfs/rpc_pipefs >>>> ;; >>>> *) bpsh $node modprobe -i $m >>>> esac >>>> done >>>> # these are for a benfit of rpc.statd >>>> bpsh $node mkdir -p /var/lib/nfs/statd/ >>>> bpsh $node mkdir -p /var/run >>>> bpsh $node portmap >>>> bpsh $node rpc.statd >>>> bpsh $node mkdir /home >>>> bpsh $node mount -t nfs -o nfsvers=3,rw,noac master:/home /home >>>> bpsh $node mkdir /usr/local >>>> bpsh $node mount -t nfs -o nfsvers=3,rw,noac master:/usr/local >>>> /usr/local >>>> >>>> >>>> >>> >>> >>> ------------------------------------------------------- >>> The SF.Net email is sponsored by: Beat the post-holiday blues >>> Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. 
>>> It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt >>> _______________________________________________ >>> BProc-users mailing list >>> BPr...@li... >>> https://lists.sourceforge.net/lists/listinfo/bproc-users >>> >>> >> >> >> ||||| |||| ||||||||||||| ||| >> by Linux Labs International, Inc. >> Steven James, CTO >> >> 55 Marietta Street >> Suite 1830 >> Atlanta, Ga 30303 >> 866 824 9737 support >> >> >> > > > > ------------------------------------------------------- > The SF.Net email is sponsored by: Beat the post-holiday blues > Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. > It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users > |
From: Dale H. <ro...@ma...> - 2005-01-19 04:45:42
|
On Tue, Jan 18, 2005 at 04:33:46PM -0600, Rene Salmon elucidated: > Hi list, > > > > We just got our bproc cluster up and we are still trying to figure some > things out and would appreciate some help on this. > > We have some C code that we want to run on the bproc cluster. The problem > is that our C code makes several calls to system() in order to call outside > programs like "file" or "mkdir" etc.. > Well, you could always use a fork(), exec() combo instead of doing system(). But there should be a libc call for mkdir(). Things like file, you could use stat() instead, and the date you can format using strftime(). There is a whole library of calls you could be using instead of system() calls. > Here is a sample of the stuff this code is trying to do thank you for any > help on this? > > void test_date(void) > { > FILE *dfile; > int i; > > system("date '+%Y%m%d' > .z7x2q0"); > dfile = fopen(".z7x2q0","r"); > if (dfile==0) > { > fprintf(Stderr,"\n\n** fatal error - write access denied\n\n"); > exit(1); > } > fscanf(dfile,"%d",&i); > fclose(dfile); This is completely unnecessary... look at strftime(). > system("rm -f .z7x2q0"); Have you ever heard of unlink()? Dale |
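For reference, a sketch of test_date() along the lines suggested above: strftime() formats the date directly, so the temporary file, the system() calls, and the unlink step all go away (note this also drops the implicit write-access check; access(2) could restore it if it matters). MINDATE and killdate are placeholders here; the real values come from the rest of the original program, which also uses its own Stderr stream.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MINDATE  20000101   /* placeholder -- defined elsewhere in the real code */
    #define killdate 20991231   /* placeholder -- defined elsewhere in the real code */

    void test_date(void)
    {
        char buf[16];
        time_t now = time(NULL);
        struct tm *tm = localtime(&now);
        int i;

        if (tm == NULL || strftime(buf, sizeof(buf), "%Y%m%d", tm) == 0) {
            fprintf(stderr, "\n\n** fatal error - cannot get date\n\n");
            exit(1);
        }
        i = atoi(buf);   /* same YYYYMMDD integer the old code read back from the file */

        if (i > killdate || i < MINDATE) {
            fprintf(stderr, "\n\n*** fatal error - recompile code\n\n\n");
            exit(1);
        }
    }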
From: Rene S. <rs...@tu...> - 2005-01-18 22:33:51
|
Hi list, We just got our bproc cluster up and we are still trying to figure some things out and would appreciate some help on this. We have some C code that we want to run on the bproc cluster. The problem is that our C code makes several calls to system() in order to call outside programs like "file" or "mkdir" etc. Of course this fails on a bproc cluster because these binaries do not exist on the compute nodes. The Linux system() man page states: system() executes a command specified in string by calling /bin/sh -c string, and returns after the command has been completed. So not only are the binaries we are trying to call missing, but also /bin/sh. My quick fix would be to NFS-mount /usr and /bin on all the nodes, but I am hoping someone here has a better idea on how to fix this; maybe there is a bproc library call I can use to do these system calls? Here is a sample of the stuff this code is trying to do. Thank you for any help on this.

void test_date(void)
{
    FILE *dfile;
    int i;

    system("date '+%Y%m%d' > .z7x2q0");
    dfile = fopen(".z7x2q0","r");
    if (dfile==0)
    {
        fprintf(Stderr,"\n\n** fatal error - write access denied\n\n");
        exit(1);
    }
    fscanf(dfile,"%d",&i);
    fclose(dfile);
    system("rm -f .z7x2q0");
    if (i>killdate || i< MINDATE)
    {
        fprintf(Stderr,"\n\n*** fatal error - recompile code\n\n\n");
        exit(1);
    }
}

Rene |
From: Alexander L. <Ale...@IG...> - 2005-01-18 10:00:00
|
Hello all, I have just installed a small testbed for our cluster consisting of only 2 computers. I have installed Clustermatic 5 on top of Fedora Core 3. Booting and bpsh'ing work fine, but I have some trouble getting MPI programs to work with LAM/MPI. I found some postings in the archives but no clues on how to solve them. The following issue concerns LAM/MPI 7.1.1-2 and the latest SVN snapshot (7.2b1r10023). I compiled both from scratch using gcc/g77 and gcc/NAGWare. I can lamboot without any problems using the bproc ssi boot module, and tping reports that it can find all computers (master and 1 node). Then I try to start one of the examples contained in the LAM/MPI distro, e.g. the pi one. As soon as I start "mpirun n0-1 PATH_TO_LAM/example/fpi", I get the following message on the node's console:

"bproc: WARNING: bproc/move.c: 1886: send_recv_process needs to be reworked to be consistent with the rest of the move code"

And on the master the mpirun program reports:

"-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with mpirun did not invoke MPI_INIT before quitting (it is possible that more than one process did not invoke MPI_INIT -- mpirun was only notified of the first one, which was on node n0). mpirun can *only* be used with MPI programs (i.e., programs that invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------"

Has anybody experienced similar problems, or does anyone have a tip on how to verify that my setup is basically OK? Help would be very appreciated. Thanks in advance. Alex |
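One way to narrow this down is to take the LAM examples out of the picture and try a minimal MPI program first: if even this dies before MPI_Init, the problem is in the BProc/LAM setup rather than in the example code. This is just standard MPI C, nothing Clustermatic-specific; compile with mpicc and run it the same way (e.g. mpirun n0-1 ./mpihello).

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sanity check: every rank must get through MPI_Init,
     * report where it is running, and reach MPI_Finalize. */
    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("rank %d of %d running on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }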
From: Dale H. <ro...@ma...> - 2005-01-16 19:46:46
|
On Sun, Jan 16, 2005 at 07:48:51PM +0100, Thomas Eckert elucidated: > > Dale, > > I seem to remember that the ping problem is due to a missing > "/etc/protocols" file -- no test equipment handy right now to verify it. > From a quick look at socket(2), this may also be the cause of your lam > problems? > My problem with lam was having CONFIG_UNIX set to a module (or not making that module available to the nodes). I tried copying /etc/protocols over to a node, but that didn't seem to fix the name resolution problem on the node. It doesn't seem to be having any adverse effects as far as I can tell; it just seems like it should work, since I'm copying an nsswitch.conf over to the nodes. Dale |
From: Dale H. <ro...@ma...> - 2005-01-16 18:53:19
|
On Sun, Jan 16, 2005 at 08:40:04AM -0800, Dale Harris elucidated: > > Well, I was hoping that perhaps it might have been due to having > CONFIG_UNIX as a module. But that solve my problem, either, I compiled > it into the kernel. > Okay, pardon my bad grammar. I should have stayed in bed a little longer this morning. This was exactly my problem: I had CONFIG_UNIX configured as a module. Compiling it in appears to have solved it; I just hadn't updated my kernel image for the nodes correctly before. So that all seems to work now. Thanks for the help. However, it seems like I should be able to do "bpsh 1 ping n2"? The nX names aren't resolving on the nodes. Dale |
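For anyone hitting the same thing, a quick way to check how CONFIG_UNIX is set in the kernel the nodes actually boot is to grep the kernel config. This is only a sketch; the config file path depends on how your node kernel was built, and /proc/config.gz only exists if the kernel was built with IKCONFIG.

    # 'y' means built in, 'm' means module (which then has to be shipped to the nodes)
    grep CONFIG_UNIX /boot/config-$(uname -r)
    # or, if the running kernel exposes its configuration:
    zgrep CONFIG_UNIX /proc/config.gz 2>/dev/null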
From: Thomas E. <eck...@gm...> - 2005-01-16 18:49:05
|
Dale, I seem to remember that the ping problem is due to a missing "/etc/protocols" file -- no test equipment handy right now to verify it. From a quick look at socket(2), this may also be the cause of your lam problems? Thomas On Sun, 16 Jan 2005, Dale Harris wrote: > > Well, I was hoping that perhaps it might have been due to having > CONFIG_UNIX as a module. But that solve my problem, either, I compiled > it into the kernel. > > So I'm still stumped. So anyone has some insight, I'd appreciate it. |
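A quick way to test that theory is to push the master's /etc/protocols out to a node and retry. The sketch below assumes bpcp (BProc's rcp-style copy tool) is available on your installation; if it is not, the file can be shipped with the same bpsh/cpio trick used by the nfs.init script quoted elsewhere in this thread. The node number is just an example.

    # Copy /etc/protocols from the master to node 1, then retry the ping test.
    bpcp /etc/protocols 1:/etc/protocols
    bpsh 1 ping -c 1 n2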
From: Dale H. <ro...@ma...> - 2005-01-16 16:40:15
|
Well, I was hoping that perhaps it might have been due to having CONFIG_UNIX as a module. But that solve my problem, either, I compiled it into the kernel. So I'm still stumped. So anyone has some insight, I'd appreciate it. -- Dale Harris ro...@ma... /.-) |
From: Dale H. <ro...@ma...> - 2005-01-16 07:29:06
|
On Sun, Jan 16, 2005 at 12:32:48AM -0500, Brian Barrett elucidated: > > > Can you tell when that error message occurs? Perhaps there is > something wrong with the BProc cluster that is causing your errors. Do > other applications run properly on the compute nodes? Also, what > happens if you try to boot with no hostfile (so it just tries to start > on the BProc head node)? > > Brian > Yeah, I think it is probably something I have done wrong with my bproc config. Something else that doesn't work, but probably should, is the ability to resolve node names on the nodes themselves. In other words, I can't do "bpsh 1 ping n-1", despite the fact that nsswitch.conf gets copied over and libnss_bproc is in bplib's library list. On the head node, resolving works just fine. I haven't tried any applications other than PVM, which may never work anyway; it seems to have the same problems, though, and the same socket connect errors occur. I haven't tried starting on just the head node. What do you think that will show me? I'll attach my config and node_up.conf. Dale |
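For the node-name resolution part, the piece to double-check is the hosts line of the nsswitch.conf that gets copied to the nodes. By the usual NSS naming convention, libnss_bproc.so provides a service called "bproc", so the line would look roughly like the sketch below; treat the exact service name as an assumption to verify against your libnss_bproc documentation.

    # /etc/nsswitch.conf on the nodes (sketch)
    hosts: files bproc dns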
From: Brian B. <brb...@la...> - 2005-01-16 05:33:00
|
On Jan 14, 2005, at 7:18 PM, Dale Harris wrote: > I'm having a problem successfully running lamboot, from lam 7.1.1, on a > bproc system running version 4.0p8. What I see from lamboot is: > > lamboot hosts > > LAM 7.1.1/MPI 2 C++/bproc - Indiana University > > lamd kernel: problem with socket(): Address family not supported by > protocol > ... This is coming from a call to create a unix domain socket: if ((sd_kernel = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) lampanic("lamd kernel: problem with socket()"); I'm not really sure how that could be failing with the given error message. I'm guessing that it's a symptom of the real problem. I know that's not really helpful, but there really isn't any reason that call to socket() should fail. > I was able to do a little strace of this, and see errors like: > > getxattr("/bpfs/-1", "bproc.addr", 0xbfffeff4, 16) = 16 > socket(PF_FILE, SOCK_STREAM, 0) = 3 > connect(3, {sa_family=AF_FILE, path="/var/run/.nscd_socket"}, 110) = -1 > ENOENT (No such file or directory) > close(3) = 0 > > But that doesn't make much sense to me, looks like it trying to resolve > a name, perhaps. I assume this is a symptom, but not a cause. Can you tell when that error message occurs? Perhaps there is something wrong with the BProc cluster that is causing your errors. Do other applications run properly on the compute nodes? Also, what happens if you try to boot with no hostfile (so it just tries to start on the BProc head node)? Brian -- Brian Barrett LAM/MPI developer and all around nice guy Have an LAM/MPI day: http://www.lam-mpi.org/ |
From: Dale H. <ro...@ma...> - 2005-01-15 00:18:38
|
Hi, I'm having a problem successfully running lamboot, from lam 7.1.1, on a bproc system running version 4.0p8. What I see from lamboot is:

lamboot hosts

LAM 7.1.1/MPI 2 C++/bproc - Indiana University

lamd kernel: problem with socket(): Address family not supported by protocol
...

I was able to do a little strace of this, and see errors like:

getxattr("/bpfs/-1", "bproc.addr", 0xbfffeff4, 16) = 16
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/.nscd_socket"}, 110) = -1 ENOENT (No such file or directory)
close(3) = 0

But that doesn't make much sense to me, looks like it trying to resolve a name, perhaps. I assume this is a symptom, but not a cause. Here's my laminfo:

LAM/MPI: 7.1.1
Prefix: /usr/local/lam-7.1.1
Architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Fri Jan 14 18:28:27 EST 2005
Configure host: circe198
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: no
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: bproc (API v1.1, Module v1.1)
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: gm (API v1.1, Module v1.2)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)

Any help or insight would be appreciated. My config.log and output from lamboot -d are attached. -- Dale Harris ro...@ma... /.-) |