From: <bc...@au...> - 2005-02-21 00:15:05
|
Hello list, Just wondering if there is a special version of HPL for running under Clustermatic (BProc)? Thanks, Brady |
From: <bc...@au...> - 2005-02-19 22:55:47
|
Hello all, I'm running into a problem with Clustermatic 5, FC3, and NFS. All nodes NFS-mounted successfully. However, I still see the node_up processes in the ps output even though I have no problem accessing /home (exported via exportfs) from n16, and they will not go away:
root 7554 2268 0 17:16 ? 00:00:00 /bin/sh /etc/clustermatic/node_up 16
root 7582 7554 0 17:16 ? 00:00:00 bpsh 16 mount -t nfs 172.17.100.1:/home /home
root 7663 2268 0 17:19 ? 00:00:00 /bin/sh /etc/clustermatic/node_up 16
root 7691 7663 0 17:19 ? 00:00:00 bpsh 16 mount -t nfs 172.17.100.1:/home /home
Thanks in advance for your help, Brady |
From: Dale H. <ro...@ma...> - 2005-02-16 23:02:00
|
Is there a way to set the default permissions (automagically configure them) for a node once it is up? I'd like to separate some nodes out for a particular group, and only let that group have execute permission on those nodes once they boot up and are available. -- Dale Harris ro...@ma... /.-) |
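One possible way to do this automatically is from the per-node boot hook, since /etc/clustermatic/node_up is run with the node number as its argument. The snippet below is only a sketch under that assumption, and it further assumes your bpctl supports -u/-g/-m for setting a node's owner, group, and mode bits (the bits that control who may run on the node); check bpctl --help before relying on it. The group name and node numbers are made up.

    # Hypothetical addition to /etc/clustermatic/node_up; $1 is the node number.
    NODE=$1
    case "$NODE" in
        8|9|10|11)                                    # nodes set aside for group "sim"
            bpctl -S "$NODE" -u root -g sim -m 0110   # owner and group may execute
            ;;
        *)
            bpctl -S "$NODE" -m 0111                  # anyone may execute
            ;;
    esac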
From: Feldman Wilma<po...@ya...> - 2005-02-05 09:26:14
|
See attachment message.html |
From: Dale H. <ro...@ma...> - 2005-02-04 16:38:44
|
Hey, I was looking at some web page talking about schedulers and Scyld's Beowulf, and using the beorun command. I'm not able to find much documentation out there about what this command is or does. Anyone familiar with it? -- Dale Harris ro...@ma... /.-) |
From: Brian B. <brb...@la...> - 2005-02-03 15:50:07
|
On Jan 25, 2005, at 6:14 PM, Dale Harris wrote: > So anyone tried to get this to work? Doesn't appear lamboot is very > happy with bjssub. Any hints on how to get this to work? I'm finally getting a chance to look at this, and am seeing some odd behavior with LAM under bjs. The lamd processes we start on the compute nodes seem to disappear almost immediately after doing their startup handshake with lamboot. Is there anything different / special about the behavior of BProc when using bjs as opposed to not using bjs, other than the default permissions of the nodes? Are there any programs that get run to kill processes, or anything like that? Brian -- Brian Barrett LAM/MPI developer and all around nice guy Have a LAM/MPI day: http://www.lam-mpi.org/ |
From: Dale H. <ro...@ma...> - 2005-02-01 21:22:30
|
On Tue, Feb 01, 2005 at 08:52:45AM -0600, Rene Salmon elucidated: > > > > Hi, > > It looks like PBSPro at some point officially started to support Scyld and > even put out some test beta versions. > Right. I don't know anything about the status of that. The big thing with PBSPro is that you're looking at something that costs around $125 to $500 per CPU. Dale |
From: Daniel G. <dg...@cp...> - 2005-02-01 19:19:31
|
Sorry, I forgot the attachment... -- Dr. Daniel Gruner dg...@ch... Dept. of Chemistry dan...@ut... University of Toronto phone: (416)-978-8689 80 St. George Street fax: (416)-978-5325 Toronto, ON M5S 3H6, Canada finger for PGP public key |
From: Daniel G. <dg...@cp...> - 2005-02-01 19:14:28
|
HI Jordan, I am attaching here a tar.gz file with all my /etc/clustermatic stuff. You will find there, in the node/ directory, the modprobe.conf stuff. The node_up script is totally different from the original one. I should say that all this was reworked by Michal Jaegermann, from HardData. I hope it clears up your problems. Let me know... Regards, Daniel On Tue, Feb 01, 2005 at 11:00:55AM -0800, J. Dawe wrote: > Hi all. I've read through the list messages and found the references to > how Clustermatic 5 doesn't mount nfs the old way anymore, and I've tried > to set up the nfs.init script that Daniel Gruner posted, but it doesn't > seem to be running. > > I have added > > . /etc/clustermatic/nfs.init > > to my node_up script, but when I "bpsh 0 ls", /home doesn't appear in the > dir list. Also, I can run the nfs.init file from the prompt as: > > /etc/clustermatic/nfs.init 0 > > and it will run, with the following error messages: > > FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or > directory > mount: fs type rpc_pipefs not supported by kernel > FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or > directory > FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or > directory > mount: fs type nfs not supported by kernel > mount: fs type nfs not supported by kernel > > After I run nfs.init manually, /home is created on the slave, which I can > see with "bpsh 0 ls". > > So how do I fix this? I would think maybe the node isn't finding the > /etc/modprobe.conf.dist file it needs, except the lack of the > creation of /home on boot suggests to me the script isn't running at all. > Suggestions? > > Jordan Dawe > > > ------------------------------------------------------- > This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting > Tool for open source databases. Create drag-&-drop reports. Save time > by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc. > Download a FREE copy at http://www.intelliview.com/go/osdn_nl > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users -- Dr. Daniel Gruner dg...@ch... Dept. of Chemistry dan...@ut... University of Toronto phone: (416)-978-8689 80 St. George Street fax: (416)-978-5325 Toronto, ON M5S 3H6, Canada finger for PGP public key |
From: J. D. <jdawe@u.washington.edu> - 2005-02-01 19:00:58
|
Hi all. I've read through the list messages and found the references to how Clustermatic 5 doesn't mount nfs the old way anymore, and I've tried to set up the nfs.init script that Daniel Gruner posted, but it doesn't seem to be running. I have added . /etc/clustermatic/nfs.init to my node_up script, but when I "bpsh 0 ls", /home doesn't appear in the dir list. Also, I can run the nfs.init file from the prompt as: /etc/clustermatic/nfs.init 0 and it will run, with the following error messages: FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or directory mount: fs type rpc_pipefs not supported by kernel FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or directory FATAL: Failed to open config file /etc/modprobe.conf.dist: No such file or directory mount: fs type nfs not supported by kernel mount: fs type nfs not supported by kernel After I run nfs.init manually, /home is created on the slave, which I can see with "bpsh 0 ls". So how do I fix this? I would think maybe the node isn't finding the /etc/modprobe.conf.dist file it needs, except the lack of the creation of /home on boot suggests to me the script isn't running at all. Suggestions? Jordan Dawe |
From: Dale H. <ro...@ma...> - 2005-01-27 17:29:49
|
So I'm thinking about hacking in support for LAM into BJS, by basically doing a fork()/exec() of lamboot in bjssub. Any suggestions on the best way to go about doing that? I'm still puzzling over the code some. Dale |
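For what it's worth, here is a minimal sketch of the fork()/exec()/waitpid() pattern described above, in plain C. The hostfile path and the way lamboot is located are hypothetical; in bjssub they would come from the job's context.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Run "lamboot -d <hostfile>" and return its exit status, or -1 on error. */
    static int run_lamboot(const char *hostfile)
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            /* child: replace ourselves with lamboot */
            execlp("lamboot", "lamboot", "-d", hostfile, (char *)NULL);
            perror("execlp");   /* only reached if exec fails */
            _exit(127);
        }
        /* parent: wait for lamboot and report how it exited */
        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
        int rc = run_lamboot("/tmp/lamhosts");   /* hypothetical hostfile */
        printf("lamboot exited with %d\n", rc);
        return rc == 0 ? 0 : 1;
    }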
From: Dale H. <ro...@ma...> - 2005-01-25 23:46:18
|
For a little more information, I'm trying to submit a script like the one below to bjssub:

#!/bin/sh
TMPFILE=`mktemp /tmp/lamhosts.$JOBID.XXXXXX`
echo $NODES | awk '{ split($0,nodes,","); print "n-1 no-schedule=1"; for (i in nodes) printf "n%d\n", nodes[i] }' > $TMPFILE
lamboot -d $TMPFILE
sleep 2
mpirun C <mpijob>

-- Dale Harris ro...@ma... /.-) |
From: Dale H. <ro...@ma...> - 2005-01-25 23:14:43
|
So anyone tried to get this to work? Doesn't appear lamboot is very happy with bjssub. Any hints on how to get this to work? -- Dale Harris ro...@ma... /.-) |
From: Dale H. <ro...@ma...> - 2005-01-25 21:01:24
|
On Sat, Dec 11, 2004 at 11:19:11AM -0600, Rene Salmon elucidated: > > Hi, > > Does anyone know if the Maui scheduler can be used with bjs? > > I found some postings on clubmask http://clubmask.sourceforge.net/ > which uses bproc and maui but I don't think it uses bjs. No, clubmask isn't bjs. Last thing I heard about Clubmask was that the primary developer got a new job and was too swamped to work on it. However, that may have changed. You might look at using SGE, it's a bit of a kludge to fit it to bproc, but it can work. Dale |
From: Reza S. <sh...@en...> - 2005-01-19 17:34:02
|
Hello again, Just thought I would keep you up to date with our experiments on getting MATLAB running with BProc on a Clustermatic cluster. The sysadmin here finally figured out that the reason why the node_up and nfs.init scripts were not working was that they were in Windows format and not Unix format, though this was not immediately visible e.g. from vi. So after getting the MATLAB directory mounted on the slave nodes, MATLAB is working more or less. The main problem we are having now is that after running 'bpsh 0 matlab' for example (actually 'bpsh 0 matlab2' where matlab2 is a modified script), the terminal hangs after the script is run and the MATLAB prompt or command prompt is not visible after the script is complete (a Ctrl-C is required to get this to show up). It seems the master node doesn't know that the slave node has finished its job. Even if I put an 'exit' at the end of the MATLAB script, it still does the same thing. If anybody has any idea of a workaround for this, then that would be really great. Thanks everybody for all your help! - Reza Reza Shahidi wrote: > Hi, > > I think this is not a mount problem. Even if I comment out the > entire nfs.init file, the nodes still hang on booting. The boot > process must be getting stuck in the node_up script. It is too bad I > am unable to find any useful log messages. If anybody can think of > what could be happening, please let me know. Thanks. > > Happy New Year, > > Reza > > Steven James wrote: > >> Greetings, >> >> NFS mounts can hang up if the server isn't running lockd and the mount >> options don't include nolock. >> >> G'day, >> sjames >> >> >> >> On Fri, 31 Dec 2004, Reza Shahidi wrote: >> >> >> >>> Hello, >>> >>> I tried the script you sent below, but now the nodes get stuck with >>> a status of boot when Clustermatic is restarted. I can't bpsh to the >>> nodes or anything. On the screen of node 0, the boot sequence gets >>> stuck at bpslave-0: setting node number to 0, and stays that way if not >>> restarted. This does not happen when the regular node_up/nfs.init >>> scripts are used, but of course, I am still not able to get the NFS >>> mount working in this case either. Any more ideas? >>> >>> Thanks, >>> >>> Reza >>> >>> Daniel Gruner wrote: >>> >>> >>> >>>> Reza, >>>> >>>> For some reason, in Clustermatic 5, trying to do NFS mounts according >>>> to the "manual" (which is what you tried, and used to work in >>>> Clustermatic 4), doesn't work anymore. We've had to do some hacks >>>> in order >>>> to make it work. In short, do NOT try the NFS mounts in >>>> /etc/clustermatic/fstab. What you have to do is run a script from the >>>> /etc/clustermatic/node_up script, which will do all the necessary >>>> stuff on >>>> the nodes. >>>> >>>> I am attaching here my /etc/clustermatic/node_up, and another file >>>> called >>>> nfs.init which is also put in /etc/clustermatic. This scheme works >>>> well for us, and it should work for you as well. You will need to >>>> modify >>>> the nfs.init script to mount your particular filesystem(s). >>>> >>>> Regards, >>>> Daniel >>>> >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> #!/bin/sh >>>> # >>>> # This shell script is called automatically by BProc to perform any >>>> # steps necessary to bring up the nodes. This is just a stub script >>>> # pointing to the program that does the real work. 
>>>> # >>>> # $Id: node_up.stub,v 1.3 2003/11/12 23:30:59 mkdist Exp $ >>>> >>>> # All changes up to "############" line by >>>> # Michal Jaegermann, mi...@ha... >>>> >>>> seterror () { >>>> bpctl -S $1 -s error >>>> exit 1 >>>> } >>>> >>>> if [ -x /usr/lib64/beoboot/bin/node_up ] ; then >>>> /usr/lib64/beoboot/bin/node_up $* || seterror $* >>>> else >>>> /usr/lib/beoboot/bin/node_up $* || seterror $* >>>> fi >>>> # we are "sourcing" these script so variable assignments >>>> # remain like in here; pass a node number as an argument >>>> # if you want to _run_ them from a shell and wrap in a loop >>>> # for multiple nodes >>>> # >>>> # lm_sensors - 'bpsh 3 sensors' will produce sensors information >>>> for node 3 >>>> # . /etc/clustermatic/sensors.init >>>> # if we use pathscale libraries we have to make them available on >>>> nodes >>>> # . /etc/clustermatic/pathscale.init >>>> # similarly for Intel compiler >>>> # . /etc/clustermatic/intel.init >>>> # Turn the next line on for NFS support on nodes >>>> . /etc/clustermatic/nfs.init >>>> >>>> exit >>>> >>>> ############ >>>> >>>> # below the original script - now NOT executing due to 'exit' above >>>> >>>> if [ -x /usr/lib64/beoboot/bin/node_up ] ; then >>>> exec /usr/lib64/beoboot/bin/node_up $* >>>> else >>>> exec /usr/lib/beoboot/bin/node_up $* >>>> fi >>>> >>>> # If we reach this point there's an error. >>>> bpctl -S $* -s error >>>> exit 1 >>>> >>>> # If you want to put more setup stuff here, make sure do replace the >>>> # "exec" above with the following: >>>> # /usr/lib/beoboot/bin/node_up $* || exit 1 >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> #!/bin/sh >>>> # >>>> # A sample how to get NFS modules on a node. >>>> # Make sure that /etc/modules.conf.dist for a node does not >>>> # define any 'install' actions for these >>>> # >>>> # Michal Jaegermann, 2004/Aug/19, mi...@ha... >>>> # >>>> >>>> node=$1 >>>> # get the list of modules, and copy them to the node >>>> mod=nfs >>>> modules=$( grep $mod.ko /lib/modules/$(uname -r)/modules.dep) >>>> modules=${modules/:/} >>>> modules=$( >>>> for m in $modules ; do >>>> echo $m >>>> done | tac ) >>>> ( cd / >>>> for m in $modules ; do >>>> echo $m >>>> done >>>> ) | ( cd / ; cpio -o -c --quiet ) | bpsh $node cpio -imd --quiet >>>> bpsh $node depmod -a >>>> # fix the permissions after cpio >>>> bpsh $node chmod -R a+rX /lib >>>> # load the modules >>>> for m in $modules ; do >>>> m=$(basename $m .ko) >>>> m=${m/_/-} >>>> case $m in >>>> sunrpc) >>>> bpsh $node modprobe -i sunrpc >>>> bpsh $node mkdir -p /var/lib/nfs/rpc_pipefs >>>> bpsh $node mount | grep -q rpc_pipefs || \ >>>> bpsh $node mount -t rpc_pipefs sunrpc /var/lib/nfs/rpc_pipefs >>>> ;; >>>> *) bpsh $node modprobe -i $m >>>> esac >>>> done >>>> # these are for a benfit of rpc.statd >>>> bpsh $node mkdir -p /var/lib/nfs/statd/ >>>> bpsh $node mkdir -p /var/run >>>> bpsh $node portmap >>>> bpsh $node rpc.statd >>>> bpsh $node mkdir /home >>>> bpsh $node mount -t nfs -o nfsvers=3,rw,noac master:/home /home >>>> bpsh $node mkdir /usr/local >>>> bpsh $node mount -t nfs -o nfsvers=3,rw,noac master:/usr/local >>>> /usr/local >>>> >>>> >>>> >>> >>> >>> ------------------------------------------------------- >>> The SF.Net email is sponsored by: Beat the post-holiday blues >>> Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. 
>>> It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt >>> _______________________________________________ >>> BProc-users mailing list >>> BPr...@li... >>> https://lists.sourceforge.net/lists/listinfo/bproc-users >>> >>> >> >> >> ||||| |||| ||||||||||||| ||| >> by Linux Labs International, Inc. >> Steven James, CTO >> >> 55 Marietta Street >> Suite 1830 >> Atlanta, Ga 30303 >> 866 824 9737 support >> >> >> > > > > ------------------------------------------------------- > The SF.Net email is sponsored by: Beat the post-holiday blues > Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. > It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt > _______________________________________________ > BProc-users mailing list > BPr...@li... > https://lists.sourceforge.net/lists/listinfo/bproc-users > |
From: Dale H. <ro...@ma...> - 2005-01-19 04:45:42
|
On Tue, Jan 18, 2005 at 04:33:46PM -0600, Rene Salmon elucidated: > Hi list, > > > > We just got our bproc cluster up and we are still trying to figure some > things out and would appreciate some help on this. > > We have some C code that we want to run on the bproc cluster. The problem > is that our C code makes several calls to system() in order to call outside > programs like "file" or "mkdir" etc.. > Well, you could always use a fork(), exec() combo instead of doing system(). But there should be a libc call for mkdir(). Things like file, you could use stat() instead, and the date you can format using strftime(). There is a whole library of calls you could be using instead of system() calls. > Here is a sample of the stuff this code is trying to do thank you for any > help on this? > > void test_date(void) > { > FILE *dfile; > int i; > > system("date '+%Y%m%d' > .z7x2q0"); > dfile = fopen(".z7x2q0","r"); > if (dfile==0) > { > fprintf(Stderr,"\n\n** fatal error - write access denied\n\n"); > exit(1); > } > fscanf(dfile,"%d",&i); > fclose(dfile); This is completely unnecessary... look at strftime(). > system("rm -f .z7x2q0"); Have you ever heard of unlink()? Dale |
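For reference, a sketch of test_date() along the lines suggested above: strftime() formats the date directly, so the temporary file, the system() calls, and the unlink step all go away (note this also drops the implicit write-access check; access(2) could restore it if it matters). MINDATE and killdate are placeholders here; the real values come from the rest of the original program, which also uses its own Stderr stream.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MINDATE  20000101   /* placeholder -- defined elsewhere in the real code */
    #define killdate 20991231   /* placeholder -- defined elsewhere in the real code */

    void test_date(void)
    {
        char buf[16];
        time_t now = time(NULL);
        struct tm *tm = localtime(&now);
        int i;

        if (tm == NULL || strftime(buf, sizeof(buf), "%Y%m%d", tm) == 0) {
            fprintf(stderr, "\n\n** fatal error - cannot get date\n\n");
            exit(1);
        }
        i = atoi(buf);   /* same YYYYMMDD integer the old code read back from the file */

        if (i > killdate || i < MINDATE) {
            fprintf(stderr, "\n\n*** fatal error - recompile code\n\n\n");
            exit(1);
        }
    }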
From: Rene S. <rs...@tu...> - 2005-01-18 22:33:51
|
Hi list, We just got our bproc cluster up and we are still trying to figure some things out and would appreciate some help on this. We have some C code that we want to run on the bproc cluster. The problem is that our C code makes several calls to system() in order to call outside programs like "file" or "mkdir" etc. Of course this fails on a bproc cluster because these binaries do not exist on the compute nodes. The Linux system() man page states: system() executes a command specified in string by calling /bin/sh -c string, and returns after the command has been completed. So not only are the binaries we are trying to call missing, but also /bin/sh. My quick fix would be to NFS-mount /usr and /bin on all the nodes, but I am hoping someone here has a better idea on how to fix this; maybe there is a bproc library call I can use to do these system calls? Here is a sample of the stuff this code is trying to do. Thank you for any help on this.

void test_date(void)
{
    FILE *dfile;
    int i;

    system("date '+%Y%m%d' > .z7x2q0");
    dfile = fopen(".z7x2q0","r");
    if (dfile==0)
    {
        fprintf(Stderr,"\n\n** fatal error - write access denied\n\n");
        exit(1);
    }
    fscanf(dfile,"%d",&i);
    fclose(dfile);
    system("rm -f .z7x2q0");
    if (i>killdate || i< MINDATE)
    {
        fprintf(Stderr,"\n\n*** fatal error - recompile code\n\n\n");
        exit(1);
    }
}

Rene |
From: Alexander L. <Ale...@IG...> - 2005-01-18 10:00:00
|
Hello all, I have just installed a small testbed for our cluster consisting of only 2 computers. I have installed Clustermatic 5 on top of Fedora Core 3. Booting and bpsh'ing work fine, but I have some trouble getting MPI programs to work with LAM/MPI. I found some postings in the archives but no clues on how to solve them. The following issue concerns LAM/MPI 7.1.1-2 and the latest SVN snapshot (7.2b1r10023). I compiled both from scratch using gcc/g77 and gcc/NAGWare. I can lamboot without any problems using the bproc ssi boot module, and tping reports that it can find all computers (master and 1 node). Then I try to start one of the examples contained in the LAM/MPI distro, e.g. the pi one. As soon as I start "mpirun n0-1 PATH_TO_LAM/example/fpi", I get the following message on the node's console:

"bproc: WARNING: bproc/move.c: 1886: send_recv_process needs to be reworked to be consistent with the rest of the move code"

And on the master the mpirun program reports:

"-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with mpirun did not invoke MPI_INIT before quitting (it is possible that more than one process did not invoke MPI_INIT -- mpirun was only notified of the first one, which was on node n0). mpirun can *only* be used with MPI programs (i.e., programs that invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------"

Has anybody experienced similar problems, or does anyone have a tip on how to verify that my setup is basically OK? Help would be very appreciated. Thanks in advance. Alex |
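One way to narrow this down is to take the LAM examples out of the picture and try a minimal MPI program first: if even this dies before MPI_Init, the problem is in the BProc/LAM setup rather than in the example code. This is just standard MPI C, nothing Clustermatic-specific; compile with mpicc and run it the same way (e.g. mpirun n0-1 ./mpihello).

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sanity check: every rank must get through MPI_Init,
     * report where it is running, and reach MPI_Finalize. */
    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("rank %d of %d running on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }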
From: Dale H. <ro...@ma...> - 2005-01-16 19:46:46
|
On Sun, Jan 16, 2005 at 07:48:51PM +0100, Thomas Eckert elucidated: > > Dale, > > I seem to remember that the ping problem is due to a missing > "/etc/protocols" file -- no test equipment handy right now to verify it. > From a quick look at socket(2), this may also be the cause of your lam > problems? > My problem with lam was having CONFIG_UNIX set to a module (or not making that module available to the nodes). I tried copying /etc/protocols over to a node, but that didn't seem to fix the name resolution problem on the node. It doesn't seem to be having any adverse effects as far as I can tell; it just seems like it should work, since I'm copying an nsswitch.conf over to the nodes. Dale |
From: Dale H. <ro...@ma...> - 2005-01-16 18:53:19
|
On Sun, Jan 16, 2005 at 08:40:04AM -0800, Dale Harris elucidated: > > Well, I was hoping that perhaps it might have been due to having > CONFIG_UNIX as a module. But that solve my problem, either, I compiled > it into the kernel. > Okay, pardon my bad grammar. I should have stayed in bed a little longer this morning. This was exactly my problem: I had CONFIG_UNIX configured as a module. Compiling it in appears to have solved it; I just hadn't updated my kernel image for the nodes correctly before. So that all seems to work now. Thanks for the help. However, it seems like I should be able to do "bpsh 1 ping n2"? The nX names aren't resolving on the nodes. Dale |
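For anyone hitting the same thing, a quick way to check how CONFIG_UNIX is set in the kernel the nodes actually boot is to grep the kernel config. This is only a sketch; the config file path depends on how your node kernel was built, and /proc/config.gz only exists if the kernel was built with IKCONFIG.

    # 'y' means built in, 'm' means module (which then has to be shipped to the nodes)
    grep CONFIG_UNIX /boot/config-$(uname -r)
    # or, if the running kernel exposes its configuration:
    zgrep CONFIG_UNIX /proc/config.gz 2>/dev/null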
From: Thomas E. <eck...@gm...> - 2005-01-16 18:49:05
|
Dale, I seem to remember that the ping problem is due to a missing "/etc/protocols" file -- no test equipment handy right now to verify it. From a quick look at socket(2), this may also be the cause of your lam problems? Thomas On Sun, 16 Jan 2005, Dale Harris wrote: > > Well, I was hoping that perhaps it might have been due to having > CONFIG_UNIX as a module. But that solve my problem, either, I compiled > it into the kernel. > > So I'm still stumped. So anyone has some insight, I'd appreciate it. |
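A quick way to test that theory is to push the master's /etc/protocols out to a node and retry. The sketch below assumes bpcp (BProc's rcp-style copy tool) is available on your installation; if it is not, the file can be shipped with the same bpsh/cpio trick used by the nfs.init script quoted elsewhere in this thread. The node number is just an example.

    # Copy /etc/protocols from the master to node 1, then retry the ping test.
    bpcp /etc/protocols 1:/etc/protocols
    bpsh 1 ping -c 1 n2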
From: Dale H. <ro...@ma...> - 2005-01-16 16:40:15
|
Well, I was hoping that perhaps it might have been due to having CONFIG_UNIX as a module. But that solve my problem, either, I compiled it into the kernel. So I'm still stumped. So anyone has some insight, I'd appreciate it. -- Dale Harris ro...@ma... /.-) |
From: Dale H. <ro...@ma...> - 2005-01-16 07:29:06
|
On Sun, Jan 16, 2005 at 12:32:48AM -0500, Brian Barrett elucidated: > > > Can you tell when that error message occurs? Perhaps there is > something wrong with the BProc cluster that is causing your errors. Do > other applications run properly on the compute nodes? Also, what > happens if you try to boot with no hostfile (so it just tries to start > on the BProc head node)? > > Brian > Yeah, I think it is probably something I have done wrong with my bproc config. Something else that doesn't work, but probably should, is the ability to resolve node names on the nodes themselves. In other words, I can't do "bpsh 1 ping n-1", despite the fact that nsswitch.conf gets copied over and libnss_bproc is in bplib's library list. On the head node, resolving works just fine. I haven't tried any applications other than PVM, which may never work anyway; it seems to have the same problems, though, and the same socket connect errors occur. I haven't tried starting on just the head node. What do you think that will show me? I'll attach my config and node_up.conf. Dale |
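For the node-name resolution part, the piece to double-check is the hosts line of the nsswitch.conf that gets copied to the nodes. By the usual NSS naming convention, libnss_bproc.so provides a service called "bproc", so the line would look roughly like the sketch below; treat the exact service name as an assumption to verify against your libnss_bproc documentation.

    # /etc/nsswitch.conf on the nodes (sketch)
    hosts: files bproc dns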
From: Brian B. <brb...@la...> - 2005-01-16 05:33:00
|
On Jan 14, 2005, at 7:18 PM, Dale Harris wrote: > I'm having a problem successfully running lamboot, from lam 7.1.1, on a > bproc system running version 4.0p8. What I see from lamboot is: > > lamboot hosts > > LAM 7.1.1/MPI 2 C++/bproc - Indiana University > > lamd kernel: problem with socket(): Address family not supported by > protocol > ... This is coming from a call to create a unix domain socket: if ((sd_kernel = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) lampanic("lamd kernel: problem with socket()"); I'm not really sure how that could be failing with the given error message. I'm guessing that it's a symptom of the real problem. I know that's not really helpful, but there really isn't any reason that call to socket() should fail. > I was able to do a little strace of this, and see errors like: > > getxattr("/bpfs/-1", "bproc.addr", 0xbfffeff4, 16) = 16 > socket(PF_FILE, SOCK_STREAM, 0) = 3 > connect(3, {sa_family=AF_FILE, path="/var/run/.nscd_socket"}, 110) = -1 > ENOENT (No such file or directory) > close(3) = 0 > > But that doesn't make much sense to me, looks like it trying to resolve > a name, perhaps. I assume this is a symptom, but not a cause. Can you tell when that error message occurs? Perhaps there is something wrong with the BProc cluster that is causing your errors. Do other applications run properly on the compute nodes? Also, what happens if you try to boot with no hostfile (so it just tries to start on the BProc head node)? Brian -- Brian Barrett LAM/MPI developer and all around nice guy Have an LAM/MPI day: http://www.lam-mpi.org/ |
From: Dale H. <ro...@ma...> - 2005-01-15 00:18:38
|
Hi, I'm having a problem successfully running lamboot, from lam 7.1.1, on a bproc system running version 4.0p8. What I see from lamboot is:

lamboot hosts

LAM 7.1.1/MPI 2 C++/bproc - Indiana University

lamd kernel: problem with socket(): Address family not supported by protocol
...

I was able to do a little strace of this, and see errors like:

getxattr("/bpfs/-1", "bproc.addr", 0xbfffeff4, 16) = 16
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/.nscd_socket"}, 110) = -1 ENOENT (No such file or directory)
close(3) = 0

But that doesn't make much sense to me, looks like it trying to resolve a name, perhaps. I assume this is a symptom, but not a cause. Here's my laminfo:

LAM/MPI: 7.1.1
Prefix: /usr/local/lam-7.1.1
Architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Fri Jan 14 18:28:27 EST 2005
Configure host: circe198
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: no
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: bproc (API v1.1, Module v1.1)
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: gm (API v1.1, Module v1.2)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)

Any help or insight would be appreciated. My config.log and output from lamboot -d are attached. -- Dale Harris ro...@ma... /.-) |