From: Brian B. <brb...@la...> - 2004-06-22 15:23:04
I think I'm going to crawl into a corner and sleep for a while. Maybe
that will help me with my utter stupidity.

So the n<integer> notation is overloaded on a LAM/BProc cluster. LAM
assigns each node in its run-time environment a node number, always
starting from 0. BProc, of course, assigns each node its own node
number, with -1 for the master. So in your output, we have:

n-1<28809> ssi:boot:bproc: resolved hosts:
n-1<28809> ssi:boot:bproc: n0 192.168.101.1 --> 192.168.101.1 (origin)
n-1<28809> ssi:boot:bproc: n1 192.168.101.100 --> 192.168.101.100
n-1<28809> ssi:boot:bproc: n2 192.168.101.101 --> 192.168.101.101
n-1<28809> ssi:boot:bproc: n3 192.168.101.102 --> 192.168.101.102
n-1<28809> ssi:boot:bproc: n4 192.168.101.103 --> 192.168.101.103
n-1<28809> ssi:boot:bproc: n5 192.168.101.104 --> 192.168.101.104
n-1<28809> ssi:boot:bproc: found master node (192.168.101.1). Skipping checks.
n-1<28809> ssi:boot:bproc: n0 node status: up
n-1<28809> ssi:boot:bproc: n0 access rights not checked.
n-1<28809> ssi:boot:bproc: n1 node status: up
n-1<28809> ssi:boot:bproc: n1 access rights not checked.
n-1<28809> ssi:boot:bproc: n2 node status: up
n-1<28809> ssi:boot:bproc: n2 access rights not checked.
n-1<28809> ssi:boot:bproc: n3 node status: up
n-1<28809> ssi:boot:bproc: n3 access rights not checked.
n-1<28809> ssi:boot:bproc: n4 node status: up
n-1<28809> ssi:boot:bproc: n4 access rights not checked.

The list under "resolved hosts" is of the format <LAM node number>
<HOSTNAME> <IP>. Because of some BProc-3 behaviors, the hostname is
already resolved to an IP. The next section is where we look at BProc
node numbers. As you can see, 192.168.101.1 was found to be the
master, as it should be. So LAM n0 will be BProc n-1, LAM n1 will be
BProc n0, etc. Confused yet?
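To make the shifted numbering concrete, here is a minimal sketch of
that mapping (the names below are made up for illustration; this is
not the actual ssi:boot:bproc source):

/* Hypothetical illustration (not the actual ssi:boot:bproc code) of the
 * LAM -> BProc node-number mapping when the boot node (LAM n0) is the
 * BProc master.  BProc numbers the front end as node -1; LAM numbers
 * the boot-schema entries from 0, so every LAM index shifts down by one. */
#include <stdio.h>

static int lam_to_bproc(int lam_node)
{
    return lam_node - 1;   /* n0 -> -1 (master), n1 -> 0, n2 -> 1, ... */
}

int main(void)
{
    int i;
    for (i = 0; i < 6; ++i)
        printf("LAM n%d --> BProc n%d\n", i, lam_to_bproc(i));
    return 0;
}

(That shift is also why the status lines above only go up to n4: the
master, BProc n-1, was skipped, leaving BProc n0 through n4 for the
five slaves.)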
If you run the "lamnodes" command, you should see all 6 machines
listed. Next to the master node you should see a notation like
(origin, this_node, no_schedule), meaning that it was the node used to
boot LAM, it is the node lamnodes is running on, and it has been
marked as unavailable for scheduling jobs.

So aside from the already-mentioned bproc_nodeinfo() tests that we are
currently just skipping, I think all looks good.

Brian

On Jun 22, 2004, at 7:06 AM, Daniel Gruner wrote:

> Hi Brian,
>
> The latest version in SVN (as of yesterday evening) seems to work fine!
> I can lamboot, and then mpirun a process compiled with mpif77, and all
> that jazz...
>
> The master is still assigned as n0, but I didn't notice whether the
> actual mpi code is running on the master or not (the job I tested was
> too quick). When the job starts it prints:
>
> racaille:dgruner{132}> mpirun -np 4 ./fpi
> Process 1 of 4 is alive
> Process 0 of 4 is alive
> Process 2 of 4 is alive
> Process 3 of 4 is alive
>
> and lamboot produced the output in the attached file. It looks ok to
> me, but I want to make sure that the mpi jobs are NOT run on the
> master.
>
> Regards,
> Daniel
>
>
> On Sun, Jun 20, 2004 at 08:57:26PM -0700, Brian W. Barrett wrote:
>> Hey all -
>>
>> I think I finally fixed things so LAM really does avoid the
>> bproc_nodeinfo() call on BProc 4 clusters. Changes should be in SVN.
>> Let me know if you have any problems. I think this leaves Luke's
>> "master node identified as n0" problem as the one remaining bug.
>> Still not sure why Luke would see that and Daniel wouldn't.
>> *shrug*.
>>
>> Brian
>>
>> --
>> Brian Barrett
>> LAM/MPI developer and all around nice guy
>> Have a LAM/MPI day: http://www.lam-mpi.org/
>
> --
>
> Dr. Daniel Gruner                   dg...@ti...
> Dept. of Chemistry                  dan...@ut...
> University of Toronto               phone: (416)-978-8689
> 80 St. George Street                fax: (416)-978-5325
> Toronto, ON M5S 3H6, Canada         finger for PGP public key
>
> <junk>

--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/