From: Daniel G. <dg...@ti...> - 2004-06-22 15:29:51
On Tue, Jun 22, 2004 at 08:19:41AM -0700, Brian Barrett wrote:
> I think I'm going to crawl into a corner and sleep for a while. Maybe
> that will help me with my utter stupidity. So the n<integer> notation
> is overloaded on a LAM/BProc cluster. LAM assigns each node under its
> run-time environment a node number, always starting from 0. BProc
> obviously also assigns nodes a node number, with -1 for the MASTER.
> So in your output, we have:
>
> n-1<28809> ssi:boot:bproc: resolved hosts:
> n-1<28809> ssi:boot:bproc: n0 192.168.101.1 --> 192.168.101.1 (origin)
> n-1<28809> ssi:boot:bproc: n1 192.168.101.100 --> 192.168.101.100
> n-1<28809> ssi:boot:bproc: n2 192.168.101.101 --> 192.168.101.101
> n-1<28809> ssi:boot:bproc: n3 192.168.101.102 --> 192.168.101.102
> n-1<28809> ssi:boot:bproc: n4 192.168.101.103 --> 192.168.101.103
> n-1<28809> ssi:boot:bproc: n5 192.168.101.104 --> 192.168.101.104
> n-1<28809> ssi:boot:bproc: found master node (192.168.101.1). Skipping checks.
> n-1<28809> ssi:boot:bproc: n0 node status: up
> n-1<28809> ssi:boot:bproc: n0 access rights not checked.
> n-1<28809> ssi:boot:bproc: n1 node status: up
> n-1<28809> ssi:boot:bproc: n1 access rights not checked.
> n-1<28809> ssi:boot:bproc: n2 node status: up
> n-1<28809> ssi:boot:bproc: n2 access rights not checked.
> n-1<28809> ssi:boot:bproc: n3 node status: up
> n-1<28809> ssi:boot:bproc: n3 access rights not checked.
> n-1<28809> ssi:boot:bproc: n4 node status: up
> n-1<28809> ssi:boot:bproc: n4 access rights not checked.
>
> The list under "resolved hosts" is of the format: <LAM node number>
> <HOSTNAME> <IP>. Because of some BProc-3 behaviors, the hostname is
> already resolved to an IP. The next section is where we look at BProc
> node numbers. As you can see, 101.1 was found to be the master, as it
> should be. So LAM n0 will be BProc n-1, LAM n1 will be BProc n0, etc.
> Confused yet?

Not to worry. I see what is going on, and as long as LAM is not
confused then I am happy with it.
> If you run the "lamnodes" command, you should see all 6 machines
> listed. Next to the master node, you should see a notation like
> (origin, this_node, no_schedule), meaning that it was the node used to
> boot LAM, it is the node lamnodes is running on, and it has been set
> not to be available for scheduling jobs. So aside from the already
> mentioned bproc_nodeinfo() tests that we are just skipping right now,
> I think all looks good.

Yep. Here is what lamnodes had to say:

racaille:dgruner{105}> lamnodes
n0	master:1:no_schedule,origin,this_node
n1	n0:1:
n2	n1:1:
n3	n2:1:
n4	n3:1:
n5	n4:1:

This looks to me like normal behaviour, and I am satisfied that mpi
jobs actually run. Thanks a bunch! (and do take a rest, it is
important...:-)

Daniel

> Brian
>
> On Jun 22, 2004, at 7:06 AM, Daniel Gruner wrote:
>
> > Hi Brian,
> >
> > The latest version in SVN (as of yesterday evening) seems to work fine!
> > I can lamboot, and then mpirun a process compiled with mpif77, and all
> > that jazz...
> >
> > The master is still assigned as n0, but I didn't notice whether the
> > actual mpi code is running on the master or not (the job I tested was
> > too quick). When the job starts it prints:
> >
> > racaille:dgruner{132}> mpirun -np 4 ./fpi
> > Process 1 of 4 is alive
> > Process 0 of 4 is alive
> > Process 2 of 4 is alive
> > Process 3 of 4 is alive
> >
> > and lamboot produced the output in the attached file. It looks ok to
> > me, but I want to make sure that the mpi jobs are NOT run on the
> > master.
> >
> > Regards,
> > Daniel
> >
> > On Sun, Jun 20, 2004 at 08:57:26PM -0700, Brian W. Barrett wrote:
> >> Hey all -
> >>
> >> I think I finally fixed things so LAM really does avoid the
> >> bproc_nodeinfo() call on BProc 4 clusters. Changes should be in SVN.
> >> Let me know if you have any problems. I think this leaves us with
> >> Luke's problem (the master node being identified as n0) as the one
> >> remaining bug.
> >> Still not sure why Luke would see that and Daniel wouldn't.
> >> *shrug*.
> >>
> >> Brian
> >>
> >> --
> >> Brian Barrett
> >> LAM/MPI developer and all around nice guy
> >> Have a LAM/MPI day: http://www.lam-mpi.org/
> >
> > --
> > Dr. Daniel Gruner          dg...@ti...
> > Dept. of Chemistry         dan...@ut...
> > University of Toronto      phone: (416)-978-8689
> > 80 St. George Street       fax: (416)-978-5325
> > Toronto, ON M5S 3H6, Canada   finger for PGP public key
>
> <junk>
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/

--
Dr. Daniel Gruner          dg...@ti...
Dept. of Chemistry         dan...@ut...
University of Toronto      phone: (416)-978-8689
80 St. George Street       fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada   finger for PGP public key