From: Brian B. <brb...@la...> - 2004-06-22 15:23:04
I think I'm going to crawl into a corner and sleep for a while. Maybe
that will help me with my utter stupidity.

So the n<integer> notation is overloaded on a LAM/BProc cluster. LAM
assigns each node in its run-time environment a node number, always
starting from 0. BProc, of course, assigns each node its own node
number, with -1 for the master. So in your output, we have:

n-1<28809> ssi:boot:bproc: resolved hosts:
n-1<28809> ssi:boot:bproc: n0 192.168.101.1 --> 192.168.101.1 (origin)
n-1<28809> ssi:boot:bproc: n1 192.168.101.100 --> 192.168.101.100
n-1<28809> ssi:boot:bproc: n2 192.168.101.101 --> 192.168.101.101
n-1<28809> ssi:boot:bproc: n3 192.168.101.102 --> 192.168.101.102
n-1<28809> ssi:boot:bproc: n4 192.168.101.103 --> 192.168.101.103
n-1<28809> ssi:boot:bproc: n5 192.168.101.104 --> 192.168.101.104
n-1<28809> ssi:boot:bproc: found master node (192.168.101.1). Skipping checks.
n-1<28809> ssi:boot:bproc: n0 node status: up
n-1<28809> ssi:boot:bproc: n0 access rights not checked.
n-1<28809> ssi:boot:bproc: n1 node status: up
n-1<28809> ssi:boot:bproc: n1 access rights not checked.
n-1<28809> ssi:boot:bproc: n2 node status: up
n-1<28809> ssi:boot:bproc: n2 access rights not checked.
n-1<28809> ssi:boot:bproc: n3 node status: up
n-1<28809> ssi:boot:bproc: n3 access rights not checked.
n-1<28809> ssi:boot:bproc: n4 node status: up
n-1<28809> ssi:boot:bproc: n4 access rights not checked.

The list under "resolved hosts" is of the format <LAM node number>
<HOSTNAME> <IP>. Because of some BProc-3 behaviors, the hostname is
already resolved to an IP. The next section is where we look at BProc
node numbers. As you can see, 192.168.101.1 was found to be the
master, as it should be. So LAM n0 will be BProc n-1, LAM n1 will be
BProc n0, etc. Confused yet?
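To make the shifted numbering concrete, here is a minimal sketch of
that mapping (the names below are made up for illustration; this is
not the actual ssi:boot:bproc source):

/* Hypothetical illustration (not the actual ssi:boot:bproc code) of the
 * LAM -> BProc node-number mapping when the boot node (LAM n0) is the
 * BProc master.  BProc numbers the front end as node -1; LAM numbers
 * the boot-schema entries from 0, so every LAM index shifts down by one. */
#include <stdio.h>

static int lam_to_bproc(int lam_node)
{
    return lam_node - 1;   /* n0 -> -1 (master), n1 -> 0, n2 -> 1, ... */
}

int main(void)
{
    int i;
    for (i = 0; i < 6; ++i)
        printf("LAM n%d --> BProc n%d\n", i, lam_to_bproc(i));
    return 0;
}

(That shift is also why the status lines above only go up to n4: the
master, BProc n-1, was skipped, leaving BProc n0 through n4 for the
five slaves.)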
If you run the "lamnodes" command, you should see all 6 machines
listed. Next to the master node you should see a notation like
(origin, this_node, no_schedule), meaning that it was the node used to
boot LAM, it is the node lamnodes is running on, and it has been
marked as unavailable for scheduling jobs.

So aside from the already-mentioned bproc_nodeinfo() tests that we are
currently just skipping, I think all looks good.

Brian

On Jun 22, 2004, at 7:06 AM, Daniel Gruner wrote:

> Hi Brian,
>
> The latest version in SVN (as of yesterday evening) seems to work fine!
> I can lamboot, and then mpirun a process compiled with mpif77, and all
> that jazz...
>
> The master is still assigned as n0, but I didn't notice whether the
> actual mpi code is running on the master or not (the job I tested was
> too quick). When the job starts it prints:
>
> racaille:dgruner{132}> mpirun -np 4 ./fpi
> Process 1 of 4 is alive
> Process 0 of 4 is alive
> Process 2 of 4 is alive
> Process 3 of 4 is alive
>
> and lamboot produced the output in the attached file. It looks ok to
> me, but I want to make sure that the mpi jobs are NOT run on the
> master.
>
> Regards,
> Daniel
>
>
> On Sun, Jun 20, 2004 at 08:57:26PM -0700, Brian W. Barrett wrote:
>> Hey all -
>>
>> I think I finally fixed things so LAM really does avoid the
>> bproc_nodeinfo() call on BProc 4 clusters. Changes should be in SVN.
>> Let me know if you have any problems. I think this leaves Luke's
>> "master node identified as n0" problem as the one remaining bug.
>> Still not sure why Luke would see that and Daniel wouldn't.
>> *shrug*.
>>
>> Brian
>>
>> --
>> Brian Barrett
>> LAM/MPI developer and all around nice guy
>> Have a LAM/MPI day: http://www.lam-mpi.org/
>
> --
>
> Dr. Daniel Gruner                   dg...@ti...
> Dept. of Chemistry                  dan...@ut...
> University of Toronto               phone: (416)-978-8689
> 80 St. George Street                fax: (416)-978-5325
> Toronto, ON M5S 3H6, Canada         finger for PGP public key
>
> <junk>

--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/