From: Greg W. <gw...@la...> - 2004-06-11 00:54:02
Brian,

Actually BProc uses -1 to designate the master node. Slave nodes are
numbered starting from 0, and in the BProc world the master is treated
differently from the slaves. You could try calling bproc_nodeinfo() with
-1 for the first argument. I would try and avoid running any
applications on the master, since it is critical to the operation of the
cluster.

Also, your error messages seem to indicate you're trying to look up node
-3. This is probably wrong.

Regards,

Greg

On Jun 10, 2004, at 11:42 AM, Brian Barrett wrote:

> Hi all -
>
> I'm running into some problems with making LAM work on BProc 4.0 and I
> was hoping someone would shed some light on this...
>
> First, I have some code Greg Watson sent a month or so ago to iterate
> over the results of bproc_nodelist to translate from IP address to
> node number. bproc_nodelist doesn't seem to include the head node -
> is that expected? If so, is there a way to get the address of the
> head node so I can special case that check?
>
> Second, Daniel Gruner is running into problems on his cluster with LAM
> (it's a bproc-4 cluster), and it looks like it is because LAM's calls
> to bproc_nodeinfo() are all failing (we do some sanity checks on
> permissions before launch, just to pretty print to the user if we find
> something wrong). The following is in a for loop iterating across the
> list of nodeids we've resolved:
>
>   if (target_node != -1){
>     if (bproc_nodeinfo(target_node, &ninfo) < 0) {
>       lam_debug(lam_ssi_boot_did, "bproc: n%d bproc_nodeinfo failed (%s)\n",
>                 target_node, strerror(errno));
>       sfh_argv_add(&node_downc, &node_downv, nodes[i].lnd_hname);
>     }
>
>     /* Do stuff here... */
>
>   }
>
> When the code is run, I see ( the node status: printf is from earlier
> in the for loop):
>
> n-1<10200> ssi:boot:bproc: n-3 nodestatus failed (-1)
> n-1<10200> ssi:boot:bproc: n-3 node status down, failure
> n-1<10200> ssi:boot:bproc: n0 node status: up
> n-1<10200> ssi:boot:bproc: n0 bproc_nodeinfo failed (Unknown error 300)
> n-1<10200> ssi:boot:bproc: n1 node status: up
> n-1<10200> ssi:boot:bproc: n1 bproc_nodeinfo failed (Unknown error 300)
> n-1<10200> ssi:boot:bproc: n2 node status: up
> n-1<10200> ssi:boot:bproc: n2 bproc_nodeinfo failed (Unknown error 300)
> n-1<10200> ssi:boot:bproc: n3 node status: up
> n-1<10200> ssi:boot:bproc: n3 bproc_nodeinfo failed (Unknown error 300)
> n-1<10200> ssi:boot:bproc: n4 node status: up
> n-1<10200> ssi:boot:bproc: n4 bproc_nodeinfo failed (Unknown error 300)
> n-1<10200> ssi:boot:bproc: n5 node status: up
> n-1<10200> ssi:boot:bproc: n5 bproc_nodeinfo failed (Unknown error 300)
>
> Code diving, errno 300 is invalid node. Which is kind of unexpected,
> since the nodes seem to exist in previous bproc calls. The first
> failure is mostly expected at this time, since I haven't figured out
> the resolving the master node issue yet.
>
> Any advice on either issue would be much appreciated.
>
> Brian
>
> --
>   Brian Barrett
>   LAM/MPI developer and all around nice guy
>   Have a LAM/MPI day: http://www.lam-mpi.org/
>
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users
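
To illustrate the numbering convention described in the reply (master is -1, slaves start at 0, so an id like -3 cannot be right), here is a minimal C sketch of a sanity check that could run before any node lookup. It deliberately does not call into libbproc; the names MASTER_NODE_ID, node_kind, classify_node and count_invalid are hypothetical helpers invented for this example and are not part of the BProc or LAM APIs. The upper bound on slave ids is not checked, since that depends on the cluster.

```c
#include <stddef.h>

/* Hypothetical constant: the reply states BProc designates the master as -1. */
#define MASTER_NODE_ID (-1)

/* Hypothetical classification of a resolved node id. */
enum node_kind { NODE_MASTER, NODE_SLAVE, NODE_INVALID };

/* Classify an id using only the convention from the reply:
 * -1 is the master, ids >= 0 are slaves, anything else is bogus. */
static enum node_kind classify_node(int id)
{
    if (id == MASTER_NODE_ID)
        return NODE_MASTER;
    if (id >= 0)
        return NODE_SLAVE;
    return NODE_INVALID;
}

/* Count ids in a resolved list that cannot name any node (e.g. -3),
 * which would point at a bug in the address-to-node translation. */
static size_t count_invalid(const int *ids, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (classify_node(ids[i]) == NODE_INVALID)
            bad++;
    }
    return bad;
}
```

A check like this would flag the n-3 entry in the log above as a resolution error rather than a node that is down, and would let the master be handled as its own case instead of being skipped.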