From: Brian B. <brb...@la...> - 2004-06-10 17:45:56
Hi all - I'm running into some problems making LAM work on BProc 4.0, and I was hoping someone could shed some light on them.

First, I have some code Greg Watson sent a month or so ago that iterates over the results of bproc_nodelist to translate from IP address to node number. bproc_nodelist doesn't seem to include the head node - is that expected? If so, is there a way to get the address of the head node so I can special-case that check?

Second, Daniel Gruner is running into problems with LAM on his cluster (it's a bproc-4 cluster), and it looks like the cause is that LAM's calls to bproc_nodeinfo() are all failing (we do some sanity checks on permissions before launch, just so we can print a readable message to the user if we find something wrong). The following is in a for loop iterating across the list of node IDs we've resolved:

    if (target_node != -1) {
      if (bproc_nodeinfo(target_node, &ninfo) < 0) {
        lam_debug(lam_ssi_boot_did,
                  "bproc: n%d bproc_nodeinfo failed (%s)\n",
                  target_node, strerror(errno));
        sfh_argv_add(&node_downc, &node_downv, nodes[i].lnd_hname);
      }
      /* Do stuff here... */
    }

When the code is run, I see the following (the "node status:" printf is from earlier in the for loop):

    n-1<10200> ssi:boot:bproc: n-3 nodestatus failed (-1)
    n-1<10200> ssi:boot:bproc: n-3 node status down, failure
    n-1<10200> ssi:boot:bproc: n0 node status: up
    n-1<10200> ssi:boot:bproc: n0 bproc_nodeinfo failed (Unknown error 300)
    n-1<10200> ssi:boot:bproc: n1 node status: up
    n-1<10200> ssi:boot:bproc: n1 bproc_nodeinfo failed (Unknown error 300)
    n-1<10200> ssi:boot:bproc: n2 node status: up
    n-1<10200> ssi:boot:bproc: n2 bproc_nodeinfo failed (Unknown error 300)
    n-1<10200> ssi:boot:bproc: n3 node status: up
    n-1<10200> ssi:boot:bproc: n3 bproc_nodeinfo failed (Unknown error 300)
    n-1<10200> ssi:boot:bproc: n4 node status: up
    n-1<10200> ssi:boot:bproc: n4 bproc_nodeinfo failed (Unknown error 300)
    n-1<10200> ssi:boot:bproc: n5 node status: up
    n-1<10200> ssi:boot:bproc: n5 bproc_nodeinfo failed (Unknown error 300)

From code diving, errno 300 is "invalid node". That is unexpected, since the nodes seem to exist as far as the previous bproc calls are concerned. The first failure (n-3) is mostly expected at this point, since I haven't yet figured out how to resolve the master node.

Any advice on either issue would be much appreciated.

Brian

--
  Brian Barrett
  LAM/MPI developer and all around nice guy
  Have a LAM/MPI day: http://www.lam-mpi.org/
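
P.S. To make the two issues concrete, here is a minimal, self-contained sketch of the behaviour I am after. It does not call BProc at all, and everything in it is an assumption made for illustration: struct node_addr stands in for whatever bproc_nodelist returns, the names NODE_MASTER, NODE_NONE and ERR_INVALID_NODE are hypothetical, and their values (-1, -3, 300) are simply the numbers that appear in the code and log in this post. The head node's address is passed in as a parameter, since how to obtain it is exactly the open question.

```c
#include <string.h>

/* Hypothetical names; the values mirror the numbers seen in this post
 * and should be checked against the installed BProc headers. */
#define NODE_MASTER      (-1)   /* assumed: the head node */
#define NODE_NONE        (-3)   /* assumed: no matching node found */
#define ERR_INVALID_NODE 300    /* "invalid node", per the code dive above */

/* Hypothetical stand-in for one entry of the node list. */
struct node_addr {
    int node;         /* node number */
    const char *ip;   /* dotted-quad address of that node */
};

/* Translate an IP address to a node number. The head node is checked
 * first because the node list does not appear to include it. */
static int lookup_node(const struct node_addr *list, size_t len,
                       const char *head_ip, const char *ip)
{
    size_t i;

    if (head_ip != NULL && strcmp(ip, head_ip) == 0)
        return NODE_MASTER;

    for (i = 0; i < len; ++i) {
        if (strcmp(list[i].ip, ip) == 0)
            return list[i].node;
    }
    return NODE_NONE;
}

/* Describe an error code. strerror() only knows the system's own errno
 * values, so the library-specific code is handled before falling back. */
static const char *describe_error(int err)
{
    if (err == ERR_INVALID_NODE)
        return "invalid node";
    return strerror(err);
}
```

With something like this, the loop could skip the bproc_nodeinfo() check when lookup_node() returns NODE_MASTER or NODE_NONE, and the debug line could print describe_error(errno) rather than "Unknown error 300".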