From: Peter L. <lar...@um...> - 2005-08-04 16:52:59
On Aug 2, 2005, at 7:04 PM, adrian wrote:

> On 8/1/05, Peter Larkowski <lar...@um...> wrote:
>> Hello:
>>
>> I'm setting up a smallish (24 node, 48 processor) cluster to run a
>> software package called dacapo (an atomic simulation package). This
>> package consists of a binary (a pile of Fortran compiled against MPI)
>> that calculates the wavefunctions, and some Python modules to control
>> what the binaries calculate. I have the cluster set up with
>> Clustermatic 5, and the binary executes fine with mpirun. After the
>> wavefunctions are calculated (an iterative solution), Python is
>> supposed to get a signal and do the next thing in the script (a
>> geometry step, finish, or whatever the script says to do next), but
>> this never happens; the job just runs forever, or until bjs
>> effectively "kills" it.
>>
>> If I kill the job myself, the Python script on the head node just
>> runs forever (it never seems to figure out that dacapo has died).
>>
>> I've played around with various ways of running this software
>> (bpsh'ing the Python script to a node first and then having dacapo
>> execute; this doesn't work at all, which I guess is expected), and
>> I've messed with /proc/sys/bproc/shell_hack, etc. Running the Python
>> script on the head node and having it call a shell script that
>> executes mpirun is the closest I get, but it behaves as I described
>> above.
>>
>> I realize the people on this list probably don't have experience with
>> this software, and I'm not even sure I understand the internals
>> completely anyway (I've dug into some of the Python, and I'll admit
>> some of it starts to border on magic to me...), but the basic
>> question is: does this sound like the kind of thing that bproc just
>> doesn't handle well yet? I'm starting to get the feeling I should cut
>> my losses and just set up independent nodes with shared ssh keys,
>> LAM/MPI or MPICH, and either OpenPBS or PBS Pro. We have access to a
>> cluster set up that way, and the software works fine there, but this
>> Clustermatic setup is nifty from an admin standpoint, so I'd love to
>> make it work. Still, we need this cluster running, so I'm starting to
>> think I should abort and go the other way now. What do you guys
>> think? Sorry for the long message, but I thought I should describe my
>> problems as thoroughly as I can.
>
> Peter,
> I would be willing to take a look at the Python code for you, and
> hopefully shed a little light on what it is doing. Have you tried
> using the Python debugger? From your explanation, it sounds like
> perfect Clustermatic material to me.

I've done some more testing, and I think I can get around my original
problem, at least to some extent, but a larger problem has presented
itself. Our code is run from Python scripts, many of which read in very
large input files (> 1 GB). On a traditional cluster, the job runs on
the first node in the list of nodes assigned to it. On our bproc
cluster, all the Python gets executed on the head node, and a couple of
jobs running simultaneously bring the head node to its knees.

Is there a way to execute Python scripts on the slave nodes and then
have them execute the compute binaries via mpirun? I've read about some
bproc Python bindings, but they seem very old and don't compile against
4.0.0pre8. Is there a way to start the job on the head node and then
migrate it to a slave node, or something along those lines?

Thanks,
Peter
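
P.S. In case it helps, here is roughly the launch pattern I've been
using. The script and binary names below are just placeholders, not our
actual file names, and the node number is arbitrary.

    # Current approach: everything starts on the head node. The Python
    # control script is run there:
    python relax.py
    #   ...and it eventually calls a shell script that runs the compute
    #   binary across the nodes, along the lines of:
    mpirun -np 48 dacapo_mpi

    # What I tried, and what I'm really asking about: pushing the
    # Python interpreter out to a slave node first, so the big input
    # files get read there instead of on the head node:
    bpsh 4 python relax.py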