From: Peter L. <lar...@um...> - 2005-08-01 16:25:51
Hello:

I'm setting up a smallish (24 node, 48 processor) cluster to run a software package called dacapo (an atomic simulation package). The package consists of a binary (a pile of Fortran compiled against MPI) that calculates the wavefunctions, plus some Python modules that control what the binary calculates. I have the cluster set up with clustermatic 5, and the binary executes fine with mpirun.

After the wavefunctions are calculated (an iterative solution), Python is supposed to get a signal and do the next thing in the script (a geometry step, finish, or whatever the script says to do next). This never happens: the job just runs forever, or until bjs effectively "kills" it. If I kill the job myself, the Python script on the head node just runs forever; it never seems to figure out that dacapo has died.

I've played around with various ways of running this software. bpsh'ing the Python script to a node first and then having dacapo execute doesn't work at all, which I guess is expected. I've also messed with /proc/sys/bproc/shell_hack, etc. Running the Python script on the head node and having it call a shell script that executes mpirun is the closest I get, but it behaves as I describe above.

I realize the people on this list probably don't have experience with this software, and I'm not even sure I understand the internals completely anyway (I've dug into some of the Python, and I'll admit some of it starts to border on magic to me). The basic question is: does this sound like the kind of thing that bproc just doesn't handle well yet?

I'm starting to get the feeling I should cut my losses and just set up independent nodes with shared ssh keys, lam-mpi or mpich, and either openpbs or pbs-pro. We have access to a cluster set up this way, and the software works fine there. The clustermatic setup is nifty from an admin standpoint, so I'd love to make it work, but we need this cluster running, so I'm starting to think I should abort and go the other way now. What do you guys think?

Sorry for the long message, but I thought I should describe my problems as thoroughly as I can. Thanks for your input.

-Peter