From: Daniel G. <dg...@ti...> - 2004-07-21 17:51:00
On Wed, Jul 21, 2004 at 10:58:39AM -0500, Brian Barrett wrote:
> On Jul 20, 2004, at 9:19 AM, Daniel Gruner wrote:
>
> > On Tue, Jul 20, 2004 at 09:03:01AM -0500, Brian Barrett wrote:
> >> On Jul 20, 2004, at 8:08 AM, Thomas Eckert wrote:
> >>
> >>> this thread seems to have slipped off the bproc-list -- most likely I
> >>> replied to the wrong message -- so here is a forward of my reply :(
> >>>
> >>> I'm interested in the bproc3<->lam-7.0.x results: have you tried
> >>> bproc3 with the latest stable lam (7.0.x) and it did not work, or are
> >>> you focusing on bproc4 now anyway due to other reasons (want to use
> >>> 2.6-kernels, ...)?
> >>
> >> Luke was using BProc 4, which LAM 7.0.x does not support. (LAM 7.1,
> >> which just went into beta, supports what is currently in the BProc 4
> >> API. Hopefully, that means it will support BProc 4 when it goes
> >> stable.)
> >>
> >> If you have any problems using LAM 7.0.x with BProc 3, please let us
> >> (the LAM developers) know. There has been some fairly extensive
> >> testing, so I would be surprised if there were problems in that area.
> >
> > I have recently installed LAM 7.0.2 on a CM3 (BProc 3) cluster. It
> > mostly works, but there are a few disturbing glitches:
> >
> > - I cannot seem to run 2 MPI jobs as the same user simultaneously (on
> > different sets of nodes, of course), since when I do the second
> > invocation of mpiexec (or its equivalent lamboot/run/lamhalt) it
> > kills the first lamd on the master node. It does seem to work for
> > different users, though.
>
> This is expected behavior. The design of LAM is that you start the
> RTE (the daemons) on the nodes you will use for all MPI applications,
> then run your application (or applications) inside that universe. If
> you need two separate universes, you can use the LAM_MPI_SESSION_SUFFIX
> environment variable to keep the daemons from clobbering each other.
> See the lamboot(1) man page and the LAM/MPI User Document (available in
> PDF form on the web page) for more information.

I see the logic in this, but what happens when you are running some kind
of batch queuing system, request a set of nodes/processors, and then run
lamboot on them? I guess LAM_MPI_SESSION_SUFFIX must then be used for
all mpiexec jobs, right? I had better rtfm...

Also, why did it only happen for the same user, but not when different
users do it? I guess you can only kill your own jobs, but do the LAM
daemons keep separate universes in that case?

> > - Just doing lamboot followed by lamhalt (whether or not some MPI job
> > is run) produces a core dump (I guess it is by lamhalt). Always.
> > Running mpiexec does it too.
>
> Yeah, that's disturbing. Is there a core file left around? If so, can
> you gdb the core file and send me a stack trace?

I'll try when I see one. Thanks.

Daniel

--
Dr. Daniel Gruner                         dg...@ti...
Dept. of Chemistry                        dan...@ut...
University of Toronto                     phone: (416)-978-8689
80 St. George Street                      fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada               finger for PGP public key
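
[Editor's note: a minimal sketch of the LAM_MPI_SESSION_SUFFIX workaround
Brian describes above, for running two universes as the same user. The
hostfile names (hosts-a, hosts-b), process count, and application name are
illustrative assumptions, not from the thread; it assumes a working
LAM 7.0.x installation.]

```shell
#!/bin/sh
# Two independent LAM universes for the same user: each lamboot/lamhalt
# pair runs under its own session suffix, so the second lamboot does not
# kill the first universe's lamd on the master node.

# First universe (hostfile "hosts-a" is an assumed name):
export LAM_MPI_SESSION_SUFFIX=job-a
lamboot hosts-a
mpirun -np 4 ./my_app &

# Second universe, on a disjoint set of nodes ("hosts-b"):
export LAM_MPI_SESSION_SUFFIX=job-b
lamboot hosts-b
mpirun -np 4 ./my_app

# Halt each universe with its own suffix set:
LAM_MPI_SESSION_SUFFIX=job-a lamhalt
LAM_MPI_SESSION_SUFFIX=job-b lamhalt
```

Under a batch queuing system, the natural choice is to derive the suffix
from the scheduler's job identifier so every job in the queue gets its own
universe automatically.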
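
[Editor's note: a sketch of the stack-trace step Brian requests above. The
binary path and core-file name are assumptions; on many systems the core
is left as "core" or "core.<pid>" in the working directory on the master
node, and core dumps must be enabled with "ulimit -c unlimited".]

```shell
#!/bin/sh
# Non-interactive gdb invocation that writes a full backtrace to a file
# suitable for mailing to the list. Replace the binary and core paths
# with the real ones on your cluster (these are illustrative).
gdb -batch -ex "bt full" /usr/local/lam/bin/lamhalt core.12345 \
    > lamhalt-trace.txt 2>&1
```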