From: Erik H. <eah...@gm...> - 2005-06-22 05:54:31
On 6/20/05, Greg Watson <gw...@la...> wrote:
> Interestingly, I was just talking to one of the OpenMPI developers
> who have a very similar problem. On most systems, mpirun will launch
> a daemon on the remote nodes, then the daemon forks and execs the
> program to be run. It does this to maintain control of I/O forwarding
> amongst other things. Parallel debuggers also need to do a similar
> thing.

Yep. BProc is definitely not "most systems." :-)

> Unfortunately, bproc 4 does not support exec'ing an executable on a
> remote node (although older versions did), so they have to use some
> other mechanism like copying the executable to the node.
> Unfortunately, both the exec and direct copying (and NFS for that
> matter) bypass the tree spawn mechanism that makes bproc so efficient
> and scalable.

The execve hook became difficult to support because of some of the changes in the process movement code (specifically, atomic conversion of a process to/from a bproc message). It was relatively easy in the context of BProc 3 because BProc 3 didn't handle the distinction between a real process and a process in a message very well at all.

It was a pretty big misfeature anyway. The manner in which people generally wanted to use it was exactly the wrong thing to do with it - have mpirun/parallel debugger/etc. exec on the node, etc. We used to watch master nodes crumble trying to migrate only 100 or so processes off at once.

> The basic problem is the need to get two (or more)
> executables onto a node in such a way as they know about each other,
> but as far as I know this functionality is not available. I don't
> know how difficult it would be to modify vexecmove to handle multiple
> executables - perhaps Erik could answer that?

Multiple executables? I don't understand how that makes sense in the context of an 'execve' type system call.
You cannot load multiple binaries into a single process in any kind of meaningful way. Otherwise you're back to populating the file system, which should definitely not be part of execmove.

BTW, I think the general movement should be in the other direction. vexecmove() is a hugely complicated system call. There are all kinds of problems with it because it effectively combines fork() and execve(). Both these calls have special semantics wrt ptrace. There's a bunch of kernel code in there to try and emulate these weird cases without really returning to user space between these calls. This is nasty and gross and gdb doesn't work right without it.

As a possible fix, I observed that execmove can be implemented as execdump into my own memory space followed by a move and then undump from my own memory space. I have a branch in the CVS which does this and removes vexecmove as a primitive (it's emulated in the user space bproc library). I think I axed 1000 lines of kernel code as a result. It's still half baked because dump to/from my own memory isn't implemented. There are a number of other advantages to handling execmove as a dump/move/undump sequence in user space as well.

The bottom line is I think the kernel goop needs to get simpler, not more complicated.

- Erik

P.S. The execve hook wouldn't work for something like valgrind anyway since it doesn't use execve to load the program to be debugged.
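
The dump/move/undump idea described above can be sketched as a toy model. This is not BProc code: the Process class and every function name below are hypothetical stand-ins, and "dumping" is reduced to storing a binary's name in a buffer. It only illustrates the ordering of the three steps and how a vexecmove-style call might be layered on top of them in user space.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Toy stand-in for a process (hypothetical, not a BProc structure)."""
    node: int                   # node the process currently lives on
    image: str                  # name of the binary it is running
    dump: Optional[str] = None  # in-memory dump buffer, empty until step 1


def exec_dump_to_memory(proc: Process, binary: str) -> None:
    # Step 1: build the image of `binary` in our own memory; keep running.
    proc.dump = binary


def move(proc: Process, node: int) -> None:
    # Step 2: an ordinary move; the dump buffer travels with the process.
    proc.node = node


def undump_from_memory(proc: Process) -> None:
    # Step 3: replace the running image with the one held in the buffer.
    if proc.dump is None:
        raise RuntimeError("no dump in memory")
    proc.image, proc.dump = proc.dump, None


def execmove(proc: Process, node: int, binary: str) -> Process:
    # The three steps in order, each a separate user-space call.
    exec_dump_to_memory(proc, binary)
    move(proc, node)
    undump_from_memory(proc)
    return proc


def vexecmove(parent: Process, nodes: List[int], binary: str) -> List[Process]:
    # User-space emulation: one fork-like copy per node, then execmove.
    # This is a plain loop; it does not model BProc's tree spawn.
    children = []
    for node in nodes:
        child = Process(node=parent.node, image=parent.image)  # stands in for fork()
        children.append(execmove(child, node, binary))
    return children
```

For example, calling vexecmove(Process(node=-1, image="mpirun"), [0, 1, 2], "a.out") in this model returns three Process objects, one per target node, each with image "a.out" and an empty dump buffer, while the parent object is left unchanged. Because each step is its own call, the process is back in user space between them, which is the property the combined fork/execve path lacks.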