Re: [BProc] Valgrind and BProc (again)

On 6/20/05, Greg Watson <gw...@la...> wrote:
> Interestingly, I was just talking to one of the OpenMPI developers
> who have a very similar problem. On most systems, mpirun will launch
> a daemon on the remote nodes, then the daemon forks and execs the
> program to be run. It does this to maintain control of I/O forwarding
> amongst other things. Parallel debuggers also need to do a similar
> thing.

Yep.  BProc is definitely not "most systems."  :-)
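For anyone following along, the conventional pattern Greg describes
looks roughly like the sketch below.  This is plain POSIX, not BProc
or OpenMPI code, and the name run_and_capture is made up for
illustration.  The "daemon" forks, the child redirects its stdout into
a pipe and execs the target, and the parent keeps the read end so it
stays in control of the I/O.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, exec argv[0] in the child with its stdout
 * redirected into a pipe, and copy up to cap-1 bytes of that output
 * into out (NUL-terminated).  Returns the child's exit status, or -1
 * on error or abnormal termination. */
int run_and_capture(char *const argv[], char *out, size_t cap)
{
    int fds[2];
    if (cap == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout to the pipe, then become the target. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent ("daemon"): hold the read end and collect the output. */
    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used + 1 < cap &&
           (n = read(fds[0], out + used, cap - 1 - used)) > 0)
        used += (size_t)n;
    out[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with an argv of {"echo", "hello", NULL} and a small buffer
should leave the child's output in the buffer.  A real launcher would
forward the data on to mpirun instead of storing it.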

> Unfortunately, bproc 4 does not support exec'ing an executable on a
> remote node (although older versions did), so they have to use some
> other mechanism like copying the executable to the node.
> Unfortunately, both the exec and direct copying (and NFS for that
> matter) bypass the tree spawn mechanism that makes bproc so efficient
> and scalable.

The execve hook became difficult to support because of some of the
changes in the process movement code (specifically, atomic conversion
of a process to/from a bproc message).  It was relatively easy in the
context of BProc 3 because BProc 3 didn't handle the distinction
between a real process and a process in a message very well at all.

It was a pretty big misfeature anyway.  The manner in which people
generally wanted to use it was exactly the wrong thing to do with it -
have mpirun/parallel debugger/etc. exec on the node, etc.  We used to
watch master nodes crumble trying to migrate only 100 or so
processes off at once.

> The basic problem is the need to get two (or more)
> executables onto a node in such a way as they know about each other,
> but as far as I know this functionality is not available. I don't
> know how difficult it would be to modify vexecmove to handle multiple
> executables - perhaps Erik could answer that?

Multiple executables?  I don't understand how that makes sense in the
context of an 'execve' type system call.  You cannot load multiple
binaries into a single process in any kind of meaningful way.
Otherwise you're back to populating the file system which should
definitely not be part of execmove.

BTW, I think the general movement should be in the other direction.
vexecmove() is a hugely complicated system call. There are all kinds
of problems with it because it effectively combines fork() and
execve().  Both these calls have special semantics wrt ptrace.
There's a bunch of kernel code in there to try and emulate these weird
cases without really returning to user space between these calls.
This is nasty and gross and gdb doesn't work right without it.
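
To make the ptrace point concrete, here is a minimal user-space sketch.
It is Linux-specific and is not BProc code; the function name
stop_signal_at_exec is made up for illustration.  A traced child that
calls execve gets stopped with SIGTRAP before the new program runs,
and that is one of the cases a combined fork+exec call has to emulate
inside the kernel.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that asks to be traced and then
 * execs path.  Returns the signal that stopped the child at the exec,
 * or -1 if the child did not stop or something else failed. */
int stop_signal_at_exec(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: become a tracee of the parent, then exec. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(126);
        execl(path, path, (char *)NULL);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: a debugger would sit here and see the exec stop. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;
    int sig = WSTOPSIG(status);

    /* The sketch is done with the child; kill and reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return sig;
}
```

With a path such as /bin/sh the return value should be SIGTRAP.  gdb
depends on seeing that stop, which is why the kernel has to fake it
when fork and exec are fused into one call.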

As a possible fix, I observed that execve can be implemented as
execdump into my own memory space followed by a move and then undump
from my own memory space.  I have a branch in the CVS which does this
and removes vexecmove as a primitive (it's emulated in the user space
bproc library).  I think I axed 1000 lines of kernel code as a result.
It's still half baked because dump to/from my own memory isn't
implemented.  There are a number of other advantages to handling
execmove as a dump/move/undump sequence in user space as well.  The
bottom line is I think the kernel goop needs to get simpler, not more
complicated.
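
A toy sketch of the shape I mean, with the three steps composed in
user space.  Every name here (execdump_to_memory, move_to_node,
undump_from_memory, execmove_emulated, struct image) is a made-up
stand-in, not the libbproc API or the code in the CVS branch, and the
stubs only record the call order instead of touching a real process.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a process image dumped into memory. */
struct image {
    const char *path;
    int loaded;
};

static char trace[64];          /* records step order, illustration only */

/* Stub: "exec" the program into a memory image, not into the caller. */
static int execdump_to_memory(const char *path, struct image *img)
{
    if (path == NULL)
        return -1;
    img->path = path;
    img->loaded = 0;
    strcat(trace, "dump ");
    return 0;
}

/* Stub: move the calling process, image and all, to another node. */
static int move_to_node(int node)
{
    if (node < 0)
        return -1;
    strcat(trace, "move ");
    return 0;
}

/* Stub: replace the caller with the image it carried along. */
static int undump_from_memory(struct image *img)
{
    img->loaded = 1;
    strcat(trace, "undump");
    return 0;
}

/* execmove as three ordinary steps with ordinary error returns. */
int execmove_emulated(int node, const char *path)
{
    struct image img;

    trace[0] = '\0';
    if (execdump_to_memory(path, &img) < 0)
        return -1;
    if (move_to_node(node) < 0)
        return -1;
    return undump_from_memory(&img);
}

/* Returns the step order recorded by the last execmove_emulated call. */
const char *last_trace(void)
{
    return trace;
}
```

Each step returns to user space before the next one starts, so a
failure in the move step is just an error return and not a special
case buried inside one big system call.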

- Erik

P.S.  The execve hook wouldn't work for something like valgrind anyway
since it doesn't use execve to load the program to be debugged.