Re: [BProc] Valgrind and BProc (again)

> [...]
>
> Interesting.  If I read this right, valgrind is acting as an ELF loader.

Correct.  There's no way to avoid this [that we know of.]

> Does it do linking stuff too?

No.  We merely load the executable and its direct dependencies,
then start up the stated ELF interpreter (ld.so) on our virtual
CPU.  So we don't have to get into the dynamic linking swamp, 
fortunately.
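
To illustrate what "the stated ELF interpreter" means: an executable
names its interpreter in a PT_INTERP program header, and a loader only
has to read that path and hand control to it.  A rough sketch in Python
follows.  It is not V's loader code; the function name elf_interpreter
is made up for illustration, and it assumes a 64-bit little-endian ELF
image.

```python
import struct

PT_INTERP = 3  # program header type that names the ELF interpreter


def elf_interpreter(image: bytes):
    """Return the PT_INTERP path of a 64-bit little-endian ELF image, or None."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:
        raise ValueError("sketch handles only 64-bit little-endian ELF")
    # In the ELF64 header, e_phoff is at offset 32 and
    # e_phentsize / e_phnum are at offsets 54 and 56.
    (e_phoff,) = struct.unpack_from("<Q", image, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 54)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # Elf64_Phdr starts: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz
        p_type, _flags, p_offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", image, base)
        if p_type == PT_INTERP:
            # The segment holds a NUL-terminated path such as ld.so's.
            return image[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None  # statically linked: no interpreter requested


# Usage (hypothetical path):
#   with open("/bin/ls", "rb") as f:
#       print(elf_interpreter(f.read()))
```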

> Does the target program effectively have its own dynamic linker,
> or is it shared with valgrind?  Does it share instances of the
> libraries? - it appears that stage2 is dynamically linked as well.

Our design goal is that V is completely independent of any other
libraries.  We haven't quite got there yet, but it nearly is.  The
primary motivation is that V has to maintain complete control over
the process' address space and signal state, and that's essentially
impossible if we defer to glibc to do low level stuff like malloc, 
free, etc.

Also, imagine the potential chaos if V and the program shared
glibc.so, and the simulated program was part way through doing malloc
(on the simulated CPU) when V decided to call malloc on the real CPU.  
Even if this didn't turn out to be a problem, the difficulty in
convincing ourselves that it's safe and always going to work is
huge.  So our policy is to make V (viz, stage2) as completely
self-contained as we can.

Since we're not quite there yet .. V does use glibc.so and ld.so,
but has its own instances of them.

> It's true that it's only one executable but that could be something
> pretty weird.  I'm kinda just thinking out loud here but what about
> the following:
>
> - valgrind starts up and gets through loading the program to be debugged.
> - valgrind stops and dumps itself w/ vmadump (bproc_dump()).
> - bpsh/mpirun migrates THAT process image instead of some fresh executable.
> - half started process w/ valgrind + other executable wakes up and
> runs on the slave node.
>
> The nasty bit here is that valgrind would have to be linked w/ bproc.
> I did some weird stuff w/ editing freshly loaded elf binaries to add a
> preinit section that called bproc.  That basically allowed the kernel
> to take over again after dynamic linking was done but before the
> program ran.  I don't know if some similar hack could work here.  I
> don't know - just a thought.
>
> This would be pretty easy to test, I think. If you added the
> bproc_dump call and just dumped to a plain file, you can execve that
> file directly to reload the dump.  That would allow bpsh to do its
> thing.  A real solution would probably look more like a dump into a pipe
> or something.
>
> That still leaves the problem of valgrind getting at files when it
> pleases.  Would it be possible/reasonable for valgrind to pre-load
> everything it *might* need down the line?  That could be optional.

That kind of thing might be a possibility, although I have to be honest
and say I'd prefer not to have to put BProc specifics into V if I don't
have to, especially as at this time we're working hard to make V less
target-specific.

I've been pondering a more generalised solution .. tell me if this 
sounds crazy.

It's a modified version of bpsh (or a replacement).  Instead of
doing

   bpsh <node_specifiers> program args

do 

   modified_bpsh <node_specifiers> path program args

modified_bpsh reads the entire tree rooted at path into itself
(mmap games, perhaps), migrates to the nodes, dumps the tree back
into the node-local filesystem, and execs program w/args as usual.

Running V on a slave node is then

  modified_bpsh <node_specifiers> \
          /where/V/is/installed/on/master \
          /where/V/is/installed/on/master/bin/valgrind \
          program \
          args

where the first argument is the path and the second is stage1.

This strikes me as having several advantages:

* doesn't require slaves to use an NFS-mounted filesystem
* is useful for any kind of tool requiring a read-only filesystem
* doesn't require linking V against BProc
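
The read-in / dump-out part of modified_bpsh can be sketched roughly
as follows.  This is only an illustration in Python, not BProc code:
pack_tree and unpack_tree are made-up names, a tar archive stands in
for whatever in-memory format a real implementation would use, and the
migration step itself is left out entirely.

```python
import io
import tarfile


def pack_tree(root: str) -> bytes:
    """Read the whole tree rooted at `root` into one in-memory blob."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # arcname="." stores paths relative to the root, so the tree
        # can be recreated under any directory on the slave.
        tar.add(root, arcname=".")
    return buf.getvalue()


def unpack_tree(blob: bytes, dest: str) -> None:
    """Dump a blob made by pack_tree back into the filesystem under `dest`."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        tar.extractall(dest)


# Intended flow (the middle step is the part BProc would provide):
#   blob = pack_tree("/where/V/is/installed/on/master")   # on the master
#   ... process migrates to the slave, carrying blob in its memory ...
#   unpack_tree(blob, "/where/V/is/installed/on/master")  # on the slave
#   ... then exec program w/args as usual ...
```

Because the blob lives in the process' own memory, it would travel
with the process image and need no shared filesystem.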

The intention is that the carried-around tree is small, as indeed
V's install tree is (5.7 MB).  If bandwidth is an issue (fair enough, if
sending copies to N hundred nodes) then it might be possible to 
compress the tree as it is read using a real-time compression package
and decompress on the slaves.  I'm thinking of LZO
(http://www.oberhumer.com/opensource/lzo) which is GPLd and very fast.
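
A sketch of the compress-as-you-read idea, in Python.  Note the
substitution: this uses zlib at its fastest level as a stand-in,
since LZO bindings are not in the Python standard library, so it says
nothing about LZO's actual speed or ratio.  The function names are
invented for illustration.

```python
import zlib


def compress_chunks(chunks, level=1):
    """Compress an iterable of byte chunks as they arrive (level 1 = fastest)."""
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    yield comp.flush()  # emit whatever is still buffered


def decompress_chunks(chunks):
    """Inverse of compress_chunks; chunk boundaries need not match."""
    decomp = zlib.decompressobj()
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail


# Usage: compress file contents while reading the tree on the master,
# then decompress on the slave before writing them out.
#   packed = b"".join(compress_chunks(file_chunks))
#   original = b"".join(decompress_chunks([packed]))
```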

Comments?

J