From: Jeremy F. <je...@go...> - 2004-03-08 01:44:46
On Fri, 2004-03-05 at 09:10, KJK::Hyperion wrote:
> At 01.14 02/03/2004, Jeremy Fitzhardinge wrote:
> >Well, the thing which Valgrind ideally wants is two complete address
> >spaces: one for the client, and one for Valgrind.
>
> are you 100% sure of what you're saying? if I understand correctly, at
> least the JIT compiler needs to run in the same process as the client.
> Otherwise we'd have an insane amount of inter-process memory copying,
> which doesn't exactly come for free

Well, there is the question of which address space the generated code
should go into.  You're right: it needs quick access to the client
memory for client execution, and quick access to the shadow memory for
the instrumentation.  I guess in principle you could play segment games
for this or something...  I haven't thought about this too deeply.

We definitely need the notion of multiple address spaces, but
Valgrind's precise requirements are somewhat more demanding than the
normal uses of address spaces.  It wants a distinct address space to
keep the client in, and one for the core Valgrind code, but since the
client code doesn't execute directly, we need some union space in which
generated code can reach both the client and Valgrind address spaces
equally easily.

This is doable but awkward in a single linear address space: you
partition the address space at some point, and say everything below is
client and everything above is Valgrind.  This is what we do now, and
as long as client-visible objects don't have high fixed addresses, it
all works.  The main problem is that 4G is a bit of a squeeze.  This
model should work a lot better for 64-bit address spaces, since there's
a lot more room to play with.  (x86-64 is a bit odd, since the
toolchain doesn't support code above 4G, and it's "only" a 48-bit
virtual address space anyway.)
Two linear address spaces (ie, two separate unix processes) would allow
the client the full run of one whole address space, but there's no
clear way for generated code to have efficient access to both address
spaces.  Some kind of segmentation scheme, in which we map two segments
onto two distinct linear address spaces, would work, since generated
code could use a segment prefix on memory operations to distinguish
which address space it wants to access.  Unfortunately, I don't think
this works on x86 (since segmentation is layered on top of paging, all
segments ultimately map onto the same underlying 4G paged address
space).  PPC's notion of segments is somewhat different, but I don't
think it does the right thing either.  In other words, if someone
wanted to make a CPU and OS optimised for running Valgrind, it would
look somewhat different from either the Unix model or the Windows
model, I think.

> this sounds like good news, finally. Is there an updated technical
> overview? last time I looked, it said Valgrind didn't do
> multithreading

I think that overview was obsolete even then.  Multithreading has been
in there a long time.  But it needs a lot more updating, since things
changed quite a bit in 2.1.0 and 2.1.1 (when it appears).  2.1.2 will
rearrange the source tree and probably include at least the FreeBSD
port, but I don't think there'll be much deep architectural change.

> >Does this mean that if we translate this code into the instrumented
> >code cache, then things will care because the EIP isn't within the
> >kernel/user/gdi.dll?
>
> no, this shouldn't be a problem, it only hurts full virtualization.
> The kernel-mode windowing and graphics code does call back to
> user-mode, but it only does so through a well-defined entry point -
> one of the entry points Valgrind will have to catch anyway not to
> lose control

I don't really understand what the problem with FV is.
If kernel32.dll, user32.dll and gdi32.dll need to be loaded once at a
fixed address, there's still no reason why the client and Valgrind
couldn't share the same copy.  All the client's uses of code in those
.dlls will be virtualized, of course.  Hm, I guess if those libraries
hold state, we need to make sure the client's and Valgrind's versions
of that state are distinct.  If this is the case, it doesn't seem like
FV itself is the problem - it's the more general problem of how to
multiplex one "process" state/context between two somewhat independent
separate programs.

> Apropos, some details for the other Windows guys:
> - the entry points (exported by ntdll.dll) are:
>   - KiRaiseUserExceptionDispatcher
>   - KiUserApcDispatcher
>   - KiUserCallbackDispatcher
>   - KiUserExceptionDispatcher
>   KiUserApcDispatcher will always be hit at least once per thread, as
>   an APC is queued to all threads to run the initialization code. We
>   won't need to do special handling of any of them, though. We'll
>   just switch to the JIT upon entering one of them. The first thread
>   entering the JIT will initialize it, the others will spin until
>   initialization is done. We could detect new threads by checking the
>   flat address of their TEB against an internal table, or by
>   allocating a TLS slot and checking for its value (NULL -> new
>   thread)

What's an APC?  Async procedure call?  What does that mean?

> - the entry points above aren't enough. Some control transfers happen
>   outside of them - luckily there aren't many. I've counted three:
>   ZwContinue, ZwSetContextThread and ZwVdmControl. The first two are
>   easy, the third is a mystery. I know it causes control transfers to
>   and from V86 virtual machines, but how it does that is not known -
>   luckily, only NTVDM uses it

I guess this would be an elaboration of the games we play currently
with signals?  That's the only place in Unix where the kernel
asynchronously changes the process context.

> - catching system calls is a mess.
> Hooking system calls directly in kernel mode, as dangerous as it is,
> is the best way for several reasons. I don't like how that strace for
> Windows does it, though. To signal a call I'd raise an exception:
> it's semantically correct, so it will work well with existing (and
> future) code

Is there some distinct instruction or class of instructions used for
calling into the kernel?  int?  lcall through some special call gate?
Any of those we can identify at translation time and do the right
thing.

J