From: Julian S. <js...@ac...> - 2002-12-07 13:36:04
Anybody fancy a nice self-contained xmas hack? -- J
Speeding up the "addrcheck" skin -- Julian Seward, 7 December 02
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What follows is an idea for an optimisation aimed at speeding up
one of the new tools in valgrind-2.0: the "addrcheck", fine-grained
address checker.
I don't have time to try this myself. But it would be a shame to ship
2.0 without at least trying this idea. It might make addrcheck a lot
faster, or it might have no effect. Either way, I'd like to know.
So here is the idea. Realistically, trying it out would take a hacker
experienced in valgrind internals, perhaps a weekend. A competent
hacker, with a good handle on low-level bit twiddling and x86
assembly, but no knowledge of valgrind, might take more like a week.
Or perhaps less. The nice thing about this one is it's very
self-contained, and you don't actually need to know anything much
about V to do it.
If you have the time, understanding and enthusiasm to try this out,
please go ahead, and let me know you're on the case. This work can be
carried out either on the 1.1.0 snapshot or the CVS head; either way
it can be installed into the CVS head easily enough, if successful.
Timescale: I'd like to ship valgrind-2.0 in late Jan, if possible, so
this would form an interesting xmas-break hack for someone, ideally.
Background
~~~~~~~~~~
The addrcheck skin (tool) is a new bug-detecting tool in valgrind-2.0.
It is a simplified version of the "traditional" valgrind checks that
1.0.X does, with one crucial detail different: there is no checking
for undefined values. Result is that addrcheck does fine-grained
address checking only: for every read and write, it checks that the
program is really allowed to read/write at that address.
This forms an interesting compromise relative to full-scale 1.0.X-style
valgrinding. It still picks up: reading/writing freed memory,
reading/writing off the start/end of malloc'd blocks, and passing
invalid addresses to system calls. These bugs are hard to find and do
cause crashes in practice. On the other hand, you don't find
undefined-value errors any more; you'll need the memcheck skin to do
that.
So, on the plus side, addrcheck runs about twice as fast as full-scale
valgrinding, whilst still picking up important bugs. On the minus side,
you lose undefined-value checking. No such thing as a free lunch.
How Addrcheck works at the moment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pretty much everything you need to know is in addrcheck/ac_main.c, and
the entire experiment should be possible by hacking only that file.
Addrcheck keeps a bitmap, with (potentially) one bit for each byte of
the 4G address space. That bit indicates whether the associated
address is currently valid. The bitmap is
represented by a two-level sparse array, which is grown dynamically,
so as to keep its space usage sensible.
Every memory access has to be checked. For a 4-byte load or store --
by far the most common case -- the code generated by the valgrind
dynamic translator has to call ac_helperc_ACCESS4(), in
addrcheck/ac_main.c, passing it the address to be checked. This
function does the check, emits a warning if needed, and returns.
The speed of this operation is critical since it is very frequent.
The whole business looks like this. First we start off in
code generated by valgrind. The address to be checked is in %edx:
   pushl %eax
   pushl %edx
   movl  %edx, %eax
   call  *36(%ebp)
This call takes us directly to ac_helperc_ACCESS4, the fast
(common) path through which is:
ac_helperc_ACCESS4:
   pushl  %ebx
   movl   %eax, %ebx
   roll   $18, %eax
   andl   $1048572, %eax
   movl   primary_map(%eax), %edx
   movzwl %bx, %eax
   shrl   $3, %eax
   movzbl (%eax,%edx), %eax
   movl   %ebx, %ecx
   andl   $4, %ecx
   sarl   %cl, %eax
   testl  $15, %eax
   je     .L314      (expected taken -- the no-error case)
   ....
.L314:
   popl   %ebx
   ret
Now we're back in generated code; restore the caller-saved regs:

   popl %edx
   popl %eax
This means the fast case takes 4 + 15 + 2 = 21 instructions, which
isn't good. Also not good are the rotate and the two shifts, all of
which are expensive on the P4 (4-cycle latency each).
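To make the lookup above concrete, here is a toy C model of the
two-level sparse A-bit map. The names and the 16/16 address split are
illustrative assumptions only; the real layout lives in
addrcheck/ac_main.c and differs in detail (for instance, the real
helper fetches a nibble of A-bits at once rather than looping):

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model of the two-level sparse A-bit map (names are made up). */
#define PRIMARY_SIZE    (1 << 16)         /* indexed by the top 16 addr bits */
#define SECONDARY_BYTES ((1 << 16) / 8)   /* 1 A-bit per byte of a 64K chunk */

typedef struct { uint8_t abits[SECONDARY_BYTES]; } SecMap;

static SecMap *primary_map[PRIMARY_SIZE]; /* NULL => no valid bytes there */

/* Mark the byte at address a as accessible, allocating a SecMap
   on demand (this is the "grown dynamically" part). */
static void set_accessible(uint32_t a)
{
    SecMap **sm = &primary_map[a >> 16];
    if (*sm == NULL) *sm = calloc(1, sizeof(SecMap));
    uint32_t off = a & 0xFFFF;
    (*sm)->abits[off >> 3] |= (uint8_t)(1 << (off & 7));
}

/* Is the byte at address a accessible? */
static int is_accessible(uint32_t a)
{
    SecMap *sm = primary_map[a >> 16];
    if (sm == NULL) return 0;
    uint32_t off = a & 0xFFFF;
    return (sm->abits[off >> 3] >> (off & 7)) & 1;
}

/* The ACCESS4 question: are all 4 bytes of the word at a accessible? */
static int is_word_accessible(uint32_t a)
{
    return is_accessible(a)     && is_accessible(a + 1)
        && is_accessible(a + 2) && is_accessible(a + 3);
}
```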
The idea
~~~~~~~~
(It occurs to me that you need a rock-solid understanding of how a
direct-mapped cache works, to make sense of the following.)
I believe the common case can be done in 4 or 5 instructions, which
can be generated in-line in the translation, so it doesn't even
involve a call.
The basic idea is to add a simulated direct-mapped cache, which
doesn't hold any data -- we only care about missing vs not missing in
the cache. The cache is an array of 2^N 32-bit addresses, for some N
(the size can be tuned later).
The meaning of the cache is as follows. If we succeed in finding an
address in the cache, it means that all 4 bytes of the word
surrounding the address are accessible, which is the expected
fast-case. If any of the 4 bytes in a word surrounding the address
are not accessible, we arrange for the cache always to generate a miss
when presented with that address.
The cache-hit/miss test is done in-line in the generated code and
takes 4 or 5 instructions. If we hit, we just keep going and that's
the end of it. If we miss, then we have to call out to a helper
function to handle it, but that should be relatively rare.
Each entry in the cache array is simply a 4-byte-aligned address where
we guarantee that the address is valid. For example, if cache entry
[0] contains the value 0x12345600, this tells us that addresses
0x12345600 to 0x12345603 inclusive are valid.
The cache (well, really the only part which exists is the tag array)
is indexed by bits (N+1 .. 2) inclusive of the address. That is,
define the function
   #define N_MASK ((1 << N) - 1)

   index(a) = (a >> 2) & N_MASK
A 4-byte access to address a then hits in the cache exactly when

   a == cache[ index(a) ]

ie, we look in the relevant slot and find our own address.
Notice this is more subtle than it at first appears. Specifically,
how do we handle a being misaligned (very rare in practice, but we
still need to handle it)? Well, we only allow the cache[] array to
hold addresses whose lowest two bits are zero. So if a is misaligned
(ie, its lowest two bits are not both zero), the comparison will
always fail.
Another subtlety is how we invalidate an entry. Suppose that the
cache says that some address a is valid, by having an entry satisfying
a == cache[ index(a) ]. And now we want to make that word invalid,
perhaps because it's part of a block of memory being released by the
simulated application calling free().
So we arrange that the value in the cache slot can never match
any address which might map to that slot. An easy way to do this is
to set cache[index(a)] = ~a (bitwise inversion of a): the index bits
of ~a are the complement of those of a, so ~a maps to a different
slot and hence can never equal any address which indexes this one.
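In C, the whole cache story -- the hit test, filling a slot, and the
~a invalidation trick -- fits in a few lines. This is a toy sketch
with made-up names, just to pin down the semantics:

```c
#include <stdint.h>

/* Toy model of the proposed cache.  N and all names are assumptions. */
#define N      16                    /* log2(number of slots) -- tunable */
#define N_MASK ((1u << N) - 1)

static uint32_t cache[1u << N];      /* each slot: a known-valid, 4-aligned
                                        address.  Real code should init slot i
                                        to an impossible value, e.g. ~(i << 2),
                                        so address 0 cannot spuriously hit
                                        slot 0 of an all-zero array. */

static uint32_t index_of(uint32_t a) { return (a >> 2) & N_MASK; }

/* Hit iff the slot holds exactly this address.  A misaligned a can
   never hit, since only 4-aligned values are ever stored. */
static int cache_hit(uint32_t a) { return cache[index_of(a)] == a; }

/* Record that all 4 bytes of the word containing a are accessible. */
static void cache_fill(uint32_t a) { cache[index_of(a)] = a & ~3u; }

/* Invalidate: ~a has complemented index bits, so it maps to a
   different slot and can never equal any address hashing to this one. */
static void cache_invalidate(uint32_t a) { cache[index_of(a)] = ~a; }
```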
Now the really neat thing about this is that the fast-case
check a == cache[ index(a) ] translates to not much x86 code.
Let %a be a reg holding a, which we don't want to trash, and let
%r be some spare reg. The check is then:
   movl %a, %r              -- %r := %a
   andl $(N_MASK << 2), %r  -- %r := sizeof(cache-slot) * index(a)
   cmpl cache(%r), %a       -- Z flag set iff cache hit
   jz   fast-case-continuation
   call slow-case-helper
fast-case-continuation:
Not bad! The (>> 2) in the index() calculation is exactly compensated
for by the implicit (<< 2) in the cache[...] array access, saving
insns and (crucially on a P4) any shifting operations. Valgrind's
code generator keeps track of free registers, so we can usually get a
suitable candidate for %r at no extra expense. If we're unlucky we
can push and pop some other register around the sequence, but even
that's pretty darn quick.
Actually doing it
~~~~~~~~~~~~~~~~~
Implementing this idea doesn't mean writing a lot of code. It does
mean considerable preliminary thought, writing designs on paper, etc,
so as to:
-- work out the fiddly details of extending this to accesses of
   sizes 1, 2 and 8 (treat as 2 x 4 ?).
-- make sure that the working out gives correct, slow-case behaviour
   for all possible cases of misaligned addresses.
-- figure out how the cache interacts with its backing store,
   ie the existing sparse array.
-- sanity check the entire story, including that about invalidating
   cache entries (I think my story is ok, but not 100% sure).
-- figure out how this impacts set_address_range_perms(), since that
   is a frequently-called function (every time the simulated machine's
   %esp changes!)
Although performance is the aim here, the number one priority is
correctness. Valgrind is a debugging tool, so it is vital that this
address-check machinery works correctly. IOW, apart from doing the
hackery you'll also need to convince us that your hackery is
absolutely and totally correct under all circumstances!
Antidisirregardless!
A way to get started, without having to immerse yourself in the grotty
details of the x86 insn set, is to add this cache purely inside
ac_main.c and rewrite the *ACCESS* functions to use the cache. This
allows you to develop and debug your logic whilst operating in C land.
Making sure the logic is correct is the hard bit of this. Once that's
done it's easy to persuade V's code generator to emit the
(abovementioned) fast-case code fragments in-line.
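A possible starting shape for that C-land rewrite is below. All names
here are made up, and slow_check_word is just a stub standing in for
the real sparse-bitmap walk; the point is the fast-path-then-fallback
structure, which is exactly what the in-line asm would later do:

```c
#include <stdint.h>

/* Hypothetical C prototype of a cached ACCESS4, for debugging the
   logic before touching the code generator. */
#define CN      16
#define CN_MASK ((1u << CN) - 1)
static uint32_t acache[1u << CN];

static int slow_calls = 0;

/* Stand-in for the existing bitmap check (pretends everything is valid,
   and counts how often the slow path is taken). */
static int slow_check_word(uint32_t a) { (void)a; slow_calls++; return 1; }

static int access4(uint32_t a)
{
    uint32_t slot = (a >> 2) & CN_MASK;
    if (acache[slot] == a)            /* fast path: aligned hit        */
        return 1;
    int ok = slow_check_word(a);      /* miss: consult backing store   */
    if (ok && (a & 3u) == 0)          /* cache only aligned, valid words */
        acache[slot] = a;
    return ok;
}
```

Instrumenting the slow path like this also gives an easy way to measure
the hit rate on real workloads, which is what decides whether the whole
idea pays off.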
Anyway, if you can make this fly, I'd love to hear the outcome.
Please contact me (js...@ac...) and cc to the developers
mailing list val...@li....