From: Julian S. <js...@ac...> - 2002-12-07 13:36:04
Anybody fancy a nice self-contained xmas hack? -- J
Speeding up the "addrcheck" skin -- Julian Seward, 7 December 02
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What follows is an idea for an optimisation aimed at speeding up
one of the new tools in valgrind-2.0: the "addrcheck", fine-grained
address checker.
I don't have time to try this myself. But it would be a shame to ship
2.0 without at least trying this idea. It might make addrcheck a lot
faster, or it might have no effect. Either way, I'd like to know.
So here is the idea. Realistically, trying it out would take a hacker
experienced in valgrind internals, perhaps a weekend. A competent
hacker, with a good handle on low-level bit twiddling and x86
assembly, but no knowledge of valgrind, might take more like a week.
Or perhaps less. The nice thing about this one is it's very
self-contained, and you don't actually need to know anything much
about V to do it.
If you have the time, understanding and enthusiasm to try this out,
please go ahead, and let me know you're on the case. This work can be
carried out either on the 1.1.0 snapshot or the CVS head; either way
it can be installed into the CVS head easily enough, if successful.
Timescale: I'd like to ship valgrind-2.0 in late Jan, if possible, so
this would form an interesting xmas-break hack for someone, ideally.
Background
~~~~~~~~~~
The addrcheck skin (tool) is a new bug-detecting tool in valgrind-2.0.
It is a simplified version of the "traditional" valgrind checks that
1.0.X does, with one crucial detail different: there is no checking
for undefined values. Result is that addrcheck does fine-grained
address checking only: for every read and write, it checks that the
program is really allowed to read/write at that address.
This forms an interesting compromise compared to full-scale 1.0.X-style
valgrinding. It still picks up: reading/writing freed memory,
reading/writing off the start/end of malloc'd blocks, and passing
invalid addresses to system calls. These bugs are hard to find and do
cause crashes in practice. On the other hand, you don't find
undefined-value errors any more; you'll need the memcheck skin to do
that.
So, on the plus side, addrcheck runs about twice as fast as full-scale
valgrinding, whilst still picking up important bugs. On the minus side,
you lose undefined-value checking. No such thing as a free lunch.
How Addrcheck works at the moment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pretty much everything you need to know is in addrcheck/ac_main.c, and
the entire experiment should be possible by hacking only that file.
Addrcheck keeps a bitmap, with (potentially) one bit for each byte of
the 4G address space. That bit indicates whether the associated
address is currently valid. The bitmap is
represented by a two-level sparse array, which is grown dynamically,
so as to keep its space usage sensible.
Every memory access has to be checked. For a 4-byte load or store --
by far the most common case -- the code generated by the valgrind
dynamic translator has to call ac_helperc_ACCESS4(), in
addrcheck/ac_main.c, passing it the address to be checked. This
function does the check, emits a warning if needed, and returns.
The speed of this operation is critical since it is very frequent.
The whole business looks like this. First we start off in
code generated by valgrind. The address to be checked is in %edx:
   pushl %eax
   pushl %edx
   movl  %edx, %eax
   call  * 36(%ebp)
This call takes us directly to ac_helperc_ACCESS4, the fast
(common) path through which is:
ac_helperc_ACCESS4:
   pushl  %ebx
   movl   %eax, %ebx
   roll   $18, %eax
   andl   $1048572, %eax
   movl   primary_map(%eax), %edx
   movzwl %bx, %eax
   shrl   $3, %eax
   movzbl (%eax,%edx), %eax
   movl   %ebx, %ecx
   andl   $4, %ecx
   sarl   %cl, %eax
   testl  $15, %eax
   je     .L314          (expected taken -- the no-error case)
   ....
.L314:
   popl   %ebx
   ret
Now we're back in generated code; restore the caller-saved regs:
   popl %edx
   popl %eax
This means the fast case takes 4 + 15 + 2 = 21 instructions, which
isn't good. Also not good are the rotate and the two shifts, each of
which is expensive on the P4 (4-cycle latency).
The idea
~~~~~~~~
(It occurs to me that you need a rock-solid understanding of how a
direct-mapped cache works, to make sense of the following.)
I believe the common case can be done in 4 or 5 instructions, which
can be generated in-line in the translation, so it doesn't even
involve a call.
The basic idea is to add a simulated direct-mapped cache, which
doesn't hold any data -- we only care about missing vs not missing in
the cache. The cache is an array of 2^N 32-bit addresses, for some N
(the size can be tuned later).
The meaning of the cache is as follows. If we succeed in finding an
address in the cache, it means that all 4 bytes of the word
surrounding the address are accessible, which is the expected
fast case. If any of the 4 bytes in a word surrounding the address
are not accessible, we arrange for the cache always to generate a miss
when presented with that address.
The cache-hit/miss test is done in-line in the generated code and
takes 4 or 5 instructions. If we hit, we just keep going and that's
the end of it. If we miss, then we have to call out to a helper
function to handle it, but that should be relatively rare.
Each entry in the cache array is simply a 4-byte-aligned address where
we guarantee that the address is valid. For example, if cache entry
[0] contains the value 0x12345600, this tells us that addresses
0x12345600 to 0x12345603 inclusive are valid.
The cache (well, really the only part which exists is the tag array)
is indexed by bits (N+1 .. 2) inclusive of the address. That is,
define the function

   #define N_MASK ((1 << N)-1)

   index(a) = (a >> 2) & N_MASK

Any address a (of a 4-byte access) will then hit in the cache exactly when

   a == cache[ index(a) ]

ie, we look up in the relevant slot and find our own address.
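In C, the index calculation and hit test might look like this (a sketch
only; the names and the value of N are placeholders, not anything from
ac_main.c):

```c
#include <stdint.h>

#define N      16                   /* log2 of the number of entries; tunable */
#define N_MASK ((1u << N) - 1u)

static uint32_t cache[1u << N];     /* tag array: 4-aligned addresses known valid */

static inline uint32_t index_of(uint32_t a)
{
   return (a >> 2) & N_MASK;        /* bits (N+1 .. 2) of the address */
}

static inline int is_hit(uint32_t a)
{
   return cache[index_of(a)] == a;  /* hit iff the slot holds exactly a */
}
```

Note that a misaligned a can never hit, because only addresses with low
bits 00 are ever stored in cache[].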
Notice this is more subtle than it at first appears. Specifically,
how do we handle a being misaligned (very rare in practice, but we
still need to handle it)? Well, we only allow the cache[] array to
hold addresses whose lowest two bits are zero. So if a is misaligned
(ie, its lowest two bits are not both zero), the comparison will
always fail.
Another subtlety is how we invalidate an entry. Suppose that the
cache says that some address a is valid, by having an entry satisfying
a == cache[ index(a) ]. And now we want to make that word invalid,
perhaps because it's part of a block of memory being released by the
simulated application calling free().
So we arrange that the value in the cache slot can never match
any address which might map to that slot. An easy way to do this is
to set cache[index(a)] = ~a (bitwise inversion of a).
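A minimal C sketch of this invalidation rule (same hypothetical cache
layout as the earlier index() definition; names are placeholders):

```c
#include <stdint.h>

#define N      16
#define N_MASK ((1u << N) - 1u)

static uint32_t cache[1u << N];

/* Make the word at (4-aligned) address a miss from now on.  ~a differs
   from a in every bit -- in particular in the index bits (N+1 .. 2) --
   so ~a can never equal any address that maps to this slot. */
static inline void invalidate(uint32_t a)
{
   cache[(a >> 2) & N_MASK] = ~a;
}
```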
Now the really neat thing about this is that the fast-case
check a == cache[ index(a) ] translates to not much x86 code.
Let %a be a reg holding a, which we don't want to trash, and let
%r be some spare reg. The check is then:
   movl %a, %r                    -- %r := %a
   andl $(N_MASK << 2), %r        -- %r := sizeof(cache-slot) * index(a)
   cmpl cache(%r), %a             -- Z flag set iff cache hit
   jz   fast-case-continuation

   call slow-case-helper

fast-case-continuation:
Not bad! The (>> 2) in the index() calculation is exactly compensated
for by the implicit (<< 2) in the cache[...] array access, saving
insns and (crucially on a P4) any shifting operations. Valgrind's
code generator keeps track of free registers, so we can usually get a
suitable candidate for %r at no extra expense. If we're unlucky we
can push and pop some other register around the sequence, but even
that's pretty darn quick.
Actually doing it
~~~~~~~~~~~~~~~~~
Implementing this idea doesn't mean writing a lot of code. It does
mean considerable preliminary thought, writing designs on paper, etc,
so as to:
-- work out the fiddly details of extending this to accesses of
   sizes 1, 2 and 8 (treat as 2 x 4 ?).

-- make sure that the working out gives correct, slow-case behaviour
   for all possible cases of misaligned addresses.

-- figure out how the cache interacts with its backing store,
   ie the existing sparse array

-- sanity check the entire story, including that about invalidating
   cache entries (I think my story is ok, but not 100% sure)

-- figure out how this impacts set_address_range_perms(), since that
   is a frequently-called function (every time the simulated machine's
   %esp changes!)
Although performance is the aim here, the number one priority is
correctness. Valgrind is a debugging tool, so it is vital that this
address-check machinery works correctly. IOW, apart from doing the
hackery you'll also need to convince us that your hackery is
absolutely and totally correct under all circumstances!
Antidisirregardless!
A way to get started, without having to immerse yourself in the grotty
details of the x86 insn set, is to add this cache purely inside
ac_main.c and rewrite the *ACCESS* functions to use the cache. This
allows you to develop and debug your logic whilst operating in C land.
Making sure the logic is correct is the hard bit of this. Once that's
done it's easy to persuade V's code generator to emit the
(abovementioned) fast-case code fragments in-line.
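For instance, the C-land rewrite of the 4-byte case might look roughly
like this sketch. The cache here is hypothetical, and access4_slow() is
a stub standing in for the existing sparse-array check and error
reporting in ac_main.c:

```c
#include <stdint.h>

#define N      16
#define N_MASK ((1u << N) - 1u)

/* Caution: a real implementation must initialise slots with a value
   that can't match (eg (~i) << 2), since an all-zero slot would make
   address 0 spuriously hit.  Zero-init is fine for this demo. */
static uint32_t cache[1u << N];
static int slow_calls = 0;          /* instrumentation for the sketch only */

/* Stub: the real thing consults the two-level sparse array, reports an
   error if needed, and refills the cache slot on success. */
static void access4_slow(uint32_t a)
{
   slow_calls++;
   if ((a & 3u) == 0)               /* pretend the sparse array said "valid" */
      cache[(a >> 2) & N_MASK] = a;
}

void ac_helperc_ACCESS4_cached(uint32_t a)
{
   if (cache[(a >> 2) & N_MASK] == a)
      return;                       /* fast case: aligned, known valid */
   access4_slow(a);                 /* miss, misaligned, or genuinely bad */
}
```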
Anyway, if you can make this fly, I'd love to hear the outcome.
Please contact me (js...@ac...) and cc to the developers
mailing list val...@li....
From: Jeremy F. <je...@go...> - 2002-12-09 05:29:29
|
On Sat, 2002-12-07 at 05:43, Julian Seward wrote:
> -- work out the fiddly details of extending this to accesses of
> sizes 1, 2 and 8 (treat as 2 x 4 ?).
1 and 2 should be easy; just round the address down to the next multiple
of 4 and probe (since presence in the hash means that addresses a .. a+3
are valid). 8 should also be easy, as two probes, but probably isn't
common enough to spend lots of effort on. Misalignment where the access
can cross a multiple-of-4 boundary is irritating, but see below:
> -- make sure that the working out gives correct, slow-case behaviour
> for all possible cases of misaligned addresses.
Misaligned accesses are the tricky bit. Detecting a mis-aligned access
is going to complicate the test site somewhat (another test and
conditional jump). You could make it fall into the slow path, but
testing the next hash entry up would be just as simple (and a hack to
make this slightly quicker: if you have a hash with 2^N entries, then
make the array 2^N+1 entries long, with the last entry always being a
copy of the first entry - that way you can always probe the next one up
without worrying about wrapping). On the other hand, there probably
aren't enough misaligned accesses to make it worth complicating the
inline fastpath.
Hm, so the details:
For size == 4, the access is aligned iff (a & 3) == 0. So testing for
that is easy.
For size == 2, the access is aligned (as in not crossing a multiple-of-4
boundary) if ((a & 3) < 3), which can be tested with:
   testl $3, %addr
   jz    aligned      // addr is ....00
   jnp   aligned      // addr is ....10 or ....01
   call  slowpath
   jmp   done
aligned:
   // fastpath
done:
   ...
And fortunately, size==1 can't be misaligned.
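All three cases can be captured by one C predicate (a sketch, just to
make the case analysis explicit):

```c
#include <stdint.h>

/* Does a size-byte access at a stay inside one 4-byte word?
   size 4: needs (a & 3) == 0
   size 2: needs (a & 3) <  3
   size 1: always true */
static inline int within_word(uint32_t a, unsigned size)
{
   return (a & 3u) + size <= 4u;
}
```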
So I guess the full code for size = 4 would be:
        testl $3, %a
        jnz   slow
        movl  %a, %r
        andl  $MASK, %r
        cmpl  cache(%r), %a
        jz    done
slow:   call  slow-path
done:
For size == 2:
        testl $3, %a
        jz    fast
        jp    slow
fast:   movl  %a, %r
        andl  $MASK, %r
        cmpl  cache(%r), %a
        jz    done
slow:   call  slow-path
done:
For size == 1:
>    movl %a, %r                    -- %r := %a
>    andl $(N_MASK << 2), %r        -- %r := sizeof(cache-slot) * index(a)
>    cmpl cache(%r), %a             -- Z flag set iff cache hit
>    jz   fast-case-continuation
>
>    call slow-case-helper
>
> fast-case-continuation:
> -- figure out how the cache interacts with its backing store,
> ie the existing sparse array
> -- sanity check the entire story, including that about invalidating
> cache entries (I think my story is ok, but not 100% sure)
You mean putting ~a into cache[index(a)]. Seems reasonable to me; it
could only be a problem if (a & mask) can ever equal (~a & mask); the
simple case is where mask == ~0: can a == ~a?
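(It can't: a and ~a differ in every bit, so for any nonzero mask the
masked values differ too. A brute-force C check of the 8-bit case, as a
sketch:)

```c
#include <stdint.h>

/* Return 1 if (a & mask) != (~a & mask) for every 8-bit a. */
static int inversion_never_matches(uint32_t mask)
{
   uint32_t a;
   for (a = 0; a < 256; a++)
      if ((a & mask) == (~a & mask))
         return 0;
   return 1;
}
```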
> -- figure out how this impacts set_address_range_perms(), since that
> is a frequently-called function (every time the simulated machine's
> %esp changes!)
Shouldn't be too hard to write an efficient cache-stomper.
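Something like this, say, walking the affected range word by word
(cache layout as in the earlier fragments; for very large ranges a full
sweep of the 2^N slots might be cheaper, which would be worth
measuring):

```c
#include <stdint.h>

#define N      16
#define N_MASK ((1u << N) - 1u)

static uint32_t cache[1u << N];

/* Evict any cache entry covering a word in [start, start+len). */
static void cache_stomp(uint32_t start, uint32_t len)
{
   uint32_t a   = start & ~3u;      /* first word touched */
   uint32_t end = start + len;
   for (; a < end; a += 4) {
      uint32_t i = (a >> 2) & N_MASK;
      if (cache[i] == a)
         cache[i] = ~a;             /* ~a can never match this slot again */
   }
}
```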
J
From: Julian S. <js...@ac...> - 2002-12-11 01:24:17
|
> So I guess the full code for size = 4 would be:
>
>         testl $3, %a
>         jnz   slow
>         movl  %a, %r
>         andl  $MASK, %r
>         cmpl  cache(%r), %a
>         jz    done
> slow:   call  slow-path
> done:
Hey? That seems like too many instructions to me. The idea is that the
cache entries are arranged so as to cause lookup failures on misalignment,
so that the testl and jnz are not needed.
If a cache slot mentions (holds) some address a, this means that a .. a+3
inclusive are addressable. Furthermore we require that a has 00 as its
lowest two bits. (**)
----------------
So a test for a 4-byte access at address vv is
   movl %vv, %temp
   andl $MASK, %temp
   cmpl cache(%temp), %vv
   jz   done
slow:
where MASK is (CACHE_MASK << 2) and CACHE_MASK is ((1 << CACHE_BITS)-1).
If %vv ends in anything other than 00, it cannot match any cache[] value
as implied by ** above.
To mark the ith cache slot empty, we place in it the value ((~i) << 2).
That causes all checks to fail since the middle CACHE_BITS cannot ever
then match. It also observes (**).
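In C, the empty-slot rule is simply (a sketch, with CACHE_BITS chosen
arbitrarily):

```c
#include <stdint.h>

#define CACHE_BITS 16
#define CACHE_MASK ((1u << CACHE_BITS) - 1u)

/* Empty marker for slot i: its index bits are ~i and its low two bits
   are 00, so no address that maps to slot i (whose index bits are i)
   can ever compare equal to it. */
static inline uint32_t empty_slot(uint32_t i)
{
   return (~i) << 2;
}
```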
----------------
The test for a 1-byte access at address vv is
   movl %vv, %temp
   movl %vv, %temp2
   andl $MASK, %temp      -- cache index, as before
   andl $(~3), %temp2     -- dump bits 0 and 1 of address (~3 == 111...11100b)
   cmpl cache(%temp), %temp2
   jz   done
slow:
This is not so good (trashes a second reg), so perhaps your code is better
here. OTOH, providing enough spare regs exist, all reasonable machines
have 2 ALUs capable of doing the andls in parallel, so the sequence should be
fast.
----------------
Finally 2-byte is a minor variant of the 1-byte version:
   movl %vv, %temp
   movl %vv, %temp2
   andl $MASK, %temp      -- cache index, as before
   andl $(~2), %temp2     -- dump bit 1 of address (~2 == 111...11101b)
   cmpl cache(%temp), %temp2
   jz   done
slow:
The andl $(~2) is the subtlety. For the lowest two bits it gives the mapping

   00 -> 00,  01 -> 01,  10 -> 00,  11 -> 01

So if the address was 2-aligned (00, 10) it produces 00, which can potentially
match the cache[] entry.
If the address was not 2-aligned (01, 11) it produces 01, which can never
match and forces us to the slow case. It is true that this forces
addresses ending in 01 unnecessarily into the slow case whereas your
test-based code doesn't, but misaligned accesses are so rare that I think
it's more important to accelerate the common case.
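The low-bit mapping is easy to check in C (a sketch, with a placeholder
name for the rounding step):

```c
#include <stdint.h>

/* For a 2-byte access: clear bit 1, keep bit 0.  2-aligned addresses
   (low bits 00 or 10) become ...00 and may hit; misaligned ones
   (01 or 11) become ...01 and can never match a cache entry, since
   entries always end in 00. */
static inline uint32_t round_for_access2(uint32_t a)
{
   return a & ~2u;
}
```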
J
From: Jeremy F. <je...@go...> - 2002-12-11 01:51:11
|
On Tue, 2002-12-10 at 17:31, Julian Seward wrote:
> Hey? That seems like too many instructions to me. The idea is that the
> cache entries are arranged so as to cause lookup failures on misalignment,
> so that the testl and jnz are not needed.
Yep, you're right.
> This is not so good (trashes a second reg), so perhaps your code is better
> here. OTOH, providing enough spare regs exist, all reasonable machines
> have 2 ALUs capable of doing the andls in parallel, so the sequence should be
> fast.
I made the ACCESS UInstr take two args: the address and the rounded
address, so that I didn't have to scrounge for a pair of temps. It would
help if AND accepted a Lit32 argument though.
>    movl %vv, %temp
>    movl %vv, %temp2
>    andl $MASK, %temp      -- cache index, as before
>    andl $(~2), %temp2     -- dump bit 1 of address (~2 == 111...11101b)
>    cmpl cache(%temp), %temp2
>    jz   done
> slow:
>
> The andl $(~2) is the subtlety. For the lowest two bits it gives the mapping
> 00 -> 00, 01 -> 01, 10 -> 00, 11 -> 01
> So if the address was 2-aligned (00, 10) it produces 00, which can potentially
> match the cache[] entry.
Nice.
J