On Sat, 29 Mar 2003, Julian Seward wrote:
> A big bottleneck (we surmise, but haven't accurately measured)
> for skins which deal with A bits, is handling %ESP changes.
> Consider a source push instr:
>
> 0x40184623: pushl %esi
>
> 10: GETL %ESI, t8
> 11: GETL %ESP, t12
> 12: MOVL t12, t10
> 13: SUBL $0x4, t10
> 14: PUTL t10, %ESP
> 15: STL t8, (t10)
> 16: INCEIPo $1
>
> Eventually there is, a result of this, a call to VG_(handle_esp_assignment),
> assuming the skin has asked for shadow memory (aiui).
It gets called if any of the following events are tracked by the skin:
new_mem_stack, new_mem_stack_aligned, die_mem_stack, die_mem_stack_aligned.
> This passes the old and new %ESP values, and in
> VG_(handle_esp_assignment) these are subtracted to discover that the
> stack delta is -4. Then we do
>
> VG_TRACK(new_mem_stack_aligned, new_esp, -delta);
>
> which necessarily involve a loop, and probably a test on the sign of
> delta, in the relevant skin's permissions-mangling function.
>
> This is all rather inefficient considering that we knew from the start
> that delta would be -4, and it would have been better to plant, in the
> generated code, a call directly to a skin function specialised to a delta
> of -4. Most deltas are small (+4, -4, +16, -16) and so a small bunch of
> specialised functions + a general fallback case would probably constitute
> a significant improvement.
>
> The improvements would be because (1) the specialised functions have no
> need to test the sign of delta nor have a loop to handle all values of
> it, and (2) because we could save having to call VG_(handle_esp_assignment)
> and then call onwards to the skin function; we could just call the skin
> function directly.
You're right about the inefficiency. I just tried altering Nulgrind so it
tracked the four events I mentioned above, where the called functions were
empty. My one test program slowed down from 0.20s to 0.35s, on a longer
run it went from 1.0s to 2.4s. Just to make sure it wasn't simply the
function call overhead, I tried instead inserting calls to an empty
function wherever %esp was updated; my test program (small input) only
slowed to 0.24s. So yes, the current mechanism is slow, well done for
spotting it.
[It's weird, something is wrong with profiling which is maybe why this
hasn't been spotted before -- there is an event VgpStack for measuring
exactly this stack-adjustment overhead, but I always get zero ticks for
it. I even put in a small busy loop into VG_(handle_esp_assignment)() so
that the program ran ten times longer than normal, 90% of that in the
VG_(handle_esp_assignment), and it still gave zero ticks (the ticks were
allocated to "running"). I'll investigate that when I look at this in
more detail.]
Your idea is quite plausible. Basically the skins would bypass the core's
built-in way of handling stack updates in order to do them itself, because
it can be more efficient that way.
Which makes me think: if we want to make this improvement for
{Mem,Addr}check, any other skin that tracks %esp changes probably wants it
as well. So let's try to improve the general mechanism rather than
implementing a {Mem,Addr}check-only optimisation.
The current approach is simplest from a code-generation point of view;
doing it this optimised way would require a little bit of smartness to
work out the delta (you have to look back a few UCode instructions to see
when %ESP was loaded and then what ADDs/SUBs were done on it), but it
shouldn't be too hard.
You could then have a number of trackable events for skins to hook into:
new_mem_stack_aligned_4
new_mem_stack_aligned_8
etc.
die_mem_stack_aligned_4
die_mem_stack_aligned_8
etc.
for the most common cases (eg. 4, 8, 12, 16, 20, 24). They would be
passed the old %esp. The skins could have unrolled versions of the
general stack-adjusting code for these cases.
Also have:
new_mem_stack_aligned_gen
new_mem_stack
die_mem_stack_aligned_gen
die_mem_stack
If a skin didn't provide these special case functions, the core could fall
back to using the general case ones if they were provided -- this would be
useful when first writing skins, when you don't want to write five
different versions of the same function. Ie. new_mem_stack_aligned_4
would be used if present, but fall back to new_mem_stack_aligned_gen if
present, but fall back to new_mem_stack if present, else do nothing.
One complication -- how do we know at compile-time if a stack-adjustment
is aligned? We can't (AFAICT) so maybe the events shouldn't have any
mention of alignment, and it's up to the skin to do an alignment check and
speed up its actions based on this if it wants. So the events might be
new_mem_stack_4, new_mem_stack_8, ..., new_mem_stack_gen.
Doing some highly unjustifiable mental calculations, I think this could
result in a speedup for Memcheck of 15%, and Addrcheck a bit better, maybe
20%, at least for my test program. That would be pretty nice.
I'll definitely look into this once I've finished looking at moving
malloc() et al out of core.
N
|