From: Julian S. <js...@ac...> - 2003-03-29 12:45:44

While you're on the software-structure warpath, here's something
I noticed a while back, around xmas, when JeremyF and I were trying
to improve performance.

A big bottleneck (we surmise, but haven't accurately measured)
for skins which deal with A bits is handling %ESP changes.
The core/skin split has made this more expensive, I think, by
causing longer chains of calls into the relevant skin's memory-
permissions adjustor functions.  That isn't the problem, tho;
the following nonsensicality was there prior to that.
Consider a source push instr:
0x40184623: pushl %esi
10: GETL %ESI, t8
11: GETL %ESP, t12
12: MOVL t12, t10
13: SUBL $0x4, t10
14: PUTL t10, %ESP
15: STL t8, (t10)
16: INCEIPo $1
Eventually there is, as a result of this, a call to VG_(handle_esp_assignment),
assuming the skin has asked for shadow memory (aiui).  This passes the
old and new %ESP values, and in VG_(handle_esp_assignment) these are
subtracted to discover that the stack delta is -4.  Then we do

  VG_TRACK(new_mem_stack_aligned, new_esp, -delta);

(vg_memory.c:320ish)

which necessarily involves a loop, and probably a test on the sign of
delta, in the relevant skin's permissions-mangling function.
This is all rather inefficient considering that we knew from the start
that delta would be -4, and it would have been better to plant, in the
generated code, a call directly to a skin function specialised to a delta
of -4. Most deltas are small (+4, -4, +16, -16) and so a small bunch of
specialised functions + a general fallback case would probably constitute
a significant improvement.
The improvements would come because (1) the specialised functions have
no need to test the sign of delta nor to loop over all possible values
of it, and (2) we could save having to call VG_(handle_esp_assignment)
and then call onwards to the skin function; we could just call the skin
function directly.
You're the Architect(tm); I don't know if the idea makes sense or would
cause other difficulties, but I did notice this a while back.
J

From: Nicholas N. <nj...@ca...> - 2003-03-29 23:29:38

On Sat, 29 Mar 2003, Julian Seward wrote:
> A big bottleneck (we surmise, but haven't accurately measured)
> for skins which deal with A bits is handling %ESP changes.
> Consider a source push instr:
>
> 0x40184623: pushl %esi
>
> 10: GETL %ESI, t8
> 11: GETL %ESP, t12
> 12: MOVL t12, t10
> 13: SUBL $0x4, t10
> 14: PUTL t10, %ESP
> 15: STL t8, (t10)
> 16: INCEIPo $1
>
> Eventually there is, as a result of this, a call to VG_(handle_esp_assignment),
> assuming the skin has asked for shadow memory (aiui).
It gets called if any of the following events are tracked by the skin:
new_mem_stack, new_mem_stack_aligned, die_mem_stack, die_mem_stack_aligned.
> This passes the old and new %ESP values, and in
> VG_(handle_esp_assignment) these are subtracted to discover that the
> stack delta is -4. Then we do
>
> VG_TRACK(new_mem_stack_aligned, new_esp, -delta);
>
> which necessarily involves a loop, and probably a test on the sign of
> delta, in the relevant skin's permissions-mangling function.
>
> This is all rather inefficient considering that we knew from the start
> that delta would be -4, and it would have been better to plant, in the
> generated code, a call directly to a skin function specialised to a delta
> of -4. Most deltas are small (+4, -4, +16, -16) and so a small bunch of
> specialised functions + a general fallback case would probably constitute
> a significant improvement.
>
> The improvements would come because (1) the specialised functions have
> no need to test the sign of delta nor to loop over all possible values
> of it, and (2) we could save having to call VG_(handle_esp_assignment)
> and then call onwards to the skin function; we could just call the skin
> function directly.
You're right about the inefficiency.  I just tried altering Nulgrind so it
tracked the four events I mentioned above, where the called functions were
empty.  My one test program slowed down from 0.20s to 0.35s; on a longer
run it went from 1.0s to 2.4s.  Just to make sure it wasn't simply the
function-call overhead, I tried instead inserting calls to an empty
function wherever %esp was updated; my test program (small input) only
slowed to 0.24s.  So yes, the current mechanism is slow; well done for
spotting it.
[It's weird: something is wrong with profiling, which may be why this
hasn't been spotted before -- there is an event VgpStack for measuring
exactly this stack-adjustment overhead, but I always get zero ticks for
it.  I even put a small busy loop into VG_(handle_esp_assignment)() so
that the program ran ten times longer than normal, with 90% of that in
VG_(handle_esp_assignment), and it still gave zero ticks (the ticks were
allocated to "running").  I'll investigate that when I look at this in
more detail.]
Your idea is quite plausible.  Basically a skin would bypass the core's
built-in way of handling stack updates in order to do them itself, because
it can be more efficient that way.
Which makes me think: if we want to make this improvement for
{Mem,Addr}check, any other skin that tracks %esp changes probably wants it
as well. So let's try to improve the general mechanism rather than
implementing a {Mem,Addr}check-only optimisation.
The current approach is simplest from a code-generation point of view;
doing it this optimised way would require a little bit of smartness to
work out the delta (you have to look back a few UCode instructions to see
when %ESP was loaded and then what ADDs/SUBs were done on it), but it
shouldn't be too hard.
You could then have a number of trackable events for skins to hook into:
new_mem_stack_aligned_4
new_mem_stack_aligned_8
etc.
die_mem_stack_aligned_4
die_mem_stack_aligned_8
etc.
for the most common cases (eg. 4, 8, 12, 16, 20, 24). They would be
passed the old %esp. The skins could have unrolled versions of the
general stack-adjusting code for these cases.
Also have:
new_mem_stack_aligned_gen
new_mem_stack
die_mem_stack_aligned_gen
die_mem_stack
If a skin didn't provide these special-case functions, the core could fall
back to using the general-case ones if they were provided -- this would be
useful when first writing skins, when you don't want to write five
different versions of the same function.  Ie. new_mem_stack_aligned_4
would be used if present; otherwise fall back to new_mem_stack_aligned_gen
if present; otherwise fall back to new_mem_stack if present; else do
nothing.
One complication -- how do we know at compile-time if a stack adjustment
is aligned?  We can't (AFAICT), so maybe the events shouldn't have any
mention of alignment, and it's up to the skin to do an alignment check and
speed up its actions based on this if it wants.  So the events might be
new_mem_stack_4, new_mem_stack_8, ..., new_mem_stack_gen.
Doing some highly unjustifiable mental calculations, I think this could
result in a speedup for Memcheck of 15%, and Addrcheck a bit better, maybe
20%, at least for my test program. That would be pretty nice.
I'll definitely look into this once I've finished looking at moving
malloc() et al out of core.
N

From: Julian S. <js...@ac...> - 2003-03-30 01:20:01

> Your idea is quite plausible.  Basically a skin would bypass the core's
> built-in way of handling stack updates in order to do them itself, because
> it can be more efficient that way.
>
> Which makes me think: if we want to make this improvement for
> {Mem,Addr}check, any other skin that tracks %esp changes probably wants it
> as well. So let's try to improve the general mechanism rather than
> implementing a {Mem,Addr}check-only optimisation.
Yes, indeed. But your suggestion below is an improvement to the general
mechanism, not just to {Mem,Addr}check, yes?
> You could then have a number of trackable events for skins to hook into:
>
> new_mem_stack_aligned_4
> new_mem_stack_aligned_8
> etc.
> die_mem_stack_aligned_4
> die_mem_stack_aligned_8
> etc.
>
> for the most common cases (eg. 4, 8, 12, 16, 20, 24). They would be
> passed the old %esp. The skins could have unrolled versions of the
> general stack-adjusting code for these cases.
>
> Also have:
>
> new_mem_stack_aligned_gen
> new_mem_stack
> die_mem_stack_aligned_gen
> die_mem_stack
>
> If a skin didn't provide these special case functions, the core could fall
> back to using the general case ones if they were provided -- this would be
> useful when first writing skins, when you don't want to write five
> different versions of the same function.  Ie. new_mem_stack_aligned_4
> would be used if present; otherwise fall back to new_mem_stack_aligned_gen
> if present; otherwise fall back to new_mem_stack if present; else do nothing.
>
> One complication -- how do we know at compile-time if a stack-adjustment
> is aligned? We can't (AFAICT) so maybe the events shouldn't have any
> mention of alignment, and it's up to the skin to do an alignment check and
> speed up its actions based on this if it wants. So the events might be
> new_mem_stack_4, new_mem_stack_8, ..., new_mem_stack_gen.
Yes.  Although in practice the stack is almost always kept at least
4-aligned, we can't be sure of this, so all the functions need to check it
themselves.
Tell me if the following proposal makes sense:
-------------
any skin which wishes to track stack
memory permissions must provide at least the general functions
new_mem_stack
die_mem_stack
They may optionally provide any subset of the following:
{new,die}_mem_stack_{4,8,12,16,20,24,28,32}
in which case those will get called in preference, in cases where the code
generator can establish what the delta is, and generally when convenient.
The code generator will try to call the specialised fns but does not
promise to do so if that isn't convenient for it.
The specialised functions are passed %ESP after the adjustment has been made,
ie the new value. The general fns are passed old and new values. None of the
functions can make any assumptions about alignment; they have to check for
themselves.
-------------
Another performance benefit, albeit a small one, is that the specialised
fns take one param rather than two, which means less register shuffling
etc in the call sequence.
As a sanity check, let me ask (possibly again): does this scheme in any way
mess up the nice core/skin split, or generally clog up your architectural
cleanups in any other way?
> I'll definitely look into this once I've finished looking at moving
> malloc() et al out of core.
Great.
J

From: Nicholas N. <nj...@ca...> - 2003-03-30 09:11:31

On Sun, 30 Mar 2003, Julian Seward wrote:
> any skin which wishes to track stack memory permissions must provide at
> least the general functions
>
> new_mem_stack
> die_mem_stack
>
> They may optionally provide any subset of the following:
>
> {new,die}_mem_stack_{4,8,12,16,20,24,28,32}
>
> in which case those will get called in preference, in cases where the code
> generator can establish what the delta is, and generally when convenient for
> the code generator. Code generator will try to call the specialised fns but
> does not promise to do so, if it isn't convenient for it.
>
> The specialised functions are passed %ESP after the adjustment has been made,
> ie the new value. The general fns are passed old and new values. None of the
> functions can make any assumptions about alignment; they have to check for
> themselves.
Yes.  Or maybe the specialised functions can be passed the old %ESP; that
might make slightly more sense for them.
Something else occurred to me, though.  This "trackable event" is something
the skin could handle itself -- it could spot PUT %ESP instructions on its
own -- it's currently done by the core just as a convenience for skins.
So my second idea is this: make the skins do this. Don't have any core
"trackable events" for this (new_mem_stack, etc). Instead, have the core
provide a function VG_(ESP_changed_by)(). When a skin instruments a PUT
instruction, it can call this to see what the ESP delta is. Some special
value (0? 0xcfffffff? 0x87654321?) would indicate "don't know". The skin
can then choose to CCALL whatever function it likes based on this value,
eg. a specialised one if it was a common delta, or a general one if an
uncommon or unknown delta.
It would have exactly the same effect as doing it with "trackable events",
but gives slightly more flexibility to the skin -- it could have as many
specialised %esp-adjustment functions as it wanted, with whatever
arguments it wanted -- and would not saddle the core with this family of
"trackable events".  And it wouldn't make life much more difficult for the
skin... well, the only onus is on the skin writer to realise they have to
use this function if they want to know about %esp adjustments.  But skin
writers have to know about a lot already, so I don't think this is so bad.
Combine this idea with the move-malloc-into-skins idea, and we've lost
quite a few "trackable events" -- new_mem_heap, new_mem_stack, etc.  This
represents a minor shift in philosophy.  Previously, "trackable events"
covered things that couldn't be detected by the skin, plus enough other
events to cover all memory operations of interest.  With these proposed
changes, "trackable events" would only cover those things that cannot be
detected by the skin.
In one way this is bad -- there's no uniform mechanism for detecting all
interesting memory ops. But in another way it's good -- the core is purer
(more RISC-like, if that sounds better :) since "trackable" events really
are only those things that cannot be detected by a skin.
To me, this feels like the right way to do things.
> As a sanity check, let me ask (possibly again): does this scheme in any way
> mess up the nice core/skin split, or generally clog up your architectural
> cleanups in any other way?
No. And with my new proposed change, it's even better :)
N