From: Julian S. <js...@ac...> - 2002-12-14 19:58:20
Attachments:
ac_main.c
I've spent several hours messing with this and arrived at the attached
result. The outcome so far: some programs (bzip2) run about 25% faster,
others (OO) are unchanged, and moz is a little slower (57 s increases to
62 s). The attached version measures a load of numbers.
There were a couple of performance bogons in the 73- patch. The first is
that it's important to treat requests to make an address range accessible
as an opportunity to prefetch that range into the cache: in most cases,
especially with %esp moving down, it's a pretty good bet that the new
area is just about to be referenced.
Doing so dramatically reduces the miss rate. That allows the cache to be
shrunk from 2^20 entries to 2^14 (currently) or even 2^13, which helps a
lot, presumably because the real machine's D1/L2 caches are no longer
hammered so hard.
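The scheme above might be sketched like this (a minimal sketch, not the actual ac_main.c code: the names cache_prefetch, cache_hit and valid_cache_sketch are illustrative, and the 2^13-entry direct-mapped layout is taken from the sizes mentioned above):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the direct-mapped validity cache: 2^13 word
   entries, indexed by low address bits.  A slot holds the full address
   of the last word known to be accessible that mapped there. */
#define CACHE_ENTRIES (1u << 13)
static uint32_t valid_cache_sketch[CACHE_ENTRIES];

/* Prefetch: called when an address range is made accessible, e.g. as
   %esp moves down, on the bet that it is about to be referenced. */
static void cache_prefetch(uint32_t a) {
    valid_cache_sketch[(a >> 2) & (CACHE_ENTRIES - 1)] = a;
}

/* Fast path on every access: a hit means the word is known-accessible
   and the full A-bitmap check can be skipped. */
static int cache_hit(uint32_t a) {
    return valid_cache_sketch[(a >> 2) & (CACHE_ENTRIES - 1)] == a;
}
```

With a small cache like this, the working set stays resident in the host's D1 cache, which is the point of shrinking it from 2^20 entries.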
Finally, the miss handlers all call cache_valid_word() to fetch into the
cache. Mostly they do this (e.g. from the ACCESS4 function) when they
have already established that the word in question is accessible, so
cache_valid_word's test for validity is redundant.
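The redundancy could be sketched as follows (hypothetical names throughout: is_word_addressable() merely stands in for the real A-bitmap test, and cache_insert_known_valid is a variant I'm imagining, not an existing ac_main.c function):

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_ENTRIES (1u << 13)
static uint32_t vcache[CACHE_ENTRIES];
static int bitmap_tests;   /* counts A-bitmap validity tests performed */

/* Stand-in for the real addressability test against the A bitmap. */
static int is_word_addressable(uint32_t a) {
    bitmap_tests++;
    return 1;              /* sketch: pretend every word passes */
}

/* What cache_valid_word() effectively does: test validity, then fill. */
static int cache_valid_word_sketch(uint32_t a) {
    if (!is_word_addressable(a))
        return 0;
    vcache[(a >> 2) & (CACHE_ENTRIES - 1)] = a;
    return 1;
}

/* A miss handler such as ACCESS4 has already proved the word
   accessible, so an insert-only variant would skip the second test. */
static void cache_insert_known_valid(uint32_t a) {
    vcache[(a >> 2) & (CACHE_ENTRIES - 1)] = a;
}
```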
That said, performance is still not good enough to make it worthwhile.
Looking at the counts for make_{noaccess,writable}_aligned shown by the
attached version, I think we are losing a great deal of time messing
with the stack permissions every time %esp changes. Not only is the loop
containing that counter executed a lot; the loop body is surprisingly
long (see below), and then there is the cost of getting there at all
from handle_esp_assignment in vg_memory.c. So I'd guess that doing
something about this would help performance, and would probably help
memcheck too. The only problem is that I can't think of a way to
improve it :-(
This is all a bit disappointing, because for progs which don't mess
with %esp much, it makes a big improvement.
J
Main loop in ac_make_writable_aligned; it is executed once for each word
of stack covered or uncovered (!):
.L247:
movl %esi, %eax
shrl $16, %eax
leal 0(,%eax,4), %ebx
cmpl $distinguished_secondary_map, (%ebx,%ebp)
jne .L249
pushl $.LC35
call alloc_secondary_map
movl %eax, (%ebx,%ebp)
addl $4, %esp
.L249:
movl %esi, %eax
shrl $16, %eax
movl primary_map(,%eax,4), %ebx
movl %esi, %edx
andl $65535, %edx
movl $15, %eax
movl %esi, %ecx
andl $4, %ecx
sall %cl, %eax
shrl $3, %edx
notl %eax
andb %al, (%edx,%ebx)
incl mw_aligned
movl %esi, %eax
andl $32764, %eax
movl %esi, valid_cache(%eax)
addl $4, %esi
cmpl %edi, %esi
jb .L247
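For reference, here is a rough C rendering of that loop, reconstructed from the asm above to show why the body is so long. The _sk names are my stand-ins for the real identifiers, and the constants (8192-byte secondaries, the 2^13-entry cache inferred from the 32764 mask) are read off the instructions, not copied from ac_main.c:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PRIMARY_ENTRIES (1u << 16)  /* one secondary map per 64K of address space */
#define SECONDARY_BYTES 8192        /* 64K A-bits, one per byte, packed 8 per byte */
#define CACHE_ENTRIES   (1u << 13)  /* inferred from the 32764 (0x7FFC) mask */

static uint8_t  distinguished_sk[SECONDARY_BYTES];  /* shared "all inaccessible" map */
static uint8_t *primary_map_sk[PRIMARY_ENTRIES];
static uint32_t valid_cache_sk[CACHE_ENTRIES];
static uint32_t mw_aligned_sk;

static void init_maps_sk(void) {
    memset(distinguished_sk, 0xFF, sizeof distinguished_sk);
    for (uint32_t i = 0; i < PRIMARY_ENTRIES; i++)
        primary_map_sk[i] = distinguished_sk;
}

static uint8_t *alloc_secondary_sk(void) {
    uint8_t *sm = malloc(SECONDARY_BYTES);
    memset(sm, 0xFF, SECONDARY_BYTES);   /* start out all-inaccessible */
    return sm;
}

static void make_writable_aligned_sk(uint32_t a, uint32_t end) {
    for (; a < end; a += 4) {
        /* .L247: lazily replace the shared distinguished secondary map. */
        if (primary_map_sk[a >> 16] == distinguished_sk)
            primary_map_sk[a >> 16] = alloc_secondary_sk();
        uint8_t *sm = primary_map_sk[a >> 16];
        /* Clear this word's 4 A-bits: nibble (a & 4) of byte (a & 0xFFFF) >> 3. */
        sm[(a & 0xFFFFu) >> 3] &= (uint8_t)~(0xFu << (a & 4));
        mw_aligned_sk++;
        /* Prefetch the word into the direct-mapped validity cache. */
        valid_cache_sk[(a & 0x7FFCu) >> 2] = a;
    }
}
```

Seen this way, every 4-byte %esp move pays for a primary-map lookup, a possible allocation test, a read-modify-write on the bitmap, a counter bump, and a cache store.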