From: Julian S. <js...@ac...> - 2002-11-23 13:35:29
Hi. Nice hacking. I made some measurements of the new stuff.
The test programs are the simple loop program discussed in previous
mail (25.2 million basic blocks), and bzip2 compressing a 700k .ps file
(77 million bbs). Also ktuberling, starting and exiting a silly
(if somewhat amusing) children's game on KDE.

              loop      bzip2     bzip2      bzip2     ktuberling
              nulgrind  nulgrind  addrcheck  memcheck  addrcheck

native          0.25      0.69      0.69       0.69      0.61
nobbchain       2.56      7.77     11.75      17.29      8.09
bbchain         2.23      6.08     10.11      15.68      7.58
chindir         2.18      6.17     10.14      16.02      7.38
fastjcc         2.22      5.14      9.17      14.98      7.38
synceip         1.60      4.48      8.88      14.69      6.61
ALL-chindir     1.59      4.46      8.91      14.81

native is the program run natively. nobbchain is with none of the
recent opts. Each row after that adds one opt to the row above:
bbchain adds bbchaining, chindir adds indirect bb chaining, fastjcc
adds fastjcc, and synceip adds synceip (ie is all opts so far).
ALL-chindir is everything except chindir; I am a bit suspicious of
that one and wanted to see if it was slowing things down sometimes.
Measurements made on a noisy PIII (ie, D was hacking C++ at the
same time), although I made runs when it was pretty much idle, and
the numbers are the best of >= 3 runs. Nevertheless there is some
level of noise, so don't take the above too precisely.
ktuberling's gains are smaller than the rest because it spends a
lot of time translating. It only runs for 32 million bbs but it
does translate about 940k of original code, which is a lot really.
Also spends considerable time reading full debug info from the
qt and kde .so's (I built them -O -g). Of course once it gets
going, I expect speed gains similar to the rest.
Just tried running konq on my 1.13 GHz P3 with full opts on addrcheck
and it's surprisingly usable. Great!
Some points to note
- bbchain is always a win. I'll move it into the head once I get
a good LRU story figured out.
- fastjcc is probably always no effect or a win. It is no effect
in "loop" because that jumps back to the loop start with a case
which isn't covered by fastjcc, unfortunately. I was wondering
how difficult it would be to cover the L/NGE, NL/GE, LE/NG and NLE/G
cases -- exprs of the form (SF xor OF) == 1 or 0 for the first pair,
and ((SF xor OF) or ZF) == 1 or 0 for the second. I can't
think of a neat way to do xor of two bits alas, and your implementation
is neat indeed. Even if those cases took (eg) 4 insns instead of 1,
it would probably be better than the 10+ cycle loss of popfl.
For all real progs I expect it is a big win.
I'll move this into the head too. It is always beneficial, and has
only minor and localised complexity.
- chindir looks suspiciously like it slows some things down, although
I couldn't convince myself either way, even with the ALL-chindir
measurements. Maybe it's just measurement noise.
The comment in vg_dispatch.S is good, but I still am a bit unclear
as to the precise behaviour of the prediction mechanism. My impression
is that after two consecutive jumps to the same target, the translation
is patched with a compare-and-jumpdirectly-or-go-via-lookup piece
of code. Also AIUI, there is no way to undo the patching and
commit to some other target later, should the patched code start
to consistently mispredict.
Is my understanding correct? If so, doesn't it potentially generate
permanent mispredictions for returns from any function called from
many places, or for unpredictable switch statements? Is there a
way to adjust this mechanism so (like all good prediction mechanisms)
it eventually forgets about ancient history, so it can track changes
in the current environment?
I'd like to see a program where it gives a clear gain ... do you
have one?
- SYNCEIP is a good idea. Certainly I'll incorporate something like
that, although I'm not sure of the final shape of it. Two issues:
(1) precise exceptions. SYNCEIP doesn't give that as it stands.
If a memory load/store should segfault and we wind up in the signal
handler, we do not have the precise %EIP to hand at that point
because there is no SYNCEIP before the LD/ST uinstr. That can
cause problems in some obscure, if POSIXly-illegal, sighandling
cases.
(2) not sure how SYNCEIP would interact with proposed lazy eflags
save/restore.
Generally, should we stick with INCEIP+SYNCEIP, or have just SETEIP,
or what? And how do we establish exactly where to insert EIP updates?
Should the skin itself insert them (as per SYNCEIP)? Or should there
be a redundant SETEIP-removal pass done by the core, which asks uinstr-
adding skins whether a uinstr could need to know EIP? How should we
handle EIP updates needed by the core itself, specifically if we want
to supply precise exceptions? [probably disabled by default, btw]
------------
I think it will soon be time to "pull over" and consolidate what we've
got (which is some nice speedups), since:
1. I'd like to get this thing out the door sometime this century :)
2. Nick is disappearing from active hacking in about a week, really
3. We're getting borkage (as is expected from change). I was surfing
just now with konqueror on addrcheck on all of Jeremy's opts, and it
crapped out (exited unexpectedly, but cleanly) for no apparent reason,
several times in a row. Natively it's ok; on 1.0.X it's ok.
(un?)Fortunately it also craps out when running on the cvs head, so
we've got bogons somewhere.
J