From: Jeremy F. <je...@go...> - 2002-11-23 19:19:48
On Sat, 2002-11-23 at 04:34, Julian Seward wrote:
> - fastjcc is probably always no effect or a win. It is no effect
> in "loop" because that jumps back to the loop start with a case
> which isn't covered by fastjcc, unfortunately. I was wondering
> how difficult it would be to cover the L/NGE, NL/GE, LE/NG and NLE/G
> cases -- exprs of the form ((SF xor OF) or ZF) == 1 or 0. I can't
> think of a neat way to do xor of two bits alas, and your implementation
> is neat indeed. Even if those cases took (eg) 4 insns instead of 1,
> it would probably be better than the 10+ cycle loss of popfl.
>
> For all real progs I expect it is a big win.
Even better if we can stomp all the other flag stack ops (I was thinking
we should add a flag to synth_jcond_lit() to tell it the flags are
already in the CPU flags register, so it can skip all the testing mess).
I was thinking about testing for other conditions (sequences much like
you suggest in your other mail), but didn't bother with it just yet
because it doesn't seem to affect many branches (statically; obviously
only one branch is significant if it's in the right loop).
> - chindir looks suspiciously like it slows some things down, although
> I couldn't convince myself either way, even with the ALL-chindir
> measurements. Maybe it's just measurement noise.
I get the same impression. It mostly speeds things up slightly, but
could slow things down slightly. The trouble is that the patching
mechanism is pretty slow compared to the dispatch-loop path (writing
into the instruction stream is *evil*), so you really want it to pay
for itself later. Most programs don't have nearly as many indirect
jumps and returns as direct jumps. That said, it does seem to be worth
a few percent gain for C++ code, and never seems to cost that much.
Are you measuring chain-ret and chain-indirect together as chindir?
> The comment in vg_dispatch.S is good, but I still am a bit unclear
> as to the precise behaviour of the prediction mechanism. My impression
> is that after two consecutive jumps to the same target, the translation
> is patched with a compare-and-jump-directly-or-go-via-lookup piece
> of code. Also AIUI, there is no way to undo the patching and
> commit to some other target later, should the patched code start
> to consistently mispredict.
>
> Is my understanding correct? If so doesn't it potentially generate
> permanent mispredictions for returns from any function called from
> many places, or for unpredictable switch statements? Is there a
> way to adjust this mechanism so (like all good prediction mechanisms)
> it eventually forgets about ancient history, so it can track changes
> in the current environment?
The only time it would reevaluate is if it gets unchained (as part of
the LRU mechanism, for example) and rechained. My guess (which is
inherently suspicious) is that it isn't all that significant: either it
will predict the dominant target or it will pick at least one target
which has some proportion of total targets, which is enough to pay for
the cost of the test rather than always falling into the dispatch loop.
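To make the predict-twice-then-commit behaviour concrete, here's a tiny
C model (the names are mine, hypothetical; the real thing patches the
generated code via vg_dispatch.S rather than keeping a struct around):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the chaining predictor: after two consecutive jumps to the
   same target, the site commits to that target; from then on, only
   that target takes the fast path and everything else goes via the
   slow lookup. */
typedef struct {
    void *last;       /* target seen on the previous execution */
    void *committed;  /* once non-NULL, patched in permanently */
} ChainSite;

/* Returns 1 if this execution takes the fast (patched) path. */
static int chain_site_hit(ChainSite *s, void *target)
{
    if (s->committed != NULL)
        return s->committed == target;  /* fast only if it matches */
    if (s->last == target)
        s->committed = target;          /* 2nd consecutive hit: patch */
    s->last = target;
    return 0;                           /* this time we did the lookup */
}
```

Once committed is set it never changes short of an unchain/rechain,
which is where the permanent-misprediction worry comes from.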
I had a separate flag for ret precisely because I thought it would be
less useful than chained-indir. It turns out the opposite is true: chained-ret
seems to get wins where chained-indir is a loss. Probably because there
are some functions which end up being called from one site dynamically,
and other functions with lots of callers, but if we make returns to just
one of those callers cheaper it is a net win (obviously we'd like it to
be the most common call site rather than the first one to make a call in
a loop, but that would be too expensive).
> I'd like to see a program where it gives a clear gain ... do you
> have one?
valgrind --skin=none -q --fast-jcc=yes --chain-bb=yes \
  --enable-inceip=no --chain-indirect=no --chain-ret=no \
  /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed \
  coregrind/vg_from_ucode.i -quiet -dumpbase vg_from_ucode.i -O2 -version \
  -o /dev/null 2> /dev/null
time=31.55s ratio=15.85
vs
valgrind --skin=none -q --fast-jcc=yes --chain-bb=yes \
  --enable-inceip=no --chain-indirect=yes --chain-ret=yes \
  /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed \
  coregrind/vg_from_ucode.i -quiet -dumpbase vg_from_ucode.i -O2 -version \
  -o /dev/null 2> /dev/null
time=30.42s ratio=15.28
Not huge, but not bad.
This is on my P4 machine; in general V has horrible ratios on the P4
compared to running on my P3 (the ratio for the fastest setting on this
test on the P3 is around 10). The weightings of all these changes are
different between P3 and P4 (and I suspect other architectures as well:
the Via C3, for example, claims to implement pushfl/popfl as fast 2
cycle pipelined instructions, and it probably makes the fast-jcc change
less compelling; I haven't measured it yet).
It may be that we need to keep all of these as options, and generate
different profiles for different target CPUs to get the best
combination.
> - SYNCEIP is a good idea. Certainly I'll incorporate something like
> that, although I'm not sure of the final shape of it. Two issues:
> (1) precise exceptions. SYNCEIP doesn't give that as it stands.
> If a memory load/store should segfault and we wind up in the signal
> handler, we do not have the precise %EIP to hand at that point
> because there is no SYNCEIP before the LD/ST uinstr. That can
> cause problems in some obscure, if POSIXly-illegal, sighandling
> cases.
Hm, good point. We could generate a SYNCEIP before each load and store;
mightn't be too costly (and leave an option defaulting to
safe-but-slowish, but you can turn it off if your program isn't trying
to make use of controlled sigsegvs).
> (2) not sure how SYNCEIP would interact with proposed lazy eflags
> save/restore.
What's the problem? It doesn't touch the flags.
> Generally, should we stick with INCEIP+SYNCEIP, or have just SETEIP,
> or what? And how do we establish exactly where to insert EIP updates?
> Should the skin itself insert them (as per SYNCEIP)?
I was surprised at how neatly SYNCEIP turned out. I think adding an
explicit token in the instruction stream has worked more nicely than
some kind of implicit mechanism.
> 3. We're getting borkage (as is expected from change). I was surfing
> just now with konqueror on addrcheck on all of Jeremy's opts, and it
> crapped out (exited unexpectedly, but cleanly) for no apparent reason,
> several times in a row. Natively it's ok; on 1.0.X it's ok.
> (un?)Fortunately it also craps out when running on the cvs head, so
> we've got bogons somewhere.
How do you run konq under V? I tried running some kde thing (probably
kcachegrind) under it, but it kept getting caught up in helper
processes. Is that because I'm not running kde as my desktop env?
J
From: Julian S. <js...@ac...> - 2002-11-23 14:57:42
|
Jeremy
The best I can think of for ((SF xor OF) or ZF) == 1 or 0
(the LE/NG and NLE/G conditions) is
Let r denote a reg we can trash (a new complication, though Nick's liveness
analysis makes this easier). In the case where none are free (unlikely at
the end of a bb) we can bracket the sequence in push %r .. pop %r, since
those are flag-unaffecting.
    movl 32(%ebp), %r   -- %r := %EFLAGS
    shrl $delta, %r     -- where delta = log2(EFlagO) - log2(EFlagS)
                        -- now %r has O bit in S bit's position
    xorl 32(%ebp), %r   -- %r has (O xor S) in S bit's position
    shrl $1, %r         -- %r has (O xor S) in Z bit's position
    orl  32(%ebp), %r   -- %r has ((O xor S) or Z) in Z bit's position
    andl $nnn, %r       -- where nnn = (1 << EFlagZ)
now (real Z) is set iff ((SF xor OF) or ZF) == 1
Hmm. Is it worth it, I wonder.
For the simpler (SF xor OF) == 1 or 0 (L/NGE and NL/GE) the final
shift and or can clearly be omitted, which gives a more competitive
4-insn sequence.
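As a sanity check on the bit-twiddling, the same sequence in C (bit
positions are the architectural EFLAGS ones; the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* x86 EFLAGS bit positions (architectural constants). */
#define EFLAG_Z  6   /* ZF */
#define EFLAG_S  7   /* SF */
#define EFLAG_O  11  /* OF */

/* C model of the insn sequence above: compute ((SF xor OF) or ZF)
   into ZF's bit position using only shift/xor/or/and operations on
   the saved %EFLAGS word. */
static uint32_t le_cond_bit(uint32_t eflags)
{
    uint32_t r = eflags;
    r >>= (EFLAG_O - EFLAG_S);  /* O bit now in S's position */
    r ^= eflags;                /* (O xor S) in S's position */
    r >>= 1;                    /* (O xor S) in Z's position */
    r |= eflags;                /* ((O xor S) or Z) in Z's position */
    r &= (1u << EFLAG_Z);       /* isolate it */
    return r;
}
```

Note that only bits 11, 7 and 6 of the input ever reach the result,
so stray DF/IF/etc bits in the saved flags word do no harm.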
I wonder if there is a way to do the sequence with > 1 reg so as to
avoid the problem that all 6 insns are chained together thru a hard
data dependency on %r, and thus it has a latency, as it stands,
of at least 5 ALU + 1 LSU times, and cannot make use of multiple ALUs.
(assuming the 2nd and 3rd LSU uses can be overlapped with other
stuff). iow it has a poor schedule in current form.
J
From: Julian S. <js...@ac...> - 2002-11-23 13:35:29
|
Hi. Nice hacking. I made some measurements of the new stuff.
The test programs are the simple loop program discussed in previous
mail (25.2 million basic blocks), and bzip2 compressing a 700k .ps file
(77 million bbs). Also ktuberling, starting and exiting a silly
(if somewhat amusing) children's game on KDE.
               loop      bzip2      bzip2      bzip2     ktuberling
               nulgrind  nulgrind   addrcheck  memcheck  addrcheck
native         0.25      0.69       0.69       0.69      0.61
nobbchain      2.56      7.77       11.75      17.29     8.09
bbchain        2.23      6.08       10.11      15.68     7.58
chindir        2.18      6.17       10.14      16.02     7.38
fastjcc        2.22      5.14       9.17       14.98     7.38
synceip        1.60      4.48       8.88       14.69     6.61
ALL-chindir    1.59      4.46       8.91       14.81
native is native. nobbchain is with none of the recent opts. bbchain
adds bbchaining. chindir adds indirect bb chaining. fastjcc adds
fastjcc. synceip adds synceip (ie is all opts so far). ALL-chindir
is everything except chindir; I am a bit suspicious of that one and
wanted to see if it was slowing things down sometime.
Measurements made on a noisy PIII (ie, D was hacking C++ at the
same time), although I made runs when it was pretty much idle, and
the numbers are the best of >= 3 runs. Nevertheless there is some
level of noise, so don't take the above too precisely.
ktuberling's gains are smaller than the rest because it spends a
lot of time translating. It only runs for 32 million bbs but it
does translate about 940k of original code, which is a lot really.
Also spends considerable time reading full debug info from the
qt and kde .so's (I built them -O -g). Of course once it gets
going, I expect speed gains similar to the rest.
Just tried running konq on my 1.13 GHz P3 with full opts on addrcheck
and it's surprisingly usable. Great!
Some points to note
- bbchain is always a win. I'll move it into the head once I get
a good LRU story figured out.
- fastjcc is probably always no effect or a win. It is no effect
in "loop" because that jumps back to the loop start with a case
which isn't covered by fastjcc, unfortunately. I was wondering
how difficult it would be to cover the L/NGE, NL/GE, LE/NG and NLE/G
cases -- exprs of the form ((SF xor OF) or ZF) == 1 or 0. I can't
think of a neat way to do xor of two bits alas, and your implementation
is neat indeed. Even if those cases took (eg) 4 insns instead of 1,
it would probably be better than the 10+ cycle loss of popfl.
For all real progs I expect it is a big win.
I'll move this into the head too. It is always beneficial, and has
only minor and localised complexity.
- chindir looks suspiciously like it slows some things down, although
I couldn't convince myself either way, even with the ALL-chindir
measurements. Maybe it's just measurement noise.
The comment in vg_dispatch.S is good, but I still am a bit unclear
as to the precise behaviour of the prediction mechanism. My impression
is that after two consecutive jumps to the same target, the translation
is patched with a compare-and-jump-directly-or-go-via-lookup piece
of code. Also AIUI, there is no way to undo the patching and
commit to some other target later, should the patched code start
to consistently mispredict.
Is my understanding correct? If so doesn't it potentially generate
permanent mispredictions for returns from any function called from
many places, or for unpredictable switch statements? Is there a
way to adjust this mechanism so (like all good prediction mechanisms)
it eventually forgets about ancient history, so it can track changes
in the current environment?
I'd like to see a program where it gives a clear gain ... do you
have one?
- SYNCEIP is a good idea. Certainly I'll incorporate something like
that, although I'm not sure of the final shape of it. Two issues:
(1) precise exceptions. SYNCEIP doesn't give that as it stands.
If a memory load/store should segfault and we wind up in the signal
handler, we do not have the precise %EIP to hand at that point
because there is no SYNCEIP before the LD/ST uinstr. That can
cause problems in some obscure, if POSIXly-illegal, sighandling
cases.
(2) not sure how SYNCEIP would interact with proposed lazy eflags
save/restore.
Generally, should we stick with INCEIP+SYNCEIP, or have just SETEIP,
or what? And how do we establish exactly where to insert EIP updates?
Should the skin itself insert them (as per SYNCEIP)? Or should there
be a redundant SETEIP-removal pass done by the core, which asks uinstr-
adding skins whether a uinstr could need to know EIP? How should we
handle EIP updates needed by the core itself, specifically if we want
to supply precise exceptions? [probably disabled by default, btw]
------------
I think it will soon be time to "pull over" and consolidate what we've
got (which is some nice speedups), since:
1. I'd like to get this thing out the door sometime this century :)
2. Nick is disappearing from active hacking in about a week, really
3. We're getting borkage (as is expected from change). I was surfing
just now with konqueror on addrcheck on all of Jeremy's opts, and it
crapped out (exited unexpectedly, but cleanly) for no apparent reason,
several times in a row. Natively it's ok; on 1.0.X it's ok.
(un?)Fortunately it also craps out when running on the cvs head, so
we've got bogons somewhere.
J
From: Jeremy F. <je...@go...> - 2002-11-23 03:38:30
|
I just uploaded a patch which seems to do a good job of killing INCEIP
without being overly complex or putting undue burden on skins.
I added a new UInstr, SYNCEIP, which skins can insert as part of their
instrumentation wherever they want to be sure that the EIP has been
updated to match the execution state of the program.
SYNCEIP just generates a constant store into the m_eip slot of
baseBlock. However, there are a number of useful optimisations it can
do. Obviously, if the EIP hasn't changed since the last SYNCEIP, it
doesn't need to emit anything. More interestingly, if the EIP has only
changed in its lower 8 bits, it can just emit a byte write rather than a
32-bit write. This is the same sized instruction as INCEIP's add, but
doesn't change the flags.
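The emission decision, sketched in C (the names are mine, not the
actual code):

```c
#include <stdint.h>

/* Sketch of the SYNCEIP emission decision: skip the store entirely if
   EIP is already synced, use a 1-byte store if only the low 8 bits
   changed (same size as INCEIP's add, but flag-neutral), else emit a
   full 32-bit constant store into baseBlock's m_eip slot. */
enum SyncKind { SYNC_NONE, SYNC_BYTE, SYNC_WORD };

static enum SyncKind synceip_kind(uint32_t last_synced, uint32_t eip)
{
    if (eip == last_synced)
        return SYNC_NONE;                  /* nothing to emit */
    if ((eip & ~0xFFu) == (last_synced & ~0xFFu))
        return SYNC_BYTE;                  /* byte write, flags untouched */
    return SYNC_WORD;                      /* full 32-bit store */
}
```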
For the nulgrind skin, the performance improvements are very good
(execution times are 70% to 50% of the INCEIP version). It makes much
less difference for memcheck, because the overhead of the skin's
instrumentation is more significant (also, I just conservatively
inserted a SYNCEIP before every instruction anyway; maybe fewer can be
inserted with more care).
I also implemented Julian's suggestion for more efficient Jcc
instructions. It works well (and definitely suggests we should remove
as many pushfl/popfl instructions as possible).
These and more at the usual place.
J