From: Julian S. <js...@ac...> - 2002-11-19 23:58:51
|
> > Yesterday afternoon I ran KDE-3.1-rc3 on addrcheck. This is the entire
> > process tree starting from startkde. It pretty much worked, but was a
> > bit slow and one process bombed valgrind. All output was logged to a
> > different machine running the listener.
>
> Wow that sounds great. How have things come since Monday? Are your
> scripts and tools for this all packaged up?
No, but we can do 'make dist' and write a half-page README if you want to
try it. I found that valgrind bombed once during the previous session, and
I wanted to track it down. + we are not in a total hurry to release it.
But as I say, I am very happy to make you a snapshot tarball (version
1.9.1) if you want.
> It would be nice to see those Arts errors too. Arts has lots of memory
> problems on PPC architecture apparently.
Attached. I don't think they are particularly interesting. I'm not using
artsd for anything. After startup I get a box saying "Sound server CPU
overload" or something like that; when I clicked OK the attached gunk came
out. I tried to find all lines in the log for this process (5805) but I am
not sure why the startup stuff is missing; that would tell us exactly
what's running.
Note, this is with RC3; I haven't scagged RC4 yet.
> It really seems that my testing suggestion is far too late for 3.1.0, but
> I'd like to start formalising it for 3.2.0.
You are doing a Good Thing here, and you have my encouragement.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 23:14:13
|
On Tue, 2002-11-19 at 11:10, Julian Seward wrote:
> > I got basic block chaining working last night. I got about 25%
> > improvement (which is nice, but I was hoping for more) in the particular
> > benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> > whole, the performance was pretty dismal: the native run took about 4.6
> > seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> > chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> > I was hoping it would be a largely CPU-bound test, but maybe it's
> > actually spending all its time in malloc or something.
>
> Strange it's so slow. I think you should try bzip2; that is almost
> completely compute bound and does something like 7 malloc calls per
> file processed. Here's what I have:
>
> time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps
> > /dev/null
> (valgrind startup msgs deleted)
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m7.760s
> user 0m7.670s
> sys 0m0.030s
>
> time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m0.738s
> user 0m0.680s
> sys 0m0.050s
>
> So more like a 11 - 12 x slowdown than the 35 x you get.
Well, I found that there is an easy way of getting V to generate
extended basic blocks: simply allow conditional jumps to not end the
basic block (the code was all set up to do it, complete with
almost-accurate comment). I haven't done an overall measurement of what
the average increase in BB length is, but from looking at the output of
--trace-codegen, they can get quite long. It certainly allows us to see
what the effect of having long register lifetimes is vs.
indiscriminately over-compiling things. The results are far from
conclusive: it seems to vary between a few percent better and a few
percent worse (and it also seems to depend on the skin, though I haven't
really tested that yet).
These are the results I'm seeing:
baseline: bzip2 < TAGS > /dev/null
time=0.49s
valgrind --skin=none --chain-bb=no --extended-bb=no bzip2 < TAGS > /dev/null
time=6.15s ratio:12.5
valgrind --skin=none --chain-bb=yes --extended-bb=no bzip2 < TAGS > /dev/null
time=4.95s ratio:10.1
valgrind --skin=none --chain-bb=no --extended-bb=yes bzip2 < TAGS > /dev/null
time=6.17s ratio:12.5
valgrind --skin=none --chain-bb=yes --extended-bb=yes bzip2 < TAGS > /dev/null
time=5.48s ratio:11.1
baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i \
-quiet -dumpbase x.i -O2 -version -o /dev/null
time=4.44s
valgrind --skin=none --chain-bb=no --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=79.88s ratio:17.9
valgrind --skin=none --chain-bb=yes --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=59.14s ratio:13.3
valgrind --skin=none --chain-bb=no --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=73.77s ratio:16.6
valgrind --skin=none --chain-bb=yes --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=58.16s ratio:13.0
> > My next experiment might be [...]
>
> Let me encourage you to make measurements (direct vs indirect jump counts)
> to gain insight into your current hackery, before embarking on more. I for
> one would like to be assured with numbers that the Right Thing is happening
> and that our assumptions about costs, event frequencies, etc, are justified.
Oh, yes, I quite agree. The results above make me think that doing
naive trace caching probably won't help, but eliminating dispatch-loop
calls probably will.
I've put the patch up in the usual place.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 23:11:16
|
Hi. Am trying to be a bit more diligent re testsuite failures now that
I've written down how to run it on a bit of paper :) Just did cvs up,
rebuilt from scratch. I've got 3 fails, which is a whole lot better than a
few days ago. Are they expected?
J
== Failed tests ===============================
memcheck/tests/memalign_test (stderr)
memcheck/tests/supp2 (stderr)
memcheck/tests/tronical (stderr)
|
|
From: George S. <st...@kd...> - 2002-11-19 21:31:57
|
On Monday November 18 2002 04:11, Julian Seward wrote:
> Just fwiw,
>
> Yesterday afternoon I ran KDE-3.1-rc3 on addrcheck. This is the entire
> process tree starting from startkde. It pretty much worked, but was a
> bit slow and one process bombed valgrind. All output was logged to a
> different machine running the listener.
Wow that sounds great. How have things come since Monday? Are your
scripts and tools for this all packaged up? Pardon my ignorance if this is
well known already. I've been in and out of "the loop" for the past few
weeks.
> So following some robustification, I think this might be workable
> (P4 1.7 GHz, 512 MB). Even from my small test it picked up some
> reads/writes of freed memory in artsd (something to do with KDE
> sound I think).
It would be nice to see those Arts errors too. Arts has lots of memory
problems on PPC architecture apparently.
It really seems that my testing suggestion is far too late for 3.1.0, but
I'd like to start formalising it for 3.2.0.
--
George Staikos
|
|
From: Julian S. <js...@ac...> - 2002-11-19 19:08:35
|
> > 1. Lack of translation chaining. Has been discussed a lot already, and I
> > think has reasonable probability of getting addressed at some point,
> > since we're now floating detailed ideas for doing it.
>
> I think this is major, and we should be able to tell soon: I hacked up
> most of an implementation on the train this afternoon. Doesn't quite
> work yet, but it looks convincing.
Cool! Keep on hacking. I'd be very interested to hear the outcome of this.
> > 2. The inability to cache simulated regs in real ones across bb
> > boundaries (they all get flushed with PUTs at the end). I think JF
> > mentioned this at some point. I don't know how much effect this has.
>
> I think this is relatively minor compared to chaining,
Possibly true. I think trace caching would break cachegrind, which makes
some assumptions about uniqueness of bb translations (or something? Nick?
but that can't be right, because it copes with multiple tail translations
of bbs, as we discussed once). Unfortunately there is no easy way I can
see to assess the performance loss from this inability to hold values in
regs for long.
> > 3. (nobody mentioned this, but I think it is very significant): the
> > inability of the code generator to consider groups of UInstrs together
> > when generating code. Specifically, this has the following bad effect.
> > [...]
> > This is surely something worth looking into. An interesting mechanism
> > to create would be one which measures the (dynamic) frequency of
> > adjacent uinstr pairs, to search for common sequences (pairs, to start
> > with) which would be worth putting special code-generation effort into.
> > Of course the results would depend on what instrumentation had been added
> > by the skin.
>
> I'm not sure it's all that important. It would reduce the actual
> instruction count and register pressure, but I'm not sure it would do
> all that much for modern CPUs (the Via C3 being the exception, and it
> does need all the help it can get).
Um, yes, I guess because it merely produces insns which will turn mostly
into single micro-ops in the P6/P7/K6/K7 cores anyway, yes? I'd forgotten
about that. What it does mean is that the translated code is more likely
to stall due to lack of insn decoder bandwidth, compared to "normal" code.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 19:03:51
|
On Tuesday 19 November 2002 5:20 pm, Jeremy Fitzhardinge wrote:
> On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> > There are sure to be other problems, though. And I have no idea how it
> > would affect the MemCheck instrumentation; already the code produced
> > contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> > needs %ESP to be up-to-date.
>
> Perhaps a slightly simpler approach would be to dedicate a register to
> ESP rather than a memory location, so that it is persistently in
> register across basic blocks. With any luck the register allocator
> could then remove redundant moves, and turn (assuming we reserve %edi
> for %ESP):
Yes, this seems a good compromise. We already played with the issue of
5 vs 6 allocatable registers and it makes very little difference at least
to memcheck, so perhaps it would be ok to give one to %ESP.
Nick, one thing you seemed to not comment on is ...
> addl $2, 36(%ebp) (the INCEIP)
ie, we keep exact track of %eip; does DynamoRIO? What sort of
exception model are they presenting?
> I got basic block chaining working last night. I got about 25%
> improvement (which is nice, but I was hoping for more) in the particular
> benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> whole, the performance was pretty dismal: the native run took about 4.6
> seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> I was hoping it would be a largely CPU-bound test, but maybe it's
> actually spending all its time in malloc or something.
Strange it's so slow. I think you should try bzip2; that is almost
completely compute bound and does something like 7 malloc calls per
file processed. Here's what I have:
time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps
> /dev/null
(valgrind startup msgs deleted)
(stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
real 0m7.760s
user 0m7.670s
sys 0m0.030s
time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
(stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
real 0m0.738s
user 0m0.680s
sys 0m0.050s
So more like an 11 - 12 x slowdown than the 35 x you get.
> My next experiment might be [...]
Let me encourage you to make measurements (direct vs indirect jump counts)
to gain insight into your current hackery, before embarking on more. I for
one would like to be assured with numbers that the Right Thing is happening
and that our assumptions about costs, event frequencies, etc, are justified.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 17:20:13
|
On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> There are sure to be other problems, though. And I have no idea how it
> would affect the MemCheck instrumentation; already the code produced
> contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> needs %ESP to be up-to-date.
Perhaps a slightly simpler approach would be to dedicate a register to
ESP rather than a memory location, so that it is persistently in
register across basic blocks. With any luck the register allocator
could then remove redundant moves, and turn (assuming we reserve %edi
for %ESP):
movl %edi, %ecx
subl $0x4, %ecx
movl %ecx, %edi
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
into
subl $0x4, %edi
xorl %edx, %edx
movl %edx, (%edi)
addl $2, 36(%ebp)
which, while not being push 0x0, isn't too far off. Will still need to
work out appropriate times to save ESP into the baseblock for when
control enters C code.
It seems to me that these optimisations we're talking about all feed
into each other, so there's a synergistic effect:
1. The reason we've got so much memory traffic to the baseBlock is
that we don't have enough registers to keep the simulated
registers in real registers while having enough working
registers for generated code
2. The reason we need lots of working registers is because we break
all the x86 instructions up, and therefore need more temps
3. Even so, we can't possibly really fit all the simulated
registers into real registers, so we're always going to be
saving them off to memory, but we'd like to keep that to a minimum
4. We can't do a reasonable global register allocation to try and
keep the working set in registers as much as possible, because
we have a largely local, basic-block, view of the world (except,
perhaps, for special registers, like %esp which may well be used
in every basic block and is therefore worth keeping around).
therefore:
1. Using more compact instructions will cut down on our use of
temps, leaving more registers available for caching architecture
registers during a basic block (as well as making our
instruction footprint smaller)
2. Using trace caching will lengthen the effective size of a basic
block, and thereby amortize the register load/save over more
instructions
In other news:
I got basic block chaining working last night. I got about 25%
improvement (which is nice, but I was hoping for more) in the particular
benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
whole, the performance was pretty dismal: the native run took about 4.6
seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
chained-bb nulgrind took 60 seconds. I haven't looked into it further:
I was hoping it would be a largely CPU-bound test, but maybe it's
actually spending all its time in malloc or something.
I also haven't measured how many jumps actually get chained (statically
and dynamically). I only bother with direct jumps and calls, so all
indirects and rets still go through the dispatch loop. Also, some of
the functionality in the dispatch loop needs to be compiled into each
basic block (EIP and VG_(dispatch_ctr) update), which expands the
generated code size and probably detracts from the performance wins.
My next experiment might be to try some rudimentary trace caching by
doing as Josef suggested and having vg_to_ucode simply follow
unconditional jumps and thereby create large basic blocks (hm, better be
careful about infinite loops...). It might do nothing other than
massively increase the dynamic compiler costs...
J
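The "follow unconditional jumps, but be careful about infinite loops" experiment can be sketched with a visited set and a length cap. This is a toy model of the proposal with a made-up instruction encoding, not vg_to_ucode:

```python
def build_extended_block(code, entry, max_insns=64):
    """Translate across unconditional jumps, so one 'basic block'
    spans several jump-connected fragments. `code` maps address ->
    (opcode, operand); the `seen` set is the guard against a jump
    cycling back into the block forever. Toy model only."""
    block, seen, addr = [], set(), entry
    while addr in code and addr not in seen and len(block) < max_insns:
        seen.add(addr)
        op, operand = code[addr]
        if op == 'jmp':                    # unconditional: continue at target
            addr = operand
            continue
        block.append((addr, op, operand))
        if op in ('jcc', 'call', 'ret'):   # everything else still ends the block
            break
        addr += 1
    return block
```

A jump back to an already-translated address simply terminates the block, so the worst case degrades to today's behaviour rather than looping in the compiler.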
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-19 16:12:39
|
On Tue, 19 Nov 2002, Nicholas Nethercote wrote:
> I think it is important, ie. the fact that Valgrind virtualises all the
> registers and then loses where they were in the UCode.
Thinking about this some more, I think that one particular source of
slowdown is the fact that %esp is virtualised as %ESP.
This is because several very common instructions (push, pop, call, ret)
update %esp as a side-effect. For example:
pushl $0x0
becomes:
GETL %ESP, t4
SUBL $0x4, t4
PUTL t4, %ESP
MOVL $0x0, t6
STL t6, (t4)
INCEIPo $2
becomes:
movl 0x10(%ebp), %ecx
subl $0x4, %ecx
movl %ecx, 0x10(%ebp)
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
A lot of time is spent moving values to and from %ESP. I was wondering
about the possibility of "de-virtualising" %esp -- ie. store %ESP in %esp.
Then maybe the UCode translation could be as simple as "PUSHL 0x0".
I'm sure to have overlooked several bazillion complications here. For
example, %esp is used inside generated code, but I think that might be ok:
- for transferring %eflags to %EFLAGS -- just pushes then pops off the
top of the stack, should be fine
- for CCALLs -- saves caller-save regs, pushes args, clears args,
restores regs, should be fine?
There are sure to be other problems, though. And I have no idea how it
would affect the MemCheck instrumentation; already the code produced
contains redundant "PUTL tX, %ESP" instructions because MemCheck always
needs %ESP to be up-to-date.
I don't know, maybe it's totally impossible, but maybe it's not.
N
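The expansion Nick walks through can be written as a small translation function; the temp names t4/t6 follow the UCode shown above, but the tuple encoding is invented for this sketch and is not real vg_to_ucode output:

```python
def ucode_for_push(imm, eip_len=2):
    """Expand `pushl $imm` into the UCode sequence above: because
    %ESP lives in the baseBlock, the simulated stack pointer must be
    fetched, adjusted, and written back around the store. Toy model."""
    return [
        ('GETL',    '%ESP',  't4'),     # fetch simulated %esp
        ('SUBL',    4,       't4'),     # make room on the stack
        ('PUTL',    't4',    '%ESP'),   # write it back
        ('MOVL',    imm,     't6'),     # the value being pushed
        ('STL',     't6',    '(t4)'),   # store to the new stack slot
        ('INCEIPo', eip_len, None),     # advance simulated %eip
    ]
```

Six uinstrs (and six generated x86 insns) for what was a one-byte push is exactly the overhead the de-virtualising idea targets.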
|
|
From: Julian S. <js...@ac...> - 2002-11-19 13:51:02
|
> > Hmm, that differs from what the internal valgrind profiler itself tells
> > us ... about 30-40% of the time starting up mozilla, on memcheck, is
> > translation. That also corresponds (non-quantitatively) with the speed
> > variations in the lines printed by -v -v when running stuff (try it).
>
> Is there a set of benchmarks you've been using for this kind of stuff?
> It would be useful to have a set for evaluating all the proposals that
> have been flying around lately. It would be nice to have something less
> interactive than Mozilla, for example. I'm particularly short of nice
> examples of multithreaded programs which aren't all GUI.
Starting mozilla-1.0, which requires translating about 2.2M of original
code into ~45M of memcheck'd translations.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-19 10:57:36
|
On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > 3. (nobody mentioned this, but I think it is very significant): the inability
> > of the code generator to consider groups of UInstrs together when
> > generating code.
>
> I'm not sure it's all that important. It would reduce the actual
> instruction count and register pressure, but I'm not sure it would do
> all that much for modern CPUs (the Via C3 being the exception, and it
> does need all the help it can get).
I think it is important, ie. the fact that Valgrind virtualises all the
registers and then loses where they were in the UCode.
Look at DynamoRIO's translation of this small basic block:
add %eax, %ecx
cmp $4, %eax
jle $0x40106f
becomes
frag7: add %eax, %ecx
cmp $4, %eax
jle $0x40106f
stub1: mov %eax, eax-slot # spill %eax
mov &dstub1, %eax # store ptr to stub table
jmp context_switch
stub2: mov %eax, eax-slot # spill %eax
mov &dstub2, %eax # store ptr to stub table
jmp context_switch
Apart from control flow instructions, the resulting code (before
instrumentation and/or trace-level optimisation) is basically the same as
the original code.
Valgrind will turn it into UCode something like this:
GETL %ECX, t14
ADDL %EAX, t14 (-wOSZACP)
PUTL t14, %ECX
INCEIPo $2
GETL %EAX, t10
SUBL $0x1, t10 (-wOSZACP)
INCEIPo $3
Jleo $0x40106f (-rOSZACP)
JMPo $0x4013F9B4
There will be some optimisations, but then each UCode instr will become
one or more x86 instructions. The example isn't so convincing, but repeat
the exercise on a basic block with 10 or 20 instructions and you'll notice
the difference more.
I feel that this must be doing a lot of damage. I don't think it's a
coincidence that the Null skin increases code size by a factor of about 5
and also slows programs by a similar factor, and that Memcheck increases
code size by a factor of about 12--13 and slows programs down by a similar
factor.
> An interesting mechanism to create would be one which measures the
> (dynamic) frequency of adjacent uinstr pairs, to search for common
> sequences (pairs, to start with) which would be worth putting special
> code-generation effort into. Of course the results would depend on what
> instrumentation had been added by the skin.
I just did that... a patch for vg_from_ucode is attached, as are the
(sorted) results for a quick spin of konqueror with --skin=none...
N
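The measurement the attached patch makes can be sketched as a weighted adjacent-pair count over translated blocks. The block lists and execution weights below are invented inputs, not the real vg_from_ucode hook:

```python
from collections import Counter

def pair_frequencies(bbs, weights=None):
    """Count how often each (opcode, opcode) pair occurs adjacently,
    optionally weighted by how many times each block executed: the
    dynamic-frequency measurement suggested above. `bbs` is a list
    of opcode lists, one per basic block. Sketch, not the patch."""
    pairs = Counter()
    for i, bb in enumerate(bbs):
        w = weights[i] if weights else 1
        for a, b in zip(bb, bb[1:]):
            pairs[(a, b)] += w
    return pairs
```

Sorting the counter's output (as with the attached results) points straight at the pairs worth special-casing in the code generator.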
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 06:01:14
|
On Mon, 2002-11-18 at 21:55, Jeremy Fitzhardinge wrote:
> > Agree. The only other major source is implementation of switch statements
> > with dense ranges (jump tables).
>
> Another optimisation I'm thinking of is
[This is what happens when you're composing two emails in parallel.]
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 05:56:01
|
On Mon, 2002-11-18 at 17:04, Julian Seward wrote:
> > > That seems unavoidable. We'd probably just recycle the current
> > > arrangement, or perhaps a simplified version of it, in this case. Bear
> > > in mind that x86 RET instructions are going to pull an original address
> > > off the stack and jump to it, so there will still be a significant
> > > minority of such lookups needed.
> >
> > It's likely that rets will be the majority of jumps with unknown
> > destinations (though very OO C++ code's use of virtual method
> > invocations will get close). I think the current mechanism is fine.
>
> Agree. The only other major source is implementation of switch statements
> with dense ranges (jump tables).
Another optimisation I'm thinking of is
> Hmm, that differs from what the internal valgrind profiler itself tells
> us ... about 30-40% of the time starting up mozilla, on memcheck, is
> translation. That also corresponds (non-quantitatively) with the speed
> variations in the lines printed by -v -v when running stuff (try it).
Is there a set of benchmarks you've been using for this kind of stuff?
It would be useful to have a set for evaluating all the proposals that
have been flying around lately. It would be nice to have something less
interactive than Mozilla, for example. I'm particularly short of nice
examples of multithreaded programs which aren't all GUI.
> Disco: Running Commodity Operating systems on scalable multiprocessors.
> Bugnion, Devine, Rosenblum (at least the last of these is a founder of VMware,
> Inc). 1997 ish. I'm sure you can scag it from the web somewhere; I'd start
> at citeseer. I only have a paper copy.
Hm, I think I've read that paper, but I don't think I paid much attention
to it. I've certainly read the Cellular Disco paper.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 05:55:40
|
On Mon, 2002-11-18 at 17:18, Julian Seward wrote:
> 1. Lack of translation chaining. Has been discussed a lot already, and I
> think has reasonable probability of getting addressed at some point, since
> we're now floating detailed ideas for doing it.
I think this is major, and we should be able to tell soon: I hacked up
most of an implementation on the train this afternoon. Doesn't quite
work yet, but it looks convincing.
> 2. The inability to cache simulated regs in real ones across bb boundaries
> (they all get flushed with PUTs at the end). I think JF mentioned this
> at some point. I don't know how much effect this has.
I think this is relatively minor compared to chaining, and could be
fixed by doing some kind of trace caching.
> 3. (nobody mentioned this, but I think it is very significant): the inability
> of the code generator to consider groups of UInstrs together when
> generating code. Specifically, this has the following bad effect.
> [...]
> This is surely something worth looking into. An interesting mechanism
> to create would be one which measures the (dynamic) frequency of adjacent
> uinstr pairs, to search for common sequences (pairs, to start with)
> which would be worth putting special code-generation effort into.
> Of course the results would depend on what instrumentation had been
> added by the skin.
I'm not sure it's all that important. It would reduce the actual
instruction count and register pressure, but I'm not sure it would do
all that much for modern CPUs (the Via C3 being the exception, and it
does need all the help it can get).
That said, I've actually got a prototype of an SSA-based
codegen/CSE/constant folding pass, which could easily generate
tree-shaped output over which you could run one of the BURG-style
tree-matching code generators to make better use of the x86 instruction
set. Well, when I say prototype, I've got half a day's hacking on
Valgrind 1.0, so it's probably all junk at this point. And we'd like to
make the codegen faster rather than slower...
A much simpler and quite useful extension would be to allow LOAD and
STORE to have a constant offset, so at least we could use N(%reg)
addressing mode for a lot of memory operations. It would also be
interesting to see what we could do with instruction scheduling to put a
bit of space between our operations; trace caching would make this more
interesting.
Another optimisation I've been thinking about is using our dynamic
knowledge to make more of the indirect jumps direct. For example, we
know that the code is being generated immediately before it gets run, so
for example, if we see a virtual method call:
call *4(%eax)
We can dereference 4(%eax) at compile time (0x08012345) to see what
the likely target is, and generate:
movl 4(%EAX), %t1
cmpl $0x08012345, %t1
jnz 2f
1: call 0x08012345
3: ...rest of basic block...
2: call *(%t1)
jmp 3b
As cumbersome as this seems, it should actually be faster than the
simple indirect call if our original guess was reasonably accurate.
Since the accepted statistic is that most dynamic method calls in fact
almost always end up at the same target, it's probably worthwhile.
J
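The guarded direct call above is essentially a monomorphic inline cache, and its control flow can be modelled in a few lines. Everything here (slot, targets, the addresses) is invented to illustrate the idea, not generated x86:

```python
def compile_indirect_call(guess):
    """Build the guarded call: `guess` is what 4(%eax) held at
    translation time. At run time, if the call slot still holds
    `guess` we take the direct path; otherwise we fall back to a
    plain indirect call. `slot` models the memory word holding the
    target, `targets` maps address -> callable. Sketch only."""
    def do_call(slot, targets):
        target = slot[0]                          # runtime load of 4(%eax)
        if target == guess:                       # the cmpl matched: fast path
            return ('direct', targets[guess]())
        return ('indirect', targets[target]())    # miss: generic dispatch
    return do_call

# hypothetical: the translation-time guess was 0x08012345
targets = {0x08012345: lambda: 'A::method', 0x0900beef: lambda: 'B::method'}
call = compile_indirect_call(0x08012345)
```

As long as the guess holds (the common case for virtual calls), the expensive indirect dispatch is skipped entirely.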
|
|
From: Julian S. <js...@ac...> - 2002-11-19 01:12:01
|
On Monday 18 November 2002 6:34 pm, Jeremy Fitzhardinge wrote:
> On Mon, 2002-11-18 at 09:53, Nicholas Nethercote wrote:
> > The tutorial is a bit unclear, but I think the figures in the first part
> > don't account for any optimisations within the traces. So it is likely
> > underestimating the advantage of the trace cache.
>
> I'll reread it more closely.
>
> > Interestingly, the original version of Dynamo, implemented on HPPA-RISC,
> > actually sped up most programs by up to 20%. There's no clear reason why
> > the x86 version mostly slows them down; maybe it's just an
> > engineering/maturity thing.
>
> Well, there was that section on dynamic optimisation, but I don't think
> it applied to any of the earlier discussion about the basic Dynamo
> mechanism. The optimisations are really simple though; it should be
> possible to do something similar.
Some general comments re V (lack of) performance. I think there are
three areas where we lose out:
1. Lack of translation chaining. Has been discussed a lot already, and I
think has reasonable probability of getting addressed at some point, since
we're now floating detailed ideas for doing it.
2. The inability to cache simulated regs in real ones across bb boundaries
(they all get flushed with PUTs at the end). I think JF mentioned this
at some point. I don't know how much effect this has.
3. (nobody mentioned this, but I think it is very significant): the inability
of the code generator to consider groups of UInstrs together when
generating code. Specifically, this has the following bad effect.
Consider original
movl 4(%eax,%ebx), %ecx
If TempRegs ta, tb, tc contain %EAX, %EBX, %ECX respectively, this
could become
LEA2L 4(ta,tb),temp
LDL (temp), tc
and the code generator will never reassemble this back into a single
instruction. Situation is even worse with (eg) andl 4(%eax,%ebx), %ecx
since that becomes a triple, LEA2L, LD, AND, and so must forever after
that be at least 3 generated insns + an extra register use (temp) to
carry the address.
Now admittedly the original valgrind always wanted to examine the addr
before using it, but now we are in a situation where that's not necessarily
the case, if the skin is not examining addresses. And even if it is,
we could still merge the memory load/store and ALU op in the
"andl 4(%eax,%ebx), %ecx" example.
This is surely something worth looking into. An interesting mechanism
to create would be one which measures the (dynamic) frequency of adjacent
uinstr pairs, to search for common sequences (pairs, to start with)
which would be worth putting special code-generation effort into.
Of course the results would depend on what instrumentation had been
added by the skin.
J
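The LEA2L/LDL split described above could in principle be undone by a pair-matching peephole over the UCode before register allocation. A minimal sketch follows; the tuple encoding and the fused 'LDL_M' opcode are invented for illustration:

```python
def merge_addr_pairs(uinstrs):
    """Peephole over a UCode-like list: when an LEA2L's address temp
    feeds the very next LDL and is never read again, fuse the pair
    into one load carrying the full x86 addressing mode, eliminating
    both the extra insn and the temp. Sketch only."""
    out, i = [], 0
    while i < len(uinstrs):
        if (i + 1 < len(uinstrs)
                and uinstrs[i][0] == 'LEA2L' and uinstrs[i + 1][0] == 'LDL'
                and uinstrs[i + 1][1] == uinstrs[i][2]    # LDL reads the temp
                and not any(uinstrs[i][2] in u[1:] for u in uinstrs[i + 2:])):
            addr_mode, dst = uinstrs[i][1], uinstrs[i + 1][2]
            out.append(('LDL_M', addr_mode, dst))  # one insn, temp gone
            i += 2
        else:
            out.append(uinstrs[i])
            i += 1
    return out
```

The same shape of matcher would extend to the LEA2L/LD/AND triples from the andl example, feeding whichever pairs the frequency measurement shows to be hot.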
|
|
From: Julian S. <js...@ac...> - 2002-11-19 00:57:49
|
> > That seems unavoidable. We'd probably just recycle the current
> > arrangement, or perhaps a simplified version of it, in this case. Bear
> > in mind that x86 RET instructions are going to pull an original address
> > off the stack and jump to it, so there will still be a significant
> > minority of such lookups needed.
>
> It's likely that rets will be the majority of jumps with unknown
> destinations (though very OO C++ code's use of virtual method
> invocations will get close). I think the current mechanism is fine.
Agree. The only other major source is implementation of switch statements
with dense ranges (jump tables).
> > Yes, there's a tradeoff between complexity of the cache management and
> > expense of translation. Since we've got a hugely expensive translator
> > by JIT standards (10s of 000s of cycles / byte of generated code) we
> > have to have a more complex arrangement designed to minimise the miss
> > rate.
>
> My profiling doesn't show more than about 5% of time being spent in
> codegen when there's a non-trivial skin running. Unfortunately oprofile
> doesn't account for time spent in generated code, so it's hard to tell
> what the time for code generation vs. time running generated code ratio
> is.
Hmm, that differs from what the internal valgrind profiler itself tells
us ... about 30-40% of the time starting up mozilla, on memcheck, is
translation. That also corresponds (non-quantitatively) with the speed
variations in the lines printed by -v -v when running stuff (try it).
> > Yes, known trick. I think VMware probably does this too (it's a light
> > weight JIT, in effect). (How VMware works is really described in an
> > earlier paper from Stanford; I can dig it out if anyone wants).
>
> Yes please.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors.
Bugnion, Devine, Rosenblum (at least the last of these is a founder of
VMware, Inc). 1997 ish. I'm sure you can scag it from the web somewhere;
I'd start at citeseer. I only have a paper copy.
This says not a word about VMware, but it describes how to implement
basically the same thing on a MIPS-based multiprocessor, and IMO clearly
contains all the important ideas underlying VMware. Most clearly in
Section 4.
J
|