|
From: Nicholas N. <nj...@ca...> - 2002-11-18 16:03:52
|
Hi,

I was just reading about DynamoRIO, a "runtime optimisation and
introspection" tool I recently discovered that is awfully similar to
Valgrind.  Basically, it does the same thing; their API looks rather
familiar:

    dynamorio_init();
    dynamorio_exit();
    dynamorio_basic_block(void *drcontext, app_pc tag, InstrList *bb);
    dynamorio_fragment_deleted(...);

etc.  One large difference is that it works at the x86 instruction level,
which means it's probably a better tool for low-level stuff such as
runtime optimisation.  There are also lots of functions for poking at
instructions, eg. to determine if an instruction's operands touch memory,
so it might feasibly be used for complicated skins like MemCheck and
Cachegrind.

It works under Windows NT/2000/XP and Linux.  There are various
limitations, and it looks a little less robust than Valgrind under Linux,
but it ran Konqueror fine when I tried it.  Its performance is better than
Valgrind's; without instrumentation, programs under Linux run about
1.0--2.0 times slower, whereas with Valgrind it's more like 5 times slower.

It's at www.cag.lcs.mit.edu/dynamorio/.  The slides from the ASPLOS
tutorial are very interesting.  The whole thing was only released in June.

It clearly has some advantages over Valgrind (eg. performance, runs under
Windows).  Now I'm trying to work out what advantages Valgrind has over
it...  UCode is arguably one, but also arguably not.  Hmm.

N
|
From: Jeremy F. <je...@go...> - 2002-11-18 17:44:49
|
On Mon, 2002-11-18 at 08:03, Nicholas Nethercote wrote:
> It's performance is better than Valgrind's; without instrumentation
> programs under Linux run about 1.0--2.0 times slower, with Valgrind it's
> more like 5 times slower.
>
> It's at www.cag.lcs.mit.edu/dynamorio/.  The slides from the ASPLOS
> tutorial are very interesting.  The whole thing was only released in June.

Looks interesting.  It seems their biggest performance win was using BB
chaining.  They also implemented a trace cache, though it seems to be a
fair bit of complexity for hardly any incremental improvement.

> It clearly has some advantages over Valgrind (eg. performance, runs under
> Windows).  Now I'm trying to work out what advantages has over it...

V is under the GPL, as opposed to:

    permission to copy and use (but not distribute to others) this software
    for the sole purpose of internal evaluation is hereby granted without
    fee.

Oh, and they don't seem to have released any source...

J
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 17:51:21
|
On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > It clearly has some advantages over Valgrind (eg. performance, runs under
> > Windows).  Now I'm trying to work out what advantages has over it...
>
> V is under the GPL, as opposed to:
>
>     permission to copy and use (but not distribute to others) this software
>     for the sole purpose of internal evaluation is hereby granted without
>     fee.
>
> Oh, and they don't seem to have released any source...

Sure.  It also doesn't work with pthreaded Linux programs.  And I can't
get their example skins (which they confusingly call "clients") to work.

I was thinking more from my biased need-a-PhD-in-two-years point of view :)

N
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 17:54:13
|
Oh, another thing:

On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > It's at www.cag.lcs.mit.edu/dynamorio/.  The slides from the ASPLOS
> > tutorial are very interesting.  The whole thing was only released in June.
>
> Looks interesting.  It seems their biggest performance win was using BB
> chaining.  They also implemented a trace cache, though it seems to be a
> fair bit of complexity for hardly any incremental improvement.

The tutorial is a bit unclear, but I think the figures in the first part
don't account for any optimisations within the traces.  So it is likely
underestimating the advantage of the trace cache.

Interestingly, the original version of Dynamo, implemented on HPPA-RISC,
actually sped up most programs by up to 20%.  There's no clear reason why
the x86 version mostly slows them down; maybe it's just an
engineering/maturity thing.

N
|
From: Jeremy F. <je...@go...> - 2002-11-18 18:35:01
|
On Mon, 2002-11-18 at 09:53, Nicholas Nethercote wrote:
> The tutorial is a bit unclear, but I think the figures in the first part
> don't account for any optimisations within the traces. So it is likely
> underestimating the advantage of the trace cache.
I'll reread it more closely.
> Interestingly, the original version of Dynamo, implemented on HPPA-RISC,
> actually sped up most programs by up to 20%. There's no clear reason why
> the x86 version mostly slows them down; maybe it's just an
> engineering/maturity thing.
Well, there was that section on dynamic optimisation, but I don't think
it applied to any of the earlier discussion about the basic Dynamo
mechanism. The optimisations are really simple though; it should be
possible to do something similar.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 01:12:01
|
On Monday 18 November 2002 6:34 pm, Jeremy Fitzhardinge wrote:
> On Mon, 2002-11-18 at 09:53, Nicholas Nethercote wrote:
> > The tutorial is a bit unclear, but I think the figures in the first part
> > don't account for any optimisations within the traces. So it is likely
> > underestimating the advantage of the trace cache.
>
> I'll reread it more closely.
>
> > Interestingly, the original version of Dynamo, implemented on HPPA-RISC,
> > actually sped up most programs by up to 20%. There's no clear reason why
> > the x86 version mostly slows them down; maybe it's just an
> > engineering/maturity thing.
>
> Well, there was that section on dynamic optimisation, but I don't think
> it applied to any of the earlier discussion about the basic Dynamo
> mechanism. The optimisations are really simple though; it should be
> possible to do something similar.
Some general comments re V's (lack of) performance.  I think there are
three areas where we lose out:
1. Lack of translation chaining. Has been discussed a lot already, and I
think has reasonable probability of getting addressed at some point, since
we're now floating detailed ideas for doing it.
2. The inability to cache simulated regs in real ones across bb boundaries
(they all get flushed with PUTs at the end). I think JF mentioned this
at some point. I don't know how much effect this has.
3. (nobody mentioned this, but I think it is very significant): the inability
of the code generator to consider groups of UInstrs together when
generating code. Specifically, this has the following bad effect.
Consider original
movl 4(%eax,%ebx), %ecx
If TempRegs ta, tb, tc contain %EAX, %EBX, %ECX respectively, this
could become
LEA2L 4(ta,tb),temp
LDL (temp), tc
and the code generator will never reassemble this back into a single
instruction.  The situation is even worse with (eg) andl 4(%eax,%ebx), %ecx,
since that becomes a triple, LEA2L, LD, AND, and so must forever after
that be at least 3 generated insns + an extra register use (temp) to
carry the address.
Now admittedly the original valgrind always wanted to examine the addr
before using it, but now we are in a situation where that's not necessarily
the case, if the skin is not examining addresses. And even if it is,
we could still merge the memory load/store and ALU op in the
"andl 4(%eax,%ebx), %ecx" example.
This is surely something worth looking into. An interesting mechanism
to create would be one which measures the (dynamic) frequency of adjacent
uinstr pairs, to search for common sequences (pairs, to start with)
which would be worth putting special code-generation effort into.
Of course the results would depend on what instrumentation had been
added by the skin.
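That adjacent-pair measurement could be sketched as a standalone
simulation; a hedged Python model with made-up opcode names, not the
actual uinstr representation:

```python
from collections import Counter

def count_adjacent_pairs(uinstr_stream):
    """Count how often each ordered pair of adjacent uinstr opcodes occurs,
    so that common sequences can be given special code-generation effort."""
    pairs = Counter()
    for a, b in zip(uinstr_stream, uinstr_stream[1:]):
        pairs[(a, b)] += 1
    # Most frequent pairs first: these are the fusion candidates.
    return pairs.most_common()

# A stream containing the LEA2L/LDL split discussed above (made-up data):
stream = ["GETL", "LEA2L", "LDL", "PUTL", "GETL", "LEA2L", "LDL", "PUTL"]
```

Running this over a real uinstr trace would show whether, say,
(LEA2L, LDL) dominates enough to justify a fused code-generation case.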
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 05:55:40
|
On Mon, 2002-11-18 at 17:18, Julian Seward wrote:
> 1. Lack of translation chaining. Has been discussed a lot already, and I
> think has reasonable probability of getting addressed at some point, since
> we're now floating detailed ideas for doing it.
I think this is major, and we should be able to tell soon: I hacked up
most of an implementation on the train this afternoon. Doesn't quite
work yet, but it looks convincing.
> 2. The inability to cache simulated regs in real ones across bb boundaries
> (they all get flushed with PUTs at the end). I think JF mentioned this
> at some point. I don't know how much effect this has.
I think this is relatively minor compared to chaining, and could be
fixed by doing some kind of trace caching.
> 3. (nobody mentioned this, but I think it is very significant): the inability
> of the code generator to consider groups of UInstrs together when
> generating code. Specifically, this has the following bad effect.
> [...]
> This is surely something worth looking into. An interesting mechanism
> to create would be one which measures the (dynamic) frequency of adjacent
> uinstr pairs, to search for common sequences (pairs, to start with)
> which would be worth putting special code-generation effort into.
> Of course the results would depend on what instrumentation had been
> added by the skin.
I'm not sure it's all that important.  It would reduce the actual
instruction count and register pressure, but I'm not sure it would do
all that much for modern CPUs (the Via C3 being the exception, and it
does need all the help it can get).
That said, I've actually got a prototype of an SSA-based
codegen/CSE/constant folding pass, which could easily generate
tree-shaped output over which you could run one of the BURG-style
tree-matching code generators to make better use of the x86 instruction
set.  Well, when I say prototype, I've got half a day's hacking on
Valgrind 1.0, so it's probably all junk at this point. And we'd like to
make the codegen faster rather than slower...
A much simpler and quite useful extension would be to allow LOAD and
STORE to have a constant offset, so at least we could use N(%reg)
addressing mode for a lot of memory operations. It would also be
interesting to see what we could do with instruction scheduling to put a
bit of space between our operations; trace caching would make this more
interesting.
Another optimisation I've been thinking about is using our dynamic
knowledge to make more of the indirect jumps direct.  We know that the
code is being generated immediately before it gets run, so, for example,
if we see a virtual method call:
call *4(%eax)
We can dereference 4(%eax) at compile time (0x08012345) to see what
the likely target is, and generate:
movl 4(%EAX), %t1
cmpl $0x08012345, %t1
jnz 2f
1: call 0x08012345
3: ...rest of basic block...
2: call *%t1
jmp 3b
As cumbersome as this seems, it should actually be faster than the
simple indirect call if our original guess was reasonably accurate.
Since the accepted statistic is that dynamic method calls almost always
end up at the same target, it's probably worthwhile.
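The guess amounts to a one-entry inline cache.  A toy Python model of the
fast-path/slow-path split (the names and bookkeeping are invented for
illustration; the real thing would be generated x86):

```python
def make_inline_cache(resolve):
    """Simulate the guarded call: cache the target seen at compile time
    and take the fast (direct) path whenever the guess still holds."""
    state = {"guess": None, "fast": 0, "slow": 0}

    def call(obj):
        target = resolve(obj)           # movl 4(%eax), %t1
        if state["guess"] is None:
            state["guess"] = target     # the compile-time dereference
        if target is state["guess"]:    # cmpl $0x08012345, %t1
            state["fast"] += 1          # 1: call 0x08012345
        else:
            state["slow"] += 1          # 2: indirect call via %t1
        return target()

    return call, state
```

For a mostly-monomorphic call site the fast counter dominates, which is
why the extra compare is expected to pay off.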
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-19 10:57:36
|
On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > 3. (nobody mentioned this, but I think it is very significant): the inability
> > of the code generator to consider groups of UInstrs together when
> > generating code.
>
> I'm not sure its all that important. It would reduce the actual
> instruction count and register pressure, but I'm not sure it would do
> all that much for modern CPUs (the Via C3 being the exception, and it
> does need all the help it can get).
I think it is important, ie. the fact that Valgrind virtualises all the
registers and then loses where they were in the UCode.
Look at DynamoRIO's translation of this small basic block:
add %eax, %ecx
cmp $4, %eax
jle $0x40106f
becomes
frag7: add %eax, %ecx
cmp $4, %eax
jle $0x40106f
stub1: mov %eax, eax-slot # spill %eax
mov &dstub1, %eax # store ptr to stub table
jmp context_switch
stub2: mov %eax, eax-slot # spill %eax
mov &dstub2, %eax # store ptr to stub table
jmp context_switch
Apart from control flow instructions, the resulting code (before
instrumentation and/or trace-level optimisation) is basically the same as
the original code.
Valgrind will turn it into UCode something like this:
GETL %ECX, t14
ADDL %EAX, t14 (-wOSZACP)
PUTL t14, %ECX
INCEIPo $2
GETL %EAX, t10
SUBL $0x4, t10 (-wOSZACP)
INCEIPo $3
Jleo $0x40106f (-rOSZACP)
JMPo $0x4013F9B4
There will be some optimisations, but then each UCode instr will become
one or more x86 instructions. The example isn't so convincing, but repeat
the exercise on a basic block with 10 or 20 instructions and you'll notice
the difference more.
I feel that this must be doing a lot of damage. I don't think it's a
coincidence that the Null skin increases code size by a factor of about 5
and also slows programs by a similar factor, and that Memcheck increases
code size by a factor of about 12--13 and slows programs down by a similar
factor.
> An interesting mechanism to create would be one which measures the
> (dynamic) frequency of adjacent uinstr pairs, to search for common
> sequences (pairs, to start with) which would be worth putting special
> code-generation effort into. Of course the results would depend on what
> instrumentation had been added by the skin.
I just did that... a patch for vg_from_ucode is attached, as are the
(sorted) results for a quick spin of konqueror with --skin=none...
N
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-19 16:12:39
|
On Tue, 19 Nov 2002, Nicholas Nethercote wrote:
> I think it is important, ie. the fact that Valgrind virtualises all the
> registers and then loses where they were in the UCode.
Thinking about this some more, I think that one particular source of
slowdown is the fact that %esp is virtualised as %ESP.
This is because several very common instructions (push, pop, call, ret)
update %esp as a side-effect. For example:
pushl $0x0
becomes:
GETL %ESP, t4
SUBL $0x4, t4
PUTL t4, %ESP
MOVL $0x0, t6
STL t6, (t4)
INCEIPo $2
becomes:
movl 0x10(%ebp), %ecx
subl $0x4, %ecx
movl %ecx, 0x10(%ebp)
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
A lot of time is spent moving values to and from %ESP. I was wondering
about the possibility of "de-virtualising" %esp -- ie. store %ESP in %esp.
Then maybe the UCode translation could be as simple as "PUSHL 0x0".
I'm sure to have overlooked several bazillion complications here. For
example, %esp is used inside generated code, but I think that might be ok:
- for transferring %eflags to %EFLAGS -- just pushes then pops off the
top of the stack, should be fine
- for CCALLs -- saves caller-save regs, pushes args, clears args,
restores regs, should be fine?
There are sure to be other problems, though. And I have no idea how it
would affect the MemCheck instrumentation; already the code produced
contains redundant "PUTL tX, %ESP" instructions because MemCheck always
needs %ESP to be up-to-date.
I don't know, maybe it's totally impossible, but maybe it's not.
N
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 17:20:13
|
On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> There are sure to be other problems, though. And I have no idea how it
> would affect the MemCheck instrumentation; already the code produced
> contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> needs %ESP to be up-to-date.
Perhaps a slightly simpler approach would be to dedicate a register to
ESP rather than a memory location, so that it is persistently in a
register across basic blocks.  With any luck the register allocator
could then remove redundant moves, and turn (assuming we reserve %edi
for %ESP):
movl %edi, %ecx
subl $0x4, %ecx
movl %ecx, %edi
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
into
subl $0x4, %edi
xorl %edx, %edx
movl %edx, (%edi)
addl $2, 36(%ebp)
which, while not being push 0x0, isn't too far off.  We'll still need to
work out appropriate times to save ESP into the baseblock for when
control enters C code.
It seems to me that these optimisations we're talking about all feed
into each other, so there's a synergistic effect:
1. The reason we've got so much memory traffic to the baseBlock is
that we don't have enough registers to keep the simulated
registers in real registers while having enough working
registers for generated code
2. The reason we need lots of working registers is because we break
all the x86 instructions up, and therefore need more temps
3. Even so, we can't possibly really fit all the simulated
registers into real registers, so we're always going to be
saving them off to memory, but we'd like to keep to a minimum
4. We can't do a reasonable global register allocation to try and
keep the working set in registers as much as possible, because
we have a largely local, basic-block, view of the world (except,
perhaps, for special registers, like %esp which may well be used
in every basic block and is therefore worth keeping around).
therefore:
1. Using more compact instructions will cut down on our use of
temps, leaving more registers available for caching architecture
registers during a basic block (as well as making our
instruction footprint smaller)
2. Using trace caching will lengthen the effective size of a basic
block, and thereby amortize the register load/save over more
instructions
In other news:
I got basic block chaining working last night. I got about 25%
improvement (which is nice, but I was hoping for more) in the particular
benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
whole, the performance was pretty dismal: the native run took about 4.6
seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
chained-bb nulgrind took 60 seconds. I haven't looked into it further:
I was hoping it would be a largely CPU-bound test, but maybe it's
actually spending all its time in malloc or something.
I also haven't measured how many jumps actually get chained (statically
and dynamically). I only bother with direct jumps and calls, so all
indirects and rets still go through the dispatch loop. Also, some of
the functionality in the dispatch loop needs to be compiled into each
basic block (EIP and VG_(dispatch_ctr) update), which expands the
generated code size and probably detracts from the performance wins.
My next experiment might be to try some rudimentary trace caching by
doing as Josef suggested and having vg_to_ucode simply follow
unconditional jumps and thereby create large basic blocks (hm, better be
careful about infinite loops...). It might do nothing other than
massively increase the dynamic compiler costs...
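The infinite-loop worry is straightforward to guard against: follow
unconditional direct jumps only while the target is new and the trace is
short.  A hedged Python model (the block representation here is invented;
vg_to_ucode of course works on raw x86):

```python
def follow_jumps(blocks, start, max_len=16):
    """Extend a translation unit by following unconditional direct jumps.
    `blocks` maps an address to (instrs, jump_target); a jump_target of
    None (indirect jump, ret, or unknown) always ends the trace."""
    trace, seen, addr = [], set(), start
    while addr in blocks and len(trace) < max_len:
        if addr in seen:            # revisiting a block: stop, don't loop
            break
        seen.add(addr)
        instrs, target = blocks[addr]
        trace.extend(instrs)
        if target is None:
            break
        addr = target
    return trace
```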
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 19:03:51
|
On Tuesday 19 November 2002 5:20 pm, Jeremy Fitzhardinge wrote:
> On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> > There are sure to be other problems, though. And I have no idea how it
> > would affect the MemCheck instrumentation; already the code produced
> > contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> > needs %ESP to be up-to-date.
>
> Perhaps a slightly simpler approach would be to dedicate a register to
> ESP rather than a memory location, so that it is persistently in
> register across basic blocks. With any luck the register allocator
> could then remove redundant moves, and turn (assuming we reserve %edi
> for %ESP):
Yes, this seems a good compromise. We already played with the issue of
5 vs 6 allocatable registers and it makes very little difference at least
to memcheck, so perhaps it would be ok to give one to %ESP.
Nick, one thing you seemed to not comment on is ...
> addl $2, 36(%ebp) (the INCEIP)
ie, we keep exact track of %eip; does DynamoRIO?  What sort of
exception model are they presenting?
> I got basic block chaining working last night. I got about 25%
> improvement (which is nice, but I was hoping for more) in the particular
> benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> whole, the performance was pretty dismal: the native run took about 4.6
> seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> I was hoping it would be a largely CPU-bound test, but maybe its
> actually spending all its time in malloc or something.
Strange it's so slow. I think you should try bzip2; that is almost
completely compute bound and does something like 7 malloc calls per
file processed. Here's what I have:
time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps
> /dev/null
(valgrind startup msgs deleted)
(stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
real 0m7.760s
user 0m7.670s
sys 0m0.030s
time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
(stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
real 0m0.738s
user 0m0.680s
sys 0m0.050s
So more like an 11--12x slowdown than the 35x you get.
> My next experiment might be [...]
Let me encourage you to make measurements (direct vs indirect jump counts)
to gain insight into your current hackery, before embarking on more. I for
one would like to be assured with numbers that the Right Thing is happening
and that our assumptions about costs, event frequencies, etc, are justified.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 23:14:13
|
On Tue, 2002-11-19 at 11:10, Julian Seward wrote:
> > I got basic block chaining working last night. I got about 25%
> > improvement (which is nice, but I was hoping for more) in the particular
> > benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> > whole, the performance was pretty dismal: the native run took about 4.6
> > seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> > chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> > I was hoping it would be a largely CPU-bound test, but maybe its
> > actually spending all its time in malloc or something.
>
> Strange it's so slow. I think you should try bzip2; that is almost
> completely compute bound and does something like 7 malloc calls per
> file processed. Here's what I have:
>
> time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps
> > /dev/null
> (valgrind startup msgs deleted)
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m7.760s
> user 0m7.670s
> sys 0m0.030s
>
> time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m0.738s
> user 0m0.680s
> sys 0m0.050s
>
> So more like a 11 - 12 x slowdown than the 35 x you get.
Well, I found that there is an easy way of getting V to generate
extended basic blocks: simply allow conditional jumps to not end the
basic block (the code was all set up to do it, complete with
almost-accurate comment).  I haven't done an overall measurement of what
the average increase in BB length is, but from looking at the output of
--trace-codegen, they can get quite long.  It certainly allows us to see
what the effect of having long register lifetimes is vs.
indiscriminately over-compiling things.  The results are far from
conclusive: it seems to vary between a few percent better and a few
percent worse (and it also seems to depend on the skin, though I haven't
really tested that yet).
These are the results I'm seeing:
baseline: bzip2 < TAGS > /dev/null
time=0.49s
valgrind --skin=none --chain-bb=no --extended-bb=no bzip2 < TAGS > /dev/null
time=6.15s ratio:12.5
valgrind --skin=none --chain-bb=yes --extended-bb=no bzip2 < TAGS > /dev/null
time=4.95s ratio:10.1
valgrind --skin=none --chain-bb=no --extended-bb=yes bzip2 < TAGS > /dev/null
time=6.17s ratio:12.5
valgrind --skin=none --chain-bb=yes --extended-bb=yes bzip2 < TAGS > /dev/null
time=5.48s ratio:11.1
baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i \
-quiet -dumpbase x.i -O2 -version -o /dev/null
time=4.44s
valgrind --skin=none --chain-bb=no --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=79.88s ratio:17.9
valgrind --skin=none --chain-bb=yes --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=59.14s ratio:13.3
valgrind --skin=none --chain-bb=no --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=73.77s ratio:16.6
valgrind --skin=none --chain-bb=yes --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=58.16s ratio:13.0
> > My next experiment might be [...]
>
> Let me encourage you to make measurements (direct vs indirect jump counts)
> to gain insight into your current hackery, before embarking on more. I for
> one would like to be assured with numbers that the Right Thing is happening
> and that our assumptions about costs, event frequencies, etc, are justified.
Oh, yes, I quite agree. The results above make me think that doing
naive trace caching probably won't help, but eliminating dispatch-loop
calls probably will.
I've put the patch up in the usual place.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 12:23:57
|
On 19 Nov 2002, Jeremy Fitzhardinge wrote:
> Well, I found that there is an easy way of getting V to generate
> extended basic blocks: simply allow conditional jumps to not end the
> basic block (they code was all set up to do it, complete with
> almost-accurate comment).  I haven't done on overall measurement of what
> the average increase in BB length is, but from looking a the output of
> --trace-codegen, they can get quite long.  It certainly allows us to see
> what the effect of having long register lifetimes is vs.
> indescriminately over-compiling things.  The results are far from
> conclusive: it seems to vary between a few percent better to a few
> percent worse (and it also seems to depend on the skin, though I haven't
> really tested that yet).

Does this mean that the fall-through JMP back to the dispatcher is
replaced by the following basic block?  ie. you're allowing multiple exits
from a basic block?

ie.

    [code]
    Jcc X
    JMP 0xY

becomes

    [code]
    Jcc ...
    [code for block at 0xY]

?

Interesting that this doesn't make a definite improvement.

----

Another way to extend basic blocks is to inline direct calls in the same
way, eg:

    [code]
    JMP-c 0xX

becomes

    [code]
    [code for block at 0xX]

Problem with that is that it screws up function-entry detection for skins
-- currently to detect function entry they can call
VG_(get_fnname_if_entry)() at the start of a basic block to determine if
it's the first basic block in a function.  With this optimisation they'd
have to do it for every x86 instruction, urgh.

----

Similarly, direct JMPs seem pretty common, perhaps they too could be
inlined:

    [code]
    JMP 0xY

becomes

    [code]
    [code for block at 0xY]

----

This sort of inlining seems nicer than chaining.  Would it be hard?  Is
it any different in principle to what Jeremy's already implemented?

N
|
From: Josef W. <Jos...@gm...> - 2002-11-20 15:20:22
|
On Wednesday 20 November 2002 13:23, Nicholas Nethercote wrote:
> ...
> Another way to extend basic blocks is to inline direct calls in the same
> way, eg:
>
> [code]
> JMP-c 0xX
>
> becomes
>
> [code]
> [code for block at 0xX]
>
> Problem with that is that it screws up function-entry detection for skins
> -- currently to detect function entry they can call
> VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> it's the first basic block in a function.  With this optimisation they'd
> have to do it for every x86 instruction, urgh.

Is this really a problem?

I thought VG_(get_fnname_if_entry)() is to be called at instrumentation
time?  And BB chaining should be done *after* skin instrumentation?

If a skin is "self-contained" in the instrumentations it's doing, there
should be no problem at all with this fancy BB chaining, trace cache etc.

Example: In my calltree patch, I add a helper function when I detect a
CALL UInstr.  Thus, the skin instrumentation has to modify the UInstrs of
a BB *before* any chaining is done.

But afterwards, the UInstrs of the BB contain the call to my helper,
together with the jump target as argument.  There's no problem with
getting rid of the CALL UInstr itself for BB chaining.

Josef
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 15:28:03
|
On Wed, 20 Nov 2002, Josef Weidendorfer wrote:
> > Another way to extend basic blocks is to inline direct calls in the same
> > way, eg:
> >
> > [code]
> > JMP-c 0xX
> >
> > becomes
> >
> > [code]
> > [code for block at 0xX]
> >
> > Problem with that is that it screws up function-entry detection for skins
> > -- currently to detect function entry they can call
> > VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> > it's the first basic block in a function.  With this optimisation they'd
> > have to do it for every x86 instruction, urgh.
>
> Is this really a problem?
> I thought VG_(get_fnname_if_entry)() is to be called at instrumentation time?
> And BB chaining should be done *after* skin instrumentation?

I'm talking about merging together more than one basic block into a
larger basic block.  In this new larger basic block, the instructions
from the start of the called function will appear in the middle of the
basic block.  Does that make the problem clearer?

N
|
From: Jeremy F. <je...@go...> - 2002-11-20 16:46:31
|
On Wed, 2002-11-20 at 04:23, Nicholas Nethercote wrote:
> Does this mean that the fall-through JMP back to the dispatcher is
> replaced by the following basic block? ie. you're allowing multiple exits
> from a basic block?
>
> ie.
>
> [code]
> Jcc X
> JMP 0xY
>
> becomes
>
> [code]
> Jcc ...
> [code for block at 0xY]
>
> ?
>
> Interesting that this doesn't make a definite improvement.
That in itself should be a definite improvement. The problem is that
it's going to increase the rate of redundant translations.  If you have
the extended basic block:
1: ...
jnz 3
2: ...
jeq 2
3: ...
jz 5
4: ...
5: ...
jmp 4
Then you're going to end up translating some instructions 5 times.
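A toy model makes the over-compilation cost concrete: if every entry
label starts an extended BB that runs to the end of the region, the last
block is retranslated once per entry.  This is a deliberately crude
Python sketch, not how V actually counts translations:

```python
def translation_counts(region, entries):
    """Count how many times each block of `region` (a list of block ids,
    in layout order) gets translated, assuming each entry point produces
    an extended BB running from that block to the end of the region."""
    counts = {block: 0 for block in region}
    for entry in entries:
        for block in region[region.index(entry):]:
            counts[block] += 1
    return counts
```

For the five-label schematic above, translation_counts([1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]) gives the final block a count of 5, matching the
"translated 5 times" estimate.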
> Another way to extend basic blocks is to inline direct calls in the same
> way, eg:
>
> [code]
> JMP-c 0xX
>
> becomes
>
> [code]
> [code for block at 0xX]
Right. So you'd want to do some branch-weight measurements first to see
which way around it should go.
> Problem with that is that it screws up function-entry detection for skins
> -- currently to detect function entry they can call
> VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> it's the first basic block in a function. With this optimisation they'd
> have to do it for every x86 instruction, urgh.
To do this, you'd need a SETEIP UInstr to keep the EIP up to date, even
if there isn't a jump. Even if you weren't keeping EIP up to date, you
could have a UInstr for representing the start of a basic block.
> Similarly direct JMPs seem pretty common, perhaps they too could be
> inlined:
>
> [code]
> JMP 0xY
>
> becomes
>
> [code]
> [code for block at 0xY]
>
> ----
>
> This sort of inlining seems nicer than chaining. Would it be hard?
This is what I've been referring to as trace caching. The problem is
that it's hard to know when to stop, as a tradeoff between runtime
efficiency and repeated compilation. For example, this is the schematic
of a switch in an infinite loop:
1: cmp ...
jeq 5
cmp ...
jle 4
cmp ...
jgt 3
cmp ...
jne 2
...
jmp 6
2: ...
jmp 6
3: ...
jmp 6
4: ...
jmp 6
5: ...
6: ...
jmp 1
It seems to me that you're going to have to work very hard to amortise
the cost of doing all that compilation.
Dynamo does this, but only after running the code for a while to see
that it's worth the cost of extra compilation. We don't have the
machinery to do "regenerate after instrumenting it for a while", and
that's where the complexity in this idea is. The actual following-jmps
is easy.
> Is it any different in principle to what Jeremy's already implemented?
Chaining is almost as efficient as this, but it has the advantage of not
incurring as much compile-time overhead and only incremental runtime
overhead.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 10:09:02
|
On Tue, 19 Nov 2002, Julian Seward wrote:

> > Perhaps a slightly simpler approach would be to dedicate a register to
> > ESP rather than a memory location, so that it is persistently in
> > register across basic blocks.
> > [...]
> > Will still need to work out appropriate times to save ESP into the
> > baseblock for when control enters C code.
>
> Yes, this seems a good compromise. We already played with the issue of
> 5 vs 6 allocatable registers and it makes very little difference at least
> to memcheck, so perhaps it would be ok to give one to %ESP.

I agree in principle. One question -- how would this affect the UCode?
Would it look the same (eg. with all the %ESP PUTs and GETs), with the
code generator combining multiple UInstrs into single x86 instrs? Or
would the UCode look different? If the latter, how would this interact
with MemCheck's instrumentation?

It feels to me like this would be a good thing to keep within the code
generator if possible, so that %ESP is not treated any differently from
a UCode point of view, which makes life easier for skins.

As for saving/restoring %ESP on C code entry/exit, that would make each
CCALL two instructions more expensive. For skins doing lots of CCALLs
(eg. cachegrind) that might be a loss -- the extra CCALL expense could
outweigh the %ESP savings. Not sure. I suppose you could have a skin
"lots_of_CCALLs" `need' that switches between the two approaches, but
that would be a pain.

> > 2. Using trace caching will lengthen the effective size of a basic
> > block, and thereby amortize the register load/save over more
> > instructions

A related point: I've noticed that for skins inserting CCALLs, often a
register is set and not touched for quite some time -- its live range
is big. If the register is a caller-save one (%e[acd]x) and one or
more CCALLs occur between the set and use points, it would be better to
spill it. For example:

   set %eax
   ... code not using %eax, with multiple CCALLs ...
   use %eax

For each CCALL, %eax has to be saved/restored. If there are 2 or more
CCALLs between the %eax set and use, it would be better to set %eax and
then immediately spill it, and then unspill it immediately before use.
I'm not even sure if this happens a lot, but if it does, maybe
something could be done in register allocation to avoid it.

----

> Nick, one thing you seemed to not comment on is ...
>
> >        addl $2, 36(%ebp)   (the INCEIP)
>
> ie, we keep exact track of %eip; does DynamoRIO ? What sort of
> exception model are they presenting?

I'm not sure. They have a "VPC" which I assume is a "virtual PC" in
some of their diagrams. There's lots of detail in one of the papers
about handling unusual control flow in Windows (callbacks, asynchronous
procedure calls, exceptions, setjmp/longjmp) but I don't understand
that stuff much at all. I'm not much use, sorry.

N
|
|
From: Jeremy F. <je...@go...> - 2002-11-20 17:05:25
|
On Wed, 2002-11-20 at 02:08, Nicholas Nethercote wrote:
> I agree in principle. One question -- how would this affect the UCode?
> Would it look the same (eg. with all the %ESP PUTs and GETs) and then the
> code generator would combine multiple UInstrs into single x86 instrs? Or
> would the UCode look different.
I would definitely go with making it a special case of GET/PUT (ie, it
looks the same to the skins, and only register allocation/codegen cares
about the difference).
> As for saving/restoring %ESP on C code entry/exit, that would make each
> CCALL two instructions more expensive. For skins doing lots of CCALLs
> (eg. cachegrind) that might be a loss -- the CCALL extra expense could
> outweigh the %ESP savings. Not sure. I suppose you could have a skin
> "lots_of_CCALLs" `need' that switches between the two approaches, but that
> would be a pain.
Cachegrind does a CCALL for every original instruction? In that case it
might get a bit slow. One way to fix this is to consider that ESP has
one of two homes: a set register and VGOFF_(m_esp)(%ebp). So long as we
guarantee that esp is in the right place at the end of a basic block, we
can do anything we like within it. So we can apply a lazy move strategy
to it, so if we save it in the baseBlock for a CCALL, we need only
retrieve it again if it gets modified or we're leaving the basic block.
> A related point: I've noticed that for skins inserting CCALLs, often a
> register is set and not touched for quite some time -- its live range is
> big. If the register is a caller-save one (%e[acd]x) and one or more
> CCALLs occur between the set and use points, it would be better to spill
> it.
>
> For example:
> ...
> For each CCALL, %eax has to be saved/restored. If there are 2 or more
> CCALLs between %eax set and use, it would be better to set %eax and then
> immediately spill it, and then unspill it immediately before use.
Even better: If you've gone to the effort of spilling/saving on the
first ccall, then don't restore it until it's actually needed. That is,
rather than pushing the live regs, just save them back into baseBlock,
and reload them when they're needed.
A harder problem is deciding when to do that in the absence of CCALLs,
since a long-life unused register is just causing pressure.
> I'm not sure. They have a "VPC" which I assume is a "virtual PC" in some
> of their diagrams. There's lots of detail in one of the papers about
> handling unusual control flow in Windows (callbacks, asynchronous
> procedure calls, exceptions, setjmp/longjmp) but I don't understand that
> stuff much at all. I'm not much use, sorry.
Last night I convinced myself that it is necessary to maintain the
virtual EIP while running, but now I can't think why. I think it can
always be computed on demand rather than needing to be maintained.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-20 09:01:58
|
[... numbers re translation chaining and trace caching ...]

> Oh, yes, I quite agree. The results above make me think that doing
> naive trace caching probably won't help, but eliminating dispatch-loop
> calls probably will.

Good work. It's pretty clear that t-chaining is worth having. Not sure
about the others.

I'd still like to know why bzip2 runs at approx a 10x slowdown: if the
translations are chained, there are no callouts to helpers with
--skin=none. So the only thing happening is jumping between
translations. In which case I'd expect a 10x code size increase, but
it's more like 4:1. So what's going on, I wonder?

It might be helpful to do this benchmarking with a simple
microbenchmark with only a couple of hot bbs in the inner loop and no
I/O, to factor out those effects.

-----------

Another obvious lemon is the INCEIP nonsense (of my own devising :)
For every insn executed, except for the last in each bb, there is a
corresponding INCEIP, which becomes something like
addl $insn_size, 36(%ebp). That's probably expensive; it's 3 microops
for modern CPUs (load-op-store). All because we need an up-to-date
%EIP in some very rare circumstances: when taking a snapshot of the
stack that might conceivably get passed to the user, and when
delivering signals. These are relatively rare. I wonder if we could
dispense with the INCEIPs and instead associate with each bb a small
table of offsets which indicates how to calculate the simulated %EIP
from the value it was set to at the start of the block and the distance
the real %eip currently is inside the translation. If you see what I
mean.

Easy to test the net effect: disable %EIP generation altogether
(one-liner in vg_to_ucode.c). I bet it gives another 10% or so. I
would try it now but I have to rush off and duke it out with ARM code
all day :-)

J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 09:23:52
|
On Wed, 20 Nov 2002, Julian Seward wrote:

> Another obvious lemon is the INCEIP nonsense (of my own devising :)
> For every insn executed, except for the last in each bb, there is a
> corresponding INCEIP, which becomes something like
> addl $insn_size, 36(%ebp). That's probably expensive; it's 3 microops
> for modern CPUs (load-op-store). All because we need an up-to-date
> %EIP in some very rare circumstances: when taking a snapshot of the
> stack that might conceivably get passed to the user, and when
> delivering signals. These are relatively rare. I wonder if we could
> dispense with the eip and instead associate with each bb a small table
> of offsets which indicate how to calculate the simulated %EIP from the
> value it was set at at the start of the block and the current distance
> that the real %eip is inside the translation now. If you see what I
> mean.

I was fiddling with the redundant-INCEIP-removal code yesterday (which
is currently commented out). But this table-of-EIP-offsets sounds like
a much better idea. (And because of my interface changes, it should be
possible to add an extra field to UCodeBlock without breaking binary
compatibility :)

> Easy to test the net effect; disable %EIP generation altogether
> (one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
> I would try it now but I have to rush off and duke it out with ARM
> code all day :-)

I did this yesterday. For gzip with --skin=none, the time difference
was smallish (16.95s --> 16.13s, about 5%) but the code size difference
was big (4.6x --> 3.6x, about 22%). For MemCheck and Cachegrind,
INCEIP instructions account for about 10% and 12% of code size
respectively. I think the space savings are representative for a wide
range of programs.

So, to summarise: INCEIP removal via a table of offsets shouldn't be
too hard (I think?) and would definitely be worth it even if only for
the space savings.

N
|
|
From: Jeremy F. <je...@go...> - 2002-11-20 09:57:11
|
On Wed, 2002-11-20 at 01:08, Julian Seward wrote:
> I'd still like to know why bzip2 runs at approx a 10x slowdown, if the
> translations are chained, there are no callouts to helpers with
> --skin=none. So the only thing happening is jumping between translations.
> In which case I'd expect a 10 x code size increase, but it's more like
> 4:1. So what's going on, I wonder?
Well, any indirect jumps (which includes calls into dynamic libraries
and returns) are still going through the dispatch loop. I think I've
worked out how to extend the chaining mechanism to handle indirect jumps
(though I think returns are still likely to be tricky, unless you've got
functions with a limited number of call-sites).
> It might be helpful to do this benchmarking with a simple microbenchmark
> with only a couple of hot bbs in the inner loop and no I/O, to factor out
> those effects.
I'm measuring user time, so it should factor out time waiting for I/O
and time spent in the kernel.
> Easy to test the net effect; disable %EIP generation altogether
> (one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
> I would try it now but I have to rush off and duke it out with ARM
> code all day :-)
Good guess - it gets 20% on bzip2, 10% on cc1 (that's still keeping EIP
up to date once per basic block). I think you're right about computing
the EIP as we need it rather than keeping it up to date. I guess we
could tack a table onto the generated code itself or something.
baseline: bzip2 < TAGS > /dev/null
time=0.49s
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=yes bzip2 < TAGS > /dev/null
time=6.09s ratio:12.4
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=yes bzip2 < TAGS > /dev/null
time=4.85s ratio:9.8
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=no bzip2 < TAGS > /dev/null
time=5.29s ratio:10.7
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=no bzip2 < TAGS > /dev/null
time=4.04s ratio:8.2
baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=4.47s
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=80.11s ratio:17.9
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=58.96s ratio:13.1
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=74.03s ratio:16.5
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=53.51s ratio:11.9
Bedtime.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 19:08:35
|
> > 1. Lack of translation chaining. Has been discussed a lot already, and I
> > think has reasonable probability of getting addressed at some point,
> > since we're now floating detailed ideas for doing it.
>
> I think this is major, and we should be able to tell soon: I hacked up
> most of an implementation on the train this afternoon. Doesn't quite
> work yet, but it looks convincing.

Cool! Keep on hacking. I'd be very interested to hear the outcome of
this.

> > 2. The inability to cache simulated regs in real ones across bb
> > boundaries (they all get flushed with PUTs at the end). I think JF
> > mentioned this at some point. I don't know how much effect this has.
>
> I think this is relatively minor compared to chaining,

Possibly true. I think trace caching would break cachegrind, which
makes some assumptions about uniqueness of bb translations (or
something? Nick? but that can't be right, because it copes with
multiple tail translations of bbs, as we discussed once).
Unfortunately there is no easy way I can see to assess the performance
loss from this inability to hold values in regs for long.

> > 3. (nobody mentioned this, but I think it is very significant): the
> > inability of the code generator to consider groups of UInstrs together
> > when generating code. Specifically, this has the following bad effect.
> > [...]
> > This is surely something worth looking into. An interesting mechanism
> > to create would be one which measures the (dynamic) frequency of
> > adjacent uinstr pairs, to search for common sequences (pairs, to start
> > with) which would be worth putting special code-generation effort into.
> > Of course the results would depend on what instrumentation had been
> > added by the skin.
>
> I'm not sure it's all that important. It would reduce the actual
> instruction count and register pressure, but I'm not sure it would do
> all that much for modern CPUs (the Via C3 being the exception, and it
> does need all the help it can get).

Um, yes, I guess because it merely produces insns which will turn
mostly into single micro-ops in the P6/P7/K6/K7 cores anyway, yes? I'd
forgotten about that. What it does mean is that the translated code is
more likely to stall due to lack of insn decoder bandwidth, compared to
"normal" code.

J
|