From: Jeremy F. <je...@go...> - 2002-11-19 17:20:13
On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> There are sure to be other problems, though. And I have no idea how it
> would affect the MemCheck instrumentation; already the code produced
> contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> needs %ESP to be up-to-date.
Perhaps a slightly simpler approach would be to dedicate a register to
%ESP rather than a memory location, so that it is persistently in a
register across basic blocks. With any luck the register allocator
could then remove the redundant moves, and turn (assuming we reserve
%edi for %ESP):
movl %edi, %ecx
subl $0x4, %ecx
movl %ecx, %edi
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
into
subl $0x4, %edi
xorl %edx, %edx
movl %edx, (%edi)
addl $2, 36(%ebp)
which, while not quite "pushl $0x0", isn't too far off. We will still
need to work out the appropriate points at which to save %ESP back into
the baseBlock for when control enters C code.
It seems to me that these optimisations we're talking about all feed
into each other, so there's a synergistic effect:
1. The reason we've got so much memory traffic to the baseBlock is
that we don't have enough registers to keep the simulated
registers in real registers while having enough working
registers for generated code
2. The reason we need lots of working registers is because we break
all the x86 instructions up, and therefore need more temps
3. Even so, we can't realistically fit all the simulated
registers into real registers, so we're always going to be
saving some of them off to memory, but we'd like to keep that
to a minimum
4. We can't do a reasonable global register allocation to try and
keep the working set in registers as much as possible, because
we have a largely local, basic-block, view of the world (except,
perhaps, for special registers, like %esp which may well be used
in every basic block and is therefore worth keeping around).
therefore:
1. Using more compact instructions will cut down on our use of
temps, leaving more registers available for caching architecture
registers during a basic block (as well as making our
instruction footprint smaller)
2. Using trace caching will lengthen the effective size of a basic
block, and thereby amortize the register load/save over more
instructions
In other news:
I got basic block chaining working last night. I got about 25%
improvement (which is nice, but I was hoping for more) in the particular
benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
whole, the performance was pretty dismal: the native run took about 4.6
seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
chained-bb nulgrind took 60 seconds. I haven't looked into it further:
I was hoping it would be a largely CPU-bound test, but maybe it's
actually spending all its time in malloc or something.
I also haven't measured how many jumps actually get chained (statically
and dynamically). I only bother with direct jumps and calls, so all
indirects and rets still go through the dispatch loop. Also, some of
the functionality in the dispatch loop needs to be compiled into each
basic block (EIP and VG_(dispatch_ctr) update), which expands the
generated code size and probably detracts from the performance wins.
My next experiment might be to try some rudimentary trace caching by
doing as Josef suggested and having vg_to_ucode simply follow
unconditional jumps and thereby create large basic blocks (hm, better be
careful about infinite loops...). It might do nothing other than
massively increase the dynamic compiler costs...
J