From: Julian S. <js...@ac...> - 2002-11-22 05:07:01
So, I've read all of today's email messages, I think.
Nick, it looks like you've picked up some good issues in today's hackery.
Let me try and re-draw the picture in light of them.
We're trying to fix 3 performance-sapping problems:
1. too many INCEIPs
2. too many eflags save/restores
3. too many fpu save/restores
3 is pretty much solved already by Jeremy's work. 1 and 2 are still open.
1. INCEIPs
~~~~~~~~~~~
The best way to handle the INCEIP expense is to replace it all with a
table lookup. As Jeremy suggests, it's probably simpler to have one
big table than one table per bb, and I'm probably inclined in favour
of that.
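To make the one-big-table idea concrete, here is a rough sketch in C. It assumes the table is kept sorted by translated-code address and that each entry records where the code for one original instruction begins; EipMapEntry and lookup_guest_eip are made-up names for illustration, not anything that exists in the tree. It only shows the lookup side and says nothing about how the caller obtains the real %eip in the first place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: the translated-code address at which the
   code for one original instruction starts, plus that instruction's
   original %eip. */
typedef struct {
    uint32_t host_addr;
    uint32_t guest_eip;
} EipMapEntry;

/* Binary search over a table sorted ascending by host_addr.
   Returns the guest %eip of the entry with the largest
   host_addr <= host_pc, or 0 if host_pc precedes every entry. */
static uint32_t lookup_guest_eip(const EipMapEntry *tab, size_t n,
                                 uint32_t host_pc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].host_addr <= host_pc)
            lo = mid + 1;   /* entry is a candidate; look further right */
        else
            hi = mid;       /* entry starts after host_pc; look left */
    }
    /* lo is now the count of entries with host_addr <= host_pc. */
    return lo == 0 ? 0 : tab[lo - 1].guest_eip;
}
```

The lookup costs O(log n) per query, which would only be paid when a helper actually needs %eip (eg to report an error) rather than on every simulated instruction.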
The problem appears to be how to get hold of %eip (the real one) when
a skin calls a helper function, in a way which is not skin-specific.
I thought about this this evening for some considerable time, but I
can't think of a clean solution which doesn't involve some amount of
performance loss. The least-worst thing I thought of was to associate
with CCALL uinstrs a boolean flag, which when set automagically causes
the current %eip to get passed to the called function as an extra
parameter. This at least makes it skin-independent.
Unfortunately, it adds extra expense for almost all of the calls made
by cachegrind, memcheck and addrcheck, since most of those can potentially
report an error and so need %eip. This is not good. So I don't like
this idea.
2. eflags save/restores
~~~~~~~~~~~~~~~~~~~~~~~~
The second thing Nick has unearthed is that the lazy eflag optimisation
will often not apply, due to intervening flag-mangling operations.
[Nick: yes, your understanding of how this is supposed to work appears
to be correct]. This is bad; but I'd like to do better, because we're
really getting clobbered by these pushf/popf insns.
So here's ...
Yet another proposal
~~~~~~~~~~~~~~~~~~~~
If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
should back off from the %EIP-by-tables story; in effect fall back to
Variant (1) in this morning's proposal. That means retaining INCEIP, at
least when we can't get rid of redundant updates.
I've realised that perhaps INCEIP isn't a clever semantics. Problem is
that (1) it turns into (eg) addl $2, 36(%ebp), which means a load, ALU op,
and store, and (2) it trashes the flags.
[S1: suggestion of PUTEIP]
How about if we had instead PUTEIP, which simply sets the value of %EIP
to some literal value? So, we just emit PUTEIPs with absolute addresses
where currently we create INCEIPs with deltas. PUTEIP then becomes
something like
movl $0x8048517, 36(%ebp)
which is a longer insn (7 bytes?), but is still a fastpath decode at least
on Athlon. It has the advantage of not requiring a load or an ALU op, so is
less resource-hungry. And best of all, it is flag-unaffecting, so that we
can now do the lazy eflags stuff without getting messed up by them.
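As a sanity check on the size, here is a sketch of what an emitter for PUTEIP might look like. emit_puteip is a hypothetical name and this is not the existing code generator; it just writes out the standard x86 encoding of "movl $imm32, disp8(%ebp)" (opcode C7 /0, ModRM 0x45, one displacement byte, four immediate bytes), assuming the %EIP slot is within disp8 range of %ebp as the 36(%ebp) example is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emitter: writes "movl $eip, disp(%ebp)" into buf and
   returns the number of bytes written. */
static size_t emit_puteip(uint8_t *buf, int32_t disp, uint32_t eip)
{
    size_t n = 0;
    assert(disp >= -128 && disp <= 127);   /* must fit in a disp8 */
    buf[n++] = 0xC7;                       /* MOV r/m32, imm32 (C7 /0) */
    buf[n++] = 0x45;                       /* ModRM: mod=01, reg=000, rm=101 (%ebp) */
    buf[n++] = (uint8_t)disp;              /* 8-bit displacement */
    buf[n++] = (uint8_t)(eip & 0xFF);      /* imm32, little-endian */
    buf[n++] = (uint8_t)((eip >> 8) & 0xFF);
    buf[n++] = (uint8_t)((eip >> 16) & 0xFF);
    buf[n++] = (uint8_t)((eip >> 24) & 0xFF);
    return n;
}
```

With disp = 36 and eip = 0x8048517 this produces C7 45 24 17 85 04 08, so the 7-byte guess holds for the disp8 form.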
[S2: suggestion of smarter code for Jz et al]
Another thing which would surely help is to generate smarter code for
the Jz / Jnz UINSTRS. Observation to be made here is that at least for
the Zero flag, we might as well directly test the bit in %EFLAGS in
memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
%eflags: [note; this illustration is without t-chaining, but is equally
valid in t-chaining's presence]
   Jzo $addr
-->
   testl $(1<<N), 32(%ebp)   -- 32(%ebp) is %EFLAGS
                             -- and N is the offset of the Z flag in
                             -- %EFLAGS
   jz-8  %eip+6
   movl  $addr, %eax
   ret
  %eip+6:
in contrast to the current scheme
   pushl 32(%ebp)
   popfl
   jnz-8 %eip+6
   movl  $addr, %eax
   ret
  %eip+6:
In fact the new scheme not only avoids the horrible popfl; it also
saves an insn!
Some complicated tests (not-below, etc) would still have to be done the
old way, but we could cover the tests associated with single flags
(zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
catch most _uses_ of the condition codes.
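To spell out which tests reduce to a single bit, here is a small C model of the decision the testl-based sequence would make. The bit positions are the architectural x86 ones (carry is bit 0, zero is bit 6, sign is bit 7); the function names are invented for the illustration and the whole thing is a model of the logic, not generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions of the single-flag conditions in %EFLAGS. */
#define FLAG_C (1u << 0)   /* carry */
#define FLAG_Z (1u << 6)   /* zero  */
#define FLAG_S (1u << 7)   /* sign  */

/* Models "testl $mask, 32(%ebp)": nonzero iff the simulated flag is set.
   No pushl/popfl is needed because the flag word is read in memory. */
static int flag_is_set(uint32_t eflags_in_memory, uint32_t mask)
{
    return (eflags_in_memory & mask) != 0;
}

/* Models a jump-if-flag-set uinstr: returns the target when the flag is
   set, otherwise the fall-through address. Hypothetical helper. */
static uint32_t jump_if_set(uint32_t eflags_in_memory, uint32_t mask,
                            uint32_t target, uint32_t fallthrough)
{
    return flag_is_set(eflags_in_memory, mask) ? target : fallthrough;
}
```

The negated forms (nonzero, nonsign, nocarry) are the same test with the two outcomes swapped. Conditions such as not-below-or-equal or less-than combine several flags, so they do not fit this pattern and would stay on the pushl/popfl path.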
[S3: suggestion of uinstr reordering]
Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
created by memcheck and cachegrind will often get in the way of the lazy
eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
specifically) to generate instrumentation a bit more cleverly so as to try
and keep condition code generators and users together more often. Dunno if
this will help.
J