|
From: Julian S. <js...@ac...> - 2002-11-20 23:45:05
|
This is the second of two messages about t-chaining. Get a coffee
before reading further; you'll need it.
I found it pretty surprising that bzip2 runs at a 10x slowdown
on nulgrind, even with t-chaining engaged. I know that program
pretty well :) and it's completely compute-bound, no I/O, no malloc,
no weird stuff.
So I've just started messing with the attached program, designed
to do some computation, have an easily-located inner loop :) and to
minimise absolutely all other effects.
On my PIII-1.13 GHz box, compiled -O -g, it takes 0.250 s natively
and 1.56 seconds on nulgrind, with bb chaining on and INCEIPs
not generated (i.e., our max-performance settings so far).
It runs for 25.2 million bbs, btw.
[Attached below are: the C code; the original x86, ucode, and generated
code for the inner loop; and the same with intervening junk removed
for easier reading.]
That's a lossage factor of 6.2:1. This _seems_ quite good, until
we start looking at the translation of the inner loop. The original
is 12 (action-packed) x86 insns. The expected-case path through the
translation is 44 x86 insns + ud2; ud2; nop, possibly about 42
in total once the patching has been done.
Now I'm at a bit of a loss. Although insn counts are notoriously
crappy for predicting execution times, it seems a little weird that
increasing the number of insns by a factor of 3.5 increases the
cycle count by 6.2 times. The program should run, both natively
and on V, with zero misses in any caches. And, once chained, this
bb is jumping back to itself directly, so nothing else is going on.
So where the @*%£&?! are my missing cycles?
Particularly in view of the fact that the generated code is very
much in x86 RISC subset, and so should generate fewer uops/insn
than the original. For example, it contains several "movl (%esi), %esi".
Let's recap.
- There are no cache misses.
- There are no unpredictable branches.
- Each insn [in the resulting translation] needs either one
Address-Generation-Unit (leal), one Load-Store-Unit (movl) or one
ALU (add, sub, xor). So the bulk of it is unlikely to be stalling
due to a shortage of functional units.
- Perhaps stalls due to result forwarding? But the problem with
that theory is that the translation
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and the original
addl 0xC(%reg1,%reg2),%reg3
would, I'd guess, require the same functional units chained in the
same order (AGU -> LSU -> ALU) ... who knows?
- Decode limitations? PIII (P6 core) has this weird 4-1-1 rule, I don't
know if that's a problem. Surely, there are 8 groups of
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and each would fit in a 4-1-1 uop pattern?
- So perhaps there are some rare insns which completely serialise the
machine? We've got two pushfls and two popfls (partially redundant, too)
and they are pretty rare, hence might get executed out of ucode ROM.
I'm mystified. Someone with a cycle-accurate PIII model needs to run this
and see where it stalls.
Nick, what sort of time ratios do you get on your Athlonic device for
this?
[useless numbers: 25.0 million trips thru this loop.
Natively, this is 12 orig insns/iteration, = 300 million
insns. In 0.250 seconds, this is 1200 MIPS; at 1133 MHz,
that's a CPI of 0.94, which is pretty much Not Bad.
Translated, this is approx 42 insns/iteration, = 1050
million insns. In 1.56 seconds, this is 673 MIPS; at 1133 MHz,
that's a CPI of 1.68. This is significantly worse for no
apparent reason.
]
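The MIPS/CPI arithmetic above can be re-derived mechanically. A
throwaway C sketch (all the figures are copied straight from the
message; nothing else is assumed):

```c
#include <assert.h>

/* CPI = clock_Hz * seconds / instructions executed */
static double cpi(double clock_hz, double secs, double insns)
{
    return clock_hz * secs / insns;
}

/* millions of instructions per second */
static double mips(double insns, double secs)
{
    return insns / secs / 1e6;
}
```

Plugging in the native figures (300 million insns, 0.250 s, 1133 MHz)
gives 1200 MIPS and CPI 0.94; the translated figures (1050 million
insns, 1.56 s) give 673 MIPS and CPI 1.68, matching the numbers quoted.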
I reckon there's a large heavy object lurking in the undergrowth
here. Now we just need to find it. Anyone fancy lifting out
the translated loop, giving it a suitable context and running it
on VTune, to find out? Is oprofile able to tell us exactly
which insn is stalling?
Congratulations to anyone who bothered to read this far.
J
---------------------------------------------------------------------------
Original C code:
#include <stdio.h>
int i_fit_comfortably_in_D1[1000];
int timewaster ( void )
{
   int i, j, sum;
   for (i = 0; i < 1000; i++) {
      i_fit_comfortably_in_D1[i] = i * 654321;
   }
   sum = 0;
   for (j = 0; j < 200000; j++) {
      for (i = 0; i < 1000; i += 8) {
         sum += i_fit_comfortably_in_D1[i];
         sum ^= i_fit_comfortably_in_D1[i+1];
         sum -= i_fit_comfortably_in_D1[i+2];
         sum += i_fit_comfortably_in_D1[i+3];
         sum += i_fit_comfortably_in_D1[i+4];
         sum ^= i_fit_comfortably_in_D1[i+5];
         sum -= i_fit_comfortably_in_D1[i+6];
         sum += i_fit_comfortably_in_D1[i+7];
      }
   }
   return sum;
}
int main ( void )
{
   printf ("the answer, you will be fascinated to know, is ... ");
   fflush(stdout);
   printf("0x%x\n", timewaster() );
   return 0;
}
---------------------------------------------------------------------------
Original x86 code for inner loop:
0x8048508: leal 0x0(,%ecx,4), %eax
0x804850F: addl (%eax,%ebx,1),%edx
0x8048512: xorl 4(%ebx,%eax,1),%edx
0x8048516: subl 8(%ebx,%eax,1),%edx
0x804851A: addl 12(%ebx,%eax,1),%edx
0x804851E: addl 16(%ebx,%eax,1),%edx
0x8048522: xorl 20(%ebx,%eax,1),%edx
0x8048526: subl 24(%ebx,%eax,1),%edx
0x804852A: addl 28(%ebx,%eax,1),%edx
0x804852E: addl $0x8, %ecx
0x8048531: cmpl $0x3E7, %ecx
0x8048537: jle-8 0x8048508
---------------------------------------------------------------------------
Cleaned-up ucode for inner loop:
0: GETL %ECX, t2
1: MOVL $0x0, t0
2: LEA2L 0(t0,t2,4), t0
3: PUTL t0, %EAX
4: GETL %EBX, t6
6: LEA2L 0(t0,t6,1), t4
7: LDL (t4), t4
8: ADDL %EDX, t4
12: LEA2L 4(t6,t0,1), t10
13: LDL (t10), t10
14: XORL t4, t10
18: LEA2L 8(t6,t0,1), t16
19: LDL (t16), t16
21: SUBL t16, t10
25: LEA2L 12(t6,t0,1), t24
26: LDL (t24), t24
27: ADDL t10, t24
31: LEA2L 16(t6,t0,1), t30
32: LDL (t30), t30
33: ADDL t24, t30
37: LEA2L 20(t6,t0,1), t36
38: LDL (t36), t36
39: XORL t30, t36
43: LEA2L 24(t6,t0,1), t42
44: LDL (t42), t42
46: SUBL t42, t36
50: LEA2L 28(t6,t0,1), t50
51: LDL (t50), t50
52: ADDL t36, t50
53: PUTL t50, %EDX
55: ADDL $0x8, t2
56: PUTL t2, %ECX
58: SUBL $0x3E7, t2 (-wOSZACP)
59: Jleo $0x8048508 (-rOSZACP)
60: JMPo $0x8048539 ($2)
---------------------------------------------------------------------------
Generated x86 code for inner loop:
0: FF 0D 0C FD 0D 40
decl (0x400DFD0C)
6: 75 06
jnz-8 %eip+6
8: BD 1D 00 00 00
movl $0x1D, %ebp
13: C3
ret
0: GETL %ECX, %eax [a-----]
14: 8B 45 04
movl 0x4(%ebp), %eax
1: MOVL $0x0, %ebx [ab----]
17: 31 DB
xorl %ebx, %ebx
2: LEA2L 0(%ebx,%eax,4), %ebx [ab----]
19: 8D 5C 83 00
leal 0x0(%ebx,%eax,4), %ebx
3: PUTL %ebx, %EAX [ab----]
23: 89 5D 00
movl %ebx, 0x0(%ebp)
4: GETL %EBX, %ecx [abc---]
26: 8B 4D 0C
movl 0xC(%ebp), %ecx
5: LEA2L 0(%ebx,%ecx,1), %edx [abcd--]
29: 8D 54 0B 00
leal 0x0(%ebx,%ecx,1), %edx
6: LDL (%edx), %edx [abcd--]
33: 8B 12
movl (%edx), %edx
7: ADDL %EDX, %edx [abcd--]
35: 03 55 08
addl 0x8(%ebp), %edx
8: LEA2L 4(%ecx,%ebx,1), %esi [abcdS-]
38: 8D 74 19 04
leal 0x4(%ecx,%ebx,1), %esi
9: LDL (%esi), %esi [abcdS-]
42: 8B 36
movl (%esi), %esi
10: XORL %edx, %esi [abc-S-]
44: 31 D6
xorl %edx, %esi
11: LEA2L 8(%ecx,%ebx,1), %edi [abc-SD]
46: 8D 7C 19 08
leal 0x8(%ecx,%ebx,1), %edi
12: LDL (%edi), %edi [abc-SD]
50: 8B 3F
movl (%edi), %edi
13: SUBL %edi, %esi [abc-S-]
52: 29 FE
subl %edi, %esi
14: LEA2L 12(%ecx,%ebx,1), %edx [abcdS-]
54: 8D 54 19 0C
leal 0xC(%ecx,%ebx,1), %edx
15: LDL (%edx), %edx [abcdS-]
58: 8B 12
movl (%edx), %edx
16: ADDL %esi, %edx [abcd--]
60: 01 F2
addl %esi, %edx
17: LEA2L 16(%ecx,%ebx,1), %esi [abcdS-]
62: 8D 74 19 10
leal 0x10(%ecx,%ebx,1), %esi
18: LDL (%esi), %esi [abcdS-]
66: 8B 36
movl (%esi), %esi
19: ADDL %edx, %esi [abc-S-]
68: 01 D6
addl %edx, %esi
20: LEA2L 20(%ecx,%ebx,1), %edx [abcdS-]
70: 8D 54 19 14
leal 0x14(%ecx,%ebx,1), %edx
21: LDL (%edx), %edx [abcdS-]
74: 8B 12
movl (%edx), %edx
22: XORL %esi, %edx [abcd--]
76: 31 F2
xorl %esi, %edx
23: LEA2L 24(%ecx,%ebx,1), %esi [abcdS-]
78: 8D 74 19 18
leal 0x18(%ecx,%ebx,1), %esi
24: LDL (%esi), %esi [abcdS-]
82: 8B 36
movl (%esi), %esi
25: SUBL %esi, %edx [abcd--]
84: 29 F2
subl %esi, %edx
26: LEA2L 28(%ecx,%ebx,1), %esi [a--dS-]
86: 8D 74 19 1C
leal 0x1C(%ecx,%ebx,1), %esi
27: LDL (%esi), %esi [a--dS-]
90: 8B 36
movl (%esi), %esi
28: ADDL %edx, %esi [a---S-]
92: 01 D6
addl %edx, %esi
29: PUTL %esi, %EDX [a-----]
94: 89 75 08
movl %esi, 0x8(%ebp)
30: ADDL $0x8, %eax [a-----]
97: 83 C0 08
addl $0x8, %eax
31: PUTL %eax, %ECX [a-----]
100: 89 45 04
movl %eax, 0x4(%ebp)
32: SUBL $0x3E7, %eax (-wOSZACP) [------]
103: FF 75 20 9D
pushl 32(%ebp) ; popfl
107: 81 E8 E7 03 00 00
subl $0x3E7, %eax
113: 9C 8F 45 20
pushfl ; popl 32(%ebp)
33: Jleo $0x8048508 (-rOSZACP) [------]
117: FF 75 20 9D
pushl 32(%ebp) ; popfl
121: 7F 0D
jnle-8 %eip+13
123: B8 08 85 04 08
movl $0x8048508, %eax
128: 89 45 24
movl %eax, 0x24(%ebp)
131: 0F 0B 0F 0B 90
ud2; ud2; nop
34: JMPo $0x8048539 ($2) [------]
136: B8 39 85 04 08
movl $0x8048539, %eax
141: 89 45 24
movl %eax, 0x24(%ebp)
144: 0F 0B 0F 0B 90
ud2; ud2; nop
---------------------------------------------------------------------------
Generated x86 code for inner loop, with crud removed, and bb start points
distinguished by a blank line.
decl (0x400DFD0C)
jnz-8 %eip+6
movl $0x1D, %ebp
ret
movl 0x4(%ebp), %eax
xorl %ebx, %ebx
leal 0x0(%ebx,%eax,4), %ebx
movl %ebx, 0x0(%ebp)
movl 0xC(%ebp), %ecx
leal 0x0(%ebx,%ecx,1), %edx
movl (%edx), %edx
addl 0x8(%ebp), %edx
leal 0x4(%ecx,%ebx,1), %esi
movl (%esi), %esi
xorl %edx, %esi
leal 0x8(%ecx,%ebx,1), %edi
movl (%edi), %edi
subl %edi, %esi
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
leal 0x14(%ecx,%ebx,1), %edx
movl (%edx), %edx
xorl %esi, %edx
leal 0x18(%ecx,%ebx,1), %esi
movl (%esi), %esi
subl %esi, %edx
leal 0x1C(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
movl %esi, 0x8(%ebp)
addl $0x8, %eax
movl %eax, 0x4(%ebp)
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
movl $0x8048508, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
movl $0x8048539, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
|
|
From: Julian S. <js...@ac...> - 2002-11-21 09:01:36
|
It's clear we need to do something to fix this. This is a proposal with
two variants: a less ambitious variant (first) and a more ambitious one
(second).

Variant 1 (less ambitious)
~~~~~~~~~~~~~~~~~~~~~~~~~~

We create three new functions, all of them predicates on UInstrs:

1  Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
2  Bool uinstr_maybe_trashes_realFPU    ( UInstr* );
3  Bool uinstr_maybe_reads_simdEIP      ( UInstr* );

which tell you whether or not a uinstr (more correctly, the code
generated for that uinstr) **might** (1) alter the real machine's
%eflags, (2) alter the real machine's FPU state, and (3) require to
see the %EIP of the simulated machine.

Getting a safe-if-conservative result for (1) and (2) is critical for
translation correctness. For (3) it just affects the accuracy of stack
traces. So the safe thing is to return True from all three functions,
yet we want to return False as often as we safely can. Classical
compiler-analysis conservative-estimate stuff.

Further, for all skins which extend ucode, we add suitable functions
to the core/skin interface so that the same questions can be asked of
the extended ucodes.

With fn (3), the redundant-INCEIP-removal phase can be reinstated for
all skins, and should work correctly. The reason it's commented out is
precisely because at present it can't reliably ask this question.

Fn (2) would be used to formally and cleanly support Jeremy's lazy FPU
state save/restore, which I think is rolled into the code generation
loop. I would like to be able to ship this optimisation with confidence
that I understand the ramifications.

Fn (1) would dually be used to support lazy %EFLAGS save/restore in the
same manner as the FPU.

None of this would be hard to implement, and it would support a useful
bunch of optimisations.
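A minimal sketch of what one such conservative predicate could look
like. Note the opcode enum and the safe/unsafe classification below are
illustrative stand-ins, not the real Valgrind UInstr definitions; the
point is only the "unknown means True" shape of the analysis:

```c
#include <stdbool.h>

/* Hypothetical subset of ucode opcodes -- NOT the real Valgrind enum. */
typedef enum { GETL, PUTL, MOVL, LEA2L, LDL, ADDL, INCEIP, CCALL } Opcode;

typedef struct { Opcode opcode; } UInstr;

/* Conservative predicate: returns false only for opcodes whose
   generated code is known to leave %eflags alone; anything
   unrecognised (e.g. a skin extension) is assumed to trash them. */
static bool uinstr_maybe_trashes_realEFLAGS(const UInstr* u)
{
    switch (u->opcode) {
        case GETL: case PUTL: case MOVL: case LEA2L: case LDL:
            return false;   /* moves, lea, loads: flag-free code */
        case ADDL: case INCEIP:
            return true;    /* emitted arithmetic touches flags */
        default:
            return true;    /* unknown => assume the worst */
    }
}
```

A skin extending ucode would supply the same predicate for its own
opcodes via the core/skin interface, defaulting to True.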
Variant 2 (more ambitious)
~~~~~~~~~~~~~~~~~~~~~~~~~~

We create two new functions, both of them predicates on UInstrs:

1  Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
2  Bool uinstr_maybe_trashes_realFPU    ( UInstr* );

We do the lazy FPU and EFLAGS save/restore optimisation exactly as
above. The difference here is that we do table-based EIP reconstruction
as needed and completely nuke INCEIP (yay!)

So, precisely how to reconstruct %EIP from %eip? Like this. Firstly,
%EIP must be made up-to-date at the start of each bb, but it is anyway,
so that's free. Now suppose we are in the middle of a bb and want to
know the current %EIP. We actually know the block-start %EIP and the
current %eip. Using the block-start %EIP we can look in the translation
table (TT) to find info about this block: the start address of its
translation, and presumably whereabouts its %eip -> %EIP mapping table
is. Knowing where the translation starts (%eip for the start of the
translated block) and knowing the current %eip, we can figure out how
far inside the translation we are. That offset can then be used to
index (somehow) the mapping table for this block, to give the
corresponding %EIP offset, which, when added to the block-start %EIP,
gives the current %EIP. Makes sense?

So the only questions are how to compactly encode the mapping tables,
and perhaps where to store them, but that's not a big deal.

Comments? If we could do variant 2 rather than variant 1 that would be
cool, considering it would give a bigger speedup overall.

J
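To make the reconstruction concrete, here is a sketch with an invented
per-block table format: sorted pairs of (translation offset, offset of
the original insn past block-start %EIP). The names, the layout, and
the linear scan are all assumptions for illustration, not a committed
design:

```c
typedef unsigned int Addr;

/* Hypothetical per-block map entry: from translation offset t_off
   onwards, the current original insn starts at orig_off bytes past
   the block-start %EIP.  Entries are sorted by t_off. */
typedef struct { unsigned t_off; unsigned orig_off; } MapEntry;

static Addr reconstruct_EIP(Addr block_EIP,     /* block-start %EIP       */
                            Addr trans_start,   /* %eip of translation    */
                            Addr cur_eip,       /* current real %eip      */
                            const MapEntry* map, int n)
{
    unsigned off = (unsigned)(cur_eip - trans_start);
    unsigned orig_off = 0;
    /* find the last entry whose t_off <= off; a real implementation
       could binary-search or delta-encode this */
    for (int i = 0; i < n; i++) {
        if (map[i].t_off <= off) orig_off = map[i].orig_off;
        else break;
    }
    return block_EIP + orig_off;
}
```

The encoding question then becomes how small these entries can be made;
most deltas on both sides fit comfortably in a byte.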
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 10:26:01
|
On Thu, 21 Nov 2002, Julian Seward wrote:
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> Fn (1) would dually be used to support lazy %EFLAG save restore in the
> same manner as the FPU.
Can I just confirm my understanding of the lazy %EFLAG optimisation?
Currently, AIUI, a lot of basic blocks (pretty much all those that end in
a conditional jump) end like this:
37: ANDL %eax, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %eax, %ebx
pushfl ; popl 32(%ebp) ***
38: INCEIPo $2
addl $2, 36(%ebp)
39: Jzo $0x402047AC (-rOSZACP)
pushl 32(%ebp) ; popfl ***
jnz-8 %eip+6
movl $0x402047AC, %eax
ret
If-and-only-if the INCEIP addl is removed, then the 4 instructions in the
two lines marked *** can be trivially removed.
Looking through --skin=none code, the only place where this seems
applicable is for conditional jumps -- arithmetic ops in the middle of
basic blocks don't look like they can have their %EFLAGs fiddling avoided.
One downer: I'm not sure if this will work for MemCheck. Consider the
instrumented code in the same situation:
30: ANDL %edx, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %edx, %ebx
pushfl ; popl 32(%ebp)
31: INCEIPo $2
addl $2, 104(%ebp)
32: GETVFo %eax
movl 0x44(%ebp), %eax
33: TESTVo %eax
testb $0x1, %al
jz-8 %eip+3
call * 76(%ebp)
34: Jzo $0x4000D2C4 (-rOSZACP)
pushl 32(%ebp) ; popfl
jnz-8 %eip+6
movl $0x4000D2C4, %eax
ret
Ignoring the INCEIP, the arithmetic op and the conditional jump are no
longer adjacent.
Here's the equivalent code for Cachegrind:
9: ANDL %edi, %edx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %edi, %edx
pushfl ; popl 32(%ebp)
10: MOVL $0x40269DC0, %eax
movl $0x40269DC0, %eax
11: CCALLo 0x4001C170(%eax)
call * 36(%ebp)
12: INCEIPo $2
addl $2, 60(%ebp)
13: MOVL $0x40269DE0, %eax
movl $0x40269DE0, %eax
14: CCALLo 0x4001C170(%eax)
call * 36(%ebp)
15: Jnzo $0x4000D190 (-rOSZACP)
pushl 32(%ebp) ; popfl
jz-8 %eip+6
movl $0x4000D190, %eax
ret
Again the optimisation is foiled.
It would work for AddrCheck and Helgrind though.
It seems that we would be getting into a situation where the
instrumentation you add could affect what optimisations can occur; in one
way this is reasonable but in another way it kind of sucks.
Or have I misunderstood something?
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
> [...]
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
I support this. The complexity will be localised pretty well, which is
good.
BTW, am I right in thinking that the INCEIP instruction will stay in
UCode, but that it will just be removed during the codegen phase?
Because skins (eg. Cachegrind) use it as a way of determining the
boundaries of the original x86 instructions.
N
|
|
From: Julian S. <js...@ac...> - 2002-11-22 05:07:01
|
So, I've read all of today's email messages, I think.
Nick, it looks like you've picked up some good issues in today's hackery.
Let me try and re-draw the picture in light of them.
We're trying to fix 3 performance-sapping problems:
1. too many INCEIPs
2. too many eflags save/restores
3. too many fpu save/restores
3 is pretty much solved already by Jeremy's work. 1 and 2 are still open.
1. INCEIPs
~~~~~~~~~~~
The best way to handle the INCEIP expense is replace it all with a
table lookup. As Jeremy suggests, it's probably simpler to have one
big table than one table per bb, and I'm probably inclined in favour
of that.
The problem appears to be how to get hold of %eip (the real one) when
a skin calls a helper function, in a way which is not skin-specific.
I thought about this this evening for some considerable time, but I
can't think of a clean solution which doesn't involve some amount of
performance loss. The least-worst thing I thought of was to associate
with CCALL uinstrs a boolean flag, which when set automagically causes
the current %eip to get passed to the called function as an extra
parameter. This at least makes it skin-independent.
Unfortunately, it adds extra expense for almost all of the calls made
by cachegrind, memcheck and addrcheck, since most of those can potentially
report an error and so need %eip. This is not good. So I don't like
this idea.
2. eflags save/restores
~~~~~~~~~~~~~~~~~~~~~~~~
The second thing Nick has unearthed is that the lazy eflag optimisation
will often not apply, due to intervening flag-mangling operations.
[Nick: yes, your understanding of how this is supposed to work appears
to be correct]. This is bad; but I'd like to do better, because we're
really getting clobbered by these pushf/popf insns.
So here's ...
Yet another proposal
~~~~~~~~~~~~~~~~~~~~
If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
should back off from the %EIP-by-tables story; in effect fall back to
Variant (1) in this morning's proposal. That means retaining INCEIP, at
least when we can't get rid of redundant updates.
I've realised that perhaps INCEIP doesn't have very clever semantics.
The problem is that (1) it turns into (eg) addl $2, 36(%ebp), which
means a load, an ALU op, and a store, and (2) it trashes the flags.
[S1: suggestion of PUTEIP]
How about if we had instead PUTEIP, which simply sets the value of %EIP
to some literal value? So, we just emit PUTEIPs with absolute addresses
where currently we create INCEIPs with deltas. PUTEIP then becomes
something like
movl $0x8048517, 36(%ebp)
which is a longer insn (7 bytes?), but is still a fastpath decode at least
on Athlon. It has the advantage of not requiring a load or an ALU op, so is
less resource-hungry. And best of all, it is flag-unaffecting, so that we
can now do the lazy eflags stuff without getting messed up by them.
[S2: suggestion of smarter code for Jz et al]
Another thing which would surely help is to generate smarter code for
the Jz / Jnz UINSTRS. Observation to be made here is that at least for
the Zero flag, we might as well directly test the bit in %EFLAGS in
memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
%eflags: [note; this illustration is without t-chaining, but is equally
valid in t-chaining's presence]
Jzo $addr
-->
testl $(1<<N), 32(%ebp) -- 32(%ebp) is %EFLAGS
-- and N is the offset of the Z flag in
-- %EFLAGS
jz-8 %eip+6
movl $addr, %eax
ret
%eip+6:
in contrast to the current scheme
pushl 32(%ebp)
popfl
jnz-8 %eip+6
movl $addr, %eax
ret
%eip+6:
In fact the new scheme not only avoids the horrible popfl; it also
saves an insn!
Some complicated tests (not-below, etc) would still have to be done the
old way, but we could cover the tests associated with single flags
(zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
catch most _uses_ of the condition codes.
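For reference, the single flags sit at fixed, architecturally-defined
bit positions in EFLAGS (CF is bit 0, ZF bit 6, SF bit 7), which is
what makes the direct memory test possible. In C, the test the
proposed Jzo code performs is simply (function name made up):

```c
/* x86 EFLAGS bit positions, from the IA-32 manuals */
#define EFLAGS_CF (1u << 0)
#define EFLAGS_ZF (1u << 6)
#define EFLAGS_SF (1u << 7)

/* What "testl $(1<<6), 32(%ebp); jnz taken" computes, in C */
static int jz_taken(unsigned simulated_EFLAGS)
{
    return (simulated_EFLAGS & EFLAGS_ZF) != 0;
}
```

so N in the testl above would be 6 for Jz/Jnz, 7 for Js/Jns, and 0 for
Jb/Jnb.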
[S3: suggestion of uinstr reordering]
Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
created by memcheck and cachegrind will often get in the way of the lazy
eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
specifically) to generate instrumentation a bit more cleverly so as to try
and keep condition code generators and users together more often. Dunno if
this will help.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-22 02:41:14
|
On Thu, 2002-11-21 at 16:08, Julian Seward wrote:
> 1. INCEIPs
> ~~~~~~~~~~~
> The best way to handle the INCEIP expense is replace it all with a
> table lookup. As Jeremy suggests, it's probably simpler to have one
> big table than one table per bb, and I'm probably inclined in favour
> of that.
Eh? No I think one table per BB is easiest to handle, particularly if
we have EIP at least up to date to the current basic block.
> The problem appears to be how to get hold of %eip (the real one) when
> a skin calls a helper function, in a way which is not skin-specific.
> I thought about this this evening for some considerable time, but I
> can't think of a clean solution which doesn't involve some amount of
> performance loss. The least-worst thing I thought of was to associate
> with CCALL uinstrs a boolean flag, which when set automagically causes
> the current %eip to get passed to the called function as an extra
> parameter. This at least makes it skin-independent.
But we have a VG_(get_EIP)() function. Why can't skins just use that,
and we hide all the complexity in that function? Why does a skin need
to know about the hardware %eip?
> Unfortunately, it adds extra expense for almost all of the calls made
> by cachegrind, memcheck and addrcheck, since most of those can potentially
> report an error and so need %eip. This is not good. So I don't like
> this idea.
Memcheck only needs %EIP when actually reporting an error though.
> 2. eflags save/restores
> ~~~~~~~~~~~~~~~~~~~~~~~~
> The second thing Nick has unearthed is that the lazy eflag optimisation
> will often not apply, due to intervening flag-mangling operations.
> [Nick: yes, your understanding of how this is supposed to work appears
> to be correct]. This is bad; but I'd like to do better, because we're
> really getting clobbered by these pushf/popf insns.
I had an idea about restoring eflags without pushf/popf: just rerun the
setting instruction. Since (almost?) all the instructions which
generate eflags are otherwise side-effect free and don't depend on
memory, why not just rerun them on the same values just before the
jump? (Or perhaps move it altogether, but that might be trickier than
just copying the instruction.)
I haven't looked at this in detail, but it's an amusing idea...
Hm, I guess making sure the flags are correct for the end of the BB
still needs them to be explicitly saved. Which I suppose we could do in
the form of saving a piece of code which has the right effect, but
that's probably going a bit far...
> If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
> should back off from the %EIP-by-tables story; in effect fall back to
> Variant (1) in this mornings proposal. That means retaining INCEIP, at
> least when we can't get rid of redundant updates.
I still don't think it's hard to efficiently implement an EIP table
lookup.
> How about if we had instead PUTEIP, which simply sets the value of %EIP
> to some literal value? So, we just emit PUTEIPs with absolute addresses
> where currently we create INCEIPs with deltas. PUTEIP then becomes
> something like[...]
I do this in the t-chaining patch. Just before each jump, it generates:
movl $target-addr, %eax
movl %eax, 36(%ebp)
It means that if the chaining patch fails, it can still just ret into
the dispatch loop, and %eax is set up properly, as well as having %EIP
up to date in case there's a context switch.
I currently generate this inline, but I was thinking of adding a SETEIP
with these semantics. This would also be useful to distinguish basic
blocks in an extended basic block when doing trace caching.
> [S2: suggestion of smarter code for Jz et al]
>
> Another thing which would surely help is to generate smarter code for
> the Jz / Jnz UINSTRS. Observation to be made here is that at least for
> the Zero flag, we might as well directly test the bit in %EFLAGS in
> memory [...]
> Some complicated tests (not-below, etc) would still have to be done the
> old way, but we could cover the tests associated with single flags
> (zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
> catch most _uses_ of the condition codes.
That's nice and simple to implement. It's a good start. Has anyone
looked to see what the proportions of the different comparisons are? I
would expect Z and NZ would be way up there...
Hm, looking at the definitions of the tests in the manual, it seems that
it would be pretty easy to generate open-coded equivalents for most of
them in much less than the 12 cycles to do popf...
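As a sketch of what "open-coded equivalents" means here: the IA-32
manuals define each condition in terms of individual flag bits, e.g.
Jle is taken iff ZF=1 or SF != OF, and Jb iff CF=1. Expressed in C
over a saved %EFLAGS word (the bit positions are architectural; the
function names are made up):

```c
/* x86 EFLAGS bit positions, from the IA-32 manuals */
#define CF (1u << 0)
#define ZF (1u << 6)
#define SF (1u << 7)
#define OF (1u << 11)

/* Jle: taken iff ZF=1 or SF != OF */
static int cond_le(unsigned f)
{
    return ((f & ZF) != 0) || (((f & SF) != 0) != ((f & OF) != 0));
}

/* Jb: taken iff CF=1 */
static int cond_b(unsigned f)
{
    return (f & CF) != 0;
}
```

Each of these is a handful of flag-free ALU ops on the in-memory
%EFLAGS word, so none of them should need a popf.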
> Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
> created by memcheck and cachegrind will often get in the way of the lazy
> eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
> specifically) to generatee instrumentation a bit more cleverly so as to try
> and keep condition code generators and users together more often. Dunno if
> this will help.
This is along the lines of my suggestion above of duplicating the
flags-setting instruction to just before the test. It's a little bit
like Shade's use of carefully chosen instructions to set the flags to
the desired value.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-22 12:37:23
|
On Fri, 22 Nov 2002, Julian Seward wrote:
> [S2: suggestion of smarter code for Jz et al]
>
> Another thing which would surely help is to generate smarter code for
> the Jz / Jnz UINSTRS. Observation to be made here is that at least for
> the Zero flag, we might as well directly test the bit in %EFLAGS in
> memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
> %eflags: [note; this illustration is without t-chaining, but is equally
> valid in t-chaining's presence]
>
> Jzo $addr
> -->
> testl $(1<<N), 32(%ebp) -- 32(%ebp) is %EFLAGS
> -- and N is the offset of the Z flag in
> -- %EFLAGS
> jz-8 %eip+6
> movl $addr, %eax
> ret
> %eip+6:
>
> in contrast to the current scheme
>
> pushl 32(%ebp)
> popfl
> jnz-8 %eip+6
> movl $addr, %eax
> ret
> %eip+6:
>
> In fact the new scheme not only avoids the horrible popfl; it also
> saves an insn!
Nice idea. But doesn't it clash with the lazy EFLAGs idea? viz:
37: ANDL %eax, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %eax, %ebx
pushfl ; popl 32(%ebp) ***
39: Jzo $0x402047AC (-rOSZACP)
pushl 32(%ebp) ; popfl ***
jnz-8 %eip+6
movl $0x402047AC, %eax
ret
If Jzo doesn't have the "pushl 32(%ebp) ; popfl" then you can't remove the
"pushfl ; popl 32(%ebp)" from ANDL -- instead of cutting 4 instructions,
it only cuts 2.
Another thing: will the ANDL's "pushfl ; popl 32(%ebp)" always be
necessary? Ie. do the condition codes have to be saved in EFLAGs before
the end of the BB? If so, then lazy EFLAG updating can only save two
instructions, not four, in which case the new Jz/Jnz idea seems better
because it saves another instruction anyway...
> Some complicated tests (not-below, etc) would still have to be done the
> old way, but we could cover the tests associated with single flags
> (zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
> catch most _uses_ of the condition codes.
FYI, from "valgrind --skin=none --trace-codegen=00001 true":
all jumps 1908
JMPo 916
Jzo 407
Jnzo 150
JMPo-c 177
JMPo-r 121
Jle 20
Jnbo 25
Jbo 13
Jnbeo 12
Jns 16
Jnleo 9
JMPo-sys 7
sum of those listed: 1873 (35 not mentioned)
Of the 726 Jccs, 77% are Jz/Jnz.
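(A trivial check of that percentage, using only the counts quoted
above:)

```c
/* Share of conditional jumps that are Jz/Jnz */
static double jcc_share(int jz, int jnz, int total_jcc)
{
    return 100.0 * (jz + jnz) / total_jcc;
}
```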
N
|
|
From: Julian S. <js...@ac...> - 2002-11-22 07:00:18
|
> > The problem appears to be how to get hold of %eip (the real one) when
> > a skin calls a helper function, in a way which is not skin-specific.
> > I thought about this this evening for some considerable time, but I
> > can't think of a clean solution which doesn't involve some amount of
> > performance loss. The least-worst thing I thought of was to associate
> > with CCALL uinstrs a boolean flag, which when set automagically causes
> > the current %eip to get passed to the called function as an extra
> > parameter. This at least makes it skin-independent.
>
> But we have a VG_(get_EIP)() function. Why can't skins just use that,
> and we hide all the complexity in that function? Why does a skin need
> to know about the hardware %eip?
Um, I'm confused. I don't understand how to solve the following problem:
memcheck calls a helper function to do a check, concludes there's an
error and needs %EIP. How exactly do you propose to generate it if you
don't know %eip at the point where the helper was called?
> That's nice and simple to implement. It's a good start. Has anyone
> looked to see what the proportions of the different comparisons are? I
> would expect Z and NZ would be way up there...
I bet Hennessy and Patterson has this info. I'll look.
J
|
From: Jeremy F. <je...@go...> - 2002-11-22 07:08:50
|
On Thu, 2002-11-21 at 23:07, Julian Seward wrote:
> Um, I'm confused. I don't understand how to solve the following problem:
> memcheck calls a helper function to do a check, concludes there's an error
> and needs %EIP. How exactly do you propose to generate it if you don't
> know %eip at the point where the helper was called?
Oh, sorry. I was assuming you'd store that in a baseblock field before
the call. Hm, I guess that would need a trampoline to do it properly:
movl %ccall-func, %eax
call ccall-tramp
[...]
ccall-tramp:
movl (%esp), %ebx
movl %ebx, XX(%ebp)
jmp *%eax
Or perhaps do it completely inline. Maybe it's all a bit clumsy.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-22 07:51:04
|
On Friday 22 November 2002 7:08 am, Jeremy Fitzhardinge wrote:
> On Thu, 2002-11-21 at 23:07, Julian Seward wrote:
> > Um, I'm confused. I don't understand how to solve the following problem:
> > memcheck calls a helper function to do a check, concludes there's an
> > error and needs %EIP. How exactly do you propose to generate it if you
> > don't know %eip at the point where the helper was called?
>
> Oh, sorry. I was assuming you'd store that in a baseblock field before
> the call. Hm, I guess that would need a trampoline to do it properly:
>
> movl %ccall-func, %eax
> call ccall-tramp
> [...]
>
> ccall-tramp:
> movl (%esp), %ebx
> movl %ebx, XX(%ebp)
> jmp *%eax
>
> Or perhaps do it completely inline. Maybe it's all a bit clumsy.
Sorry to be stupid. I'm still confused. In that case, wouldn't it be
simpler just to set %EIP to the correct value before the call, and
completely avoid all the hassle of establishing %eip and converting it
to %EIP? I think I must be missing something.
J
|
From: Jeremy F. <je...@go...> - 2002-11-22 16:06:39
|
On Thu, 2002-11-21 at 23:58, Julian Seward wrote:
> Sorry to be stupid. I'm still confused. In that case, wouldn't be simpler
> just to set %EIP to the correct value before the call, and completely avoid
> all the hassle of establishing %eip and converting it to %EIP ? I think
> I must be missing something.
No, you're right, I'm being silly.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 15:43:27
|
Hi,
I started having a go at the INCEIP table lookup thing, but I got stuck...
On Thu, 21 Nov 2002, Julian Seward wrote:
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free. Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip.
How do we know what the current %eip is?
Consider an example. When a 'value' error occurs in the middle of a
BB under MemCheck:
gen code calls the asm routine MC_(helper_value_check4_fail)
which calls the C function MC_(helperc_value_check4_fail)()
which calls MC_(record_value_error)()
which eventually calls VG_(maybe_record_error)()
which eventually calls VG_(construct_error)()
which (as tst==NULL) records VG_(baseBlock)[VGOFF_(m_eip)]
If we remove the INCEIPs, VG_(baseBlock)[VGOFF_(m_eip)] will be out of
date -- it will point to the start of the original BB rather than the
exact problem instruction.
So how do we find out %eip in order to look up the snazzy new eip-->EIP
table? Well, it's buried deep down in the stack as the return address
(minus one instruction) for MC_(helper_value_check4_fail). But that's not
a good way to find it out.
Alternatively, during translation, the original %eip is known for each
original instruction and so it could be passed as an arg to
MC_(helper_value_check4_fail). But that would bloat code for TESTV (would
need to push the arg and then clear it after the call) and it doesn't feel
nice. But maybe it has to be done this way?
Or am I missing something obvious here?
N
|
|
From: Jeremy F. <je...@go...> - 2002-11-21 17:42:28
|
On Thu, 2002-11-21 at 01:08, Julian Seward wrote:
> It's clear we need to do something to fix this. This is a proposal
> with two variants; a less ambitious variant (first) and a more ambitious
> one (second).
>
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> which tell you whether or not a uinstr (more correctly, the code
> generated for that uinstr) **might** (1) alter the real machine's
> %eflags, (2) alter the real machine's FPU state, and (3) require
> to see the %EIP of the simulated machine.
>
> Getting a safe-if-conservative result for (1) and (2) is critical
> for translation correctness. For (3) it just affects the accuracy
> of stack traces. So the safe thing to return is True for all functions,
> yet we want to return False as often as we safely can. Classical
> compiler-analysis conservative-estimate stuff.
>
> Further, for all skins which extend ucode, we add suitable functions
> to the core/skin iface so that the same questions can be asked of the
> extended ucodes.
>
> With fn (3), the redundant INCEIP removal phase can be reinstated
> for all skins, and should work correctly. The reason it's commented
> out is precisely because at present it can't reliably ask this question.
>
> Fn (2) would be used to formally and cleanly support Jeremy's lazy FPU
> state save/restore, which I think is rolled into the code generation
> loop. I would like to be able to ship this optimisation with confidence
> that I understand the ramifications.
>
> Fn (1) would dually be used to support lazy %EFLAGS save/restore in the
> same manner as the FPU.
>
> None of this would be hard to implement, and they would support a useful
> bunch of optimisations.
>
>
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create two new functions, both of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
>
> We do the lazy FPU and EFLAGS save/restore optimisation exactly as
> above.
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free.
Only when using the dispatch loop. When using t-chaining, I currently
manually update EIP to the target EIP just before each chained jump. I
haven't measured how much this costs (because removing it completely
breaks context switches and makes things crash), but it would be nice to
get rid of it. It would need a separate eip->EIP translation table for
the start of basic blocks, then a second process to find the offset
within a basic block. If we don't do it too often, then simply
scanning the TTE table would be a reasonable way of doing the mapping
without extra data structures.
> Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip. Using the block-start %EIP we can look in the translation
> table (TT) to find info about this block: start address of its translation,
> and presumably whereabouts its %eip->%EIP mapping table is.
>
> Knowing where the translation starts (%eip for the start of the
> translated block) and knowing the current %eip, we can figure out how
> far inside the translation we are. That offset can then be used to
> index (somehow) the mapping table for this block, to give the corresponding
> %EIP offset, which, when added to the block-start %EIP, gives the current
> %EIP.
>
> Makes sense?
Yes. The interesting question is how far to go. If we want the ideal
of never having generated code update EIP, then everywhere where we use
%EIP we'd either need to use %eip or convert %eip to %EIP.
Since we already have a VG_(get_EIP) function, making that the clearing
house for all eip->EIP conversions should be pretty easy.
The interesting question is representing eip and EIP in the ThreadState
table. Unless we want to do the mapping on every context switch, the
ThreadStates would need to hold eip and compute EIP on demand. Of
course, since the eips can be invalidated when the translation cache is
cleaned up, we'd need to map all the eips to EIPs. Then, when
scheduling a new thread, we need to start it running at %eip (if valid)
or else %EIP.
Then there's the tricky case of the dispatch loop working in terms of
EIP (as it has to, given it's the first person to see code that's never
been seen before) and everything else working in eip. Have to be sure
to get all those transitions right.
Maybe we should at least keep EIP up to date at the basic block level,
so we only need to deal with the intra-BB case.
> So the only question is how to compactly encode the mapping table, and
> perhaps where to store them, but that's not a big deal.
I was thinking of a table with one entry per original instruction, which
contains the length of the original instruction and the length of the
corresponding generated code. It may be possible to pack this into 2
nibbles per instruction, with an escape mechanism to deal with rare
cases (ie, if length == 0xF then get the details from the following
byte). That could be built as the code is generated then tacked onto
the end of the generated code (the TTE would contain the generated code
length plus the length of the whole translation, including the
instruction mapping table). That way doing the mapping is a simple
linear scan:
cur_eip = tte->trans_addr;
cur_EIP = tte->orig_addr;
entry = 0;
while(eip > cur_eip) {
cur_eip += tab[entry].trans_len;
cur_EIP += tab[entry].orig_len;
entry++;
}
> Comments?
>
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
Dropping EIP updates is a big win. Bigger than chaining indirect jumps,
according to some initial measurements (though I haven't tested anything
which I'd expect to have lots of indirect jumps).
J
|
|
From: Julian S. <js...@ac...> - 2002-11-22 09:01:44
|
> Maybe we should at least keep EIP up to date at the basic block level,
> so we only need to deal with the intra-BB case.
As per recent msg to Nick, doing so at least makes it easy to find the
%eip (and hence %EIP) by scanning the stack for helper-function
return addresses. One PUTEIP per bb doesn't seem too onerous.
> I was thinking of a table with one entry per original instruction, which
> contains the length of the original instruction and the length of the
> corresponding generated code. It may be possible to pack this into 2
> nibbles per instruction, with an escape mechanism to deal with rare
> cases (ie, if length == 0xF then get the details from the following
> byte). That could be built as the code is generated then tacked onto
> the end of the generated code (the TTE would contain the generated code
> length plus the length of the whole translation, including the
> instruction mapping table). That way doing the mapping is a simple
> linear scan:
>
> cur_eip = tte->trans_addr;
> cur_EIP = tte->orig_addr;
> entry = 0;
> while(eip > cur_eip) {
> cur_eip += tab[entry].trans_len;
> cur_EIP += tab[entry].orig_len;
> entry++;
> }
I was thinking of the same thing. We could be simpler and just use a pair
of bytes (orig_insn_len, trans_insn_len) for each orig insn; this gives an
overhead of 2 bytes per orig insn, which is not too bad. (About 2 mbyte for
a run of Mozilla).
We can easily do a bit better with some kind of cleverer
encoding scheme as you mention, if that's needed [probably an array of
12-bit entities, 4 bits for orig len (they are all in the range 1 .. 17)
and 8 for translated length, with perhaps (N, 0xFF) indicating that the
entire next 12-bit group is a translated length, in the (appalling) case
that one orig insn generates more than 254 bytes of translation].
One comment is that it might be worth having in each TT entry a small number
of such slots, say 8. That way most basic blocks can store their table in
the TT entry. For the occasional large bb, TT will have to contain a ptr to
a table allocated in VG_(malloc)ville. This seems like a good idea because
there's a considerable space overhead using VG's malloc stuff -- about 6 words
per alloc -- so doing a large number of small malloc's is quite wasteful of
space.
Adding 8 slots to each TT entry covers most bbs, and would take 4 words
with the (byte,byte) encoding scheme and 3 words with the (4,8)
scheme.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-22 17:04:29
|
On Fri, 2002-11-22 at 01:08, Julian Seward wrote:
> One comment is that it might be worth having in each TT entry a small number
> of such slots, say 8. That way most basic blocks can store their table in
> the TT entry. For the occasional large bb, TT will have to contain a ptr to
> a table allocated in VG_(malloc)ville. This seems like a good idea because
> there's a considerable space overhead using VG's malloc stuff -- about 6 words
> per alloc -- so doing a large number of small malloc's is quite wasteful of
> space.
I was thinking about tacking the table onto the end of the generated
code, so it just lives in the translation cache. Unfortunately that
would break a nice property the current cache has: many chained jumps
are short, and their target is within the same cache line (because the
translation cache tends to lay basic blocks out in the order they were
executed). Still, I don't think that's a huge loss (definitely not
compared to removing INCEIP).
J
|