|
From: Julian S. <js...@ac...> - 2002-11-20 23:45:05
|
This is the second of two messages about t-chaining. Get a coffee
before reading further; you'll need it.
I found it pretty surprising that bzip2 runs at a 10x slowdown
on nulgrind, even with t-chaining engaged. I know that program
pretty well :) and it's completely compute-bound, no I/O, no malloc,
no weird stuff.
So I've just started messing with the attached program, designed
to compute, have an easily-located inner loop :) and to minimise
absolutely all other effects.
On my PIII-1.13 GHz box, compiled -O -g, it takes 0.250 s natively
and 1.56 seconds on nulgrind, with bb chaining on and INCEIPs
not generated (i.e., our max-performance settings so far).
It runs for 25.2 million bbs, btw.
[Attached below are: the C code, the original, Ucode, and generated
code for the inner loop, and the same with intervening junk removed
for easier reading].
That's a lossage factor of 6.2:1. This _seems_ quite good, until
we start looking at the translation of the inner loop. The original
is 12 (action-packed) x86 insns. The expected-case through the
translation is 44 x86 insns + ud2; ud2; nop, possibly about 42
in total once the patching has been done.
Now I'm at a bit of a loss. Although insn counts are notoriously
crappy for predicting execution times, it seems a little wierd that
increasing the number of insns by a factor of 3.5 increases the
cycle count by 6.2 times. The program should run, both natively
and on V, with zero misses in any caches. And, once chained, this
bb is jumping back to itself directly, so nothing else is going on.
So where the @*%£&?! are my missing cycles?
Particularly in view of the fact that the generated code is very
much in x86 RISC subset, and so should generate fewer uops/insn
than the original. For example, it contains several "movl (%esi), %esi".
Let's recap.
- There are no cache misses.
- There are no unpredictable branches.
- Each insn [in the resulting translation] needs either one
Address-Generation-Unit (leal), one Load-Store-Unit (movl) or one
ALU (add, sub, xor). So the bulk of it is unlikely to be stalling
due to a shortage of functional units.
- Perhaps stalls due to result forwarding? But the problem with
that is that (translation)
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and original
addl 0xC(%reg1,%reg2),%reg3
I'd guess require the same functional units chained in the same order
(AGU -> LSU -> ALU) ... who knows?
- Decode limitations? PIII (P6 core) has this wierd 4-1-1 rule, I don't
know if that's a problem. Surely, there are 8 groups of
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and each would fit in a 4-1-1 uop pattern?
- So perhaps there are some rare insns which completely serialise the
machine? We've got two pushfls and two popfls (partially redundant, too)
and they are pretty rare, hence might get executed out of ucode ROM.
I'm mystified. Someone with a cycle-accurate PIII model needs to run this
and see where it stalls.
Nick, what sort of time ratios do you get on your Athlonic device for
this?
[useless numbers: 25.0 million trips thru this loop.
Natively, this is 12 orig insns/iteration, = 300 million
insns. In 0.250 seconds, this is 1200 MIPS; at 1133 MHz,
that's a CPI of 0.94, which is pretty much Not Bad.
Translated, this is approx 42 insns/iteration, = 1050
million insns. In 1.56 seconds, this is 673 MIPS; at 1133 MHz,
that's a CPI of 1.68. This is significantly worse for no
apparent reason.
]
I reckon there's a large heavy object lurking in the undergrowth
here. Now we just need to find it. Anyone fancy lifting out
the translated loop, giving it a suitable context and running it
on VTune, to find out? Is oprofile able to tell us exactly
which insn is stalling?
Congratulations to anyone who bothered to read this far.
J
---------------------------------------------------------------------------
Original C code:
#include <stdio.h>
int i_fit_comfortably_in_D1[1000];
int timewaster ( void )
{
int i, j, sum;
for (i = 0; i < 1000; i++) {
i_fit_comfortably_in_D1[i] = i * 654321;
}
sum = 0;
for (j = 0; j < 200000; j++) {
for (i = 0; i < 1000; i += 8) {
sum += i_fit_comfortably_in_D1[i];
sum ^= i_fit_comfortably_in_D1[i+1];
sum -= i_fit_comfortably_in_D1[i+2];
sum += i_fit_comfortably_in_D1[i+3];
sum += i_fit_comfortably_in_D1[i+4];
sum ^= i_fit_comfortably_in_D1[i+5];
sum -= i_fit_comfortably_in_D1[i+6];
sum += i_fit_comfortably_in_D1[i+7];
}
}
return sum;
}
int main ( void )
{
printf ("the answer, you will be fascinated to know, is ... ");
fflush(stdout);
printf("0x%x\n", timewaster() );
return 0;
}
---------------------------------------------------------------------------
Original x86 code for inner loop:
0x8048508: leal 0x0(,%ecx,4), %eax
0x804850F: addl (%eax,%ebx,1),%edx
0x8048512: xorl 4(%ebx,%eax,1),%edx
0x8048516: subl 8(%ebx,%eax,1),%edx
0x804851A: addl 12(%ebx,%eax,1),%edx
0x804851E: addl 16(%ebx,%eax,1),%edx
0x8048522: xorl 20(%ebx,%eax,1),%edx
0x8048526: subl 24(%ebx,%eax,1),%edx
0x804852A: addl 28(%ebx,%eax,1),%edx
0x804852E: addl $0x8, %ecx
0x8048531: cmpl $0x3E7, %ecx
0x8048537: jle-8 0x8048508
---------------------------------------------------------------------------
Cleaned-up ucode for inner loop:
0: GETL %ECX, t2
1: MOVL $0x0, t0
2: LEA2L 0(t0,t2,4), t0
3: PUTL t0, %EAX
4: GETL %EBX, t6
6: LEA2L 0(t0,t6,1), t4
7: LDL (t4), t4
8: ADDL %EDX, t4
12: LEA2L 4(t6,t0,1), t10
13: LDL (t10), t10
14: XORL t4, t10
18: LEA2L 8(t6,t0,1), t16
19: LDL (t16), t16
21: SUBL t16, t10
25: LEA2L 12(t6,t0,1), t24
26: LDL (t24), t24
27: ADDL t10, t24
31: LEA2L 16(t6,t0,1), t30
32: LDL (t30), t30
33: ADDL t24, t30
37: LEA2L 20(t6,t0,1), t36
38: LDL (t36), t36
39: XORL t30, t36
43: LEA2L 24(t6,t0,1), t42
44: LDL (t42), t42
46: SUBL t42, t36
50: LEA2L 28(t6,t0,1), t50
51: LDL (t50), t50
52: ADDL t36, t50
53: PUTL t50, %EDX
55: ADDL $0x8, t2
56: PUTL t2, %ECX
58: SUBL $0x3E7, t2 (-wOSZACP)
59: Jleo $0x8048508 (-rOSZACP)
60: JMPo $0x8048539 ($2)
---------------------------------------------------------------------------
Generated x86 code for inner loop:
0: FF 0D 0C FD 0D 40
decl (0x400DFD0C)
6: 75 06
jnz-8 %eip+6
8: BD 1D 00 00 00
movl $0x1D, %ebp
13: C3
ret
0: GETL %ECX, %eax [a-----]
14: 8B 45 04
movl 0x4(%ebp), %eax
1: MOVL $0x0, %ebx [ab----]
17: 31 DB
xorl %ebx, %ebx
2: LEA2L 0(%ebx,%eax,4), %ebx [ab----]
19: 8D 5C 83 00
leal 0x0(%ebx,%eax,4), %ebx
3: PUTL %ebx, %EAX [ab----]
23: 89 5D 00
movl %ebx, 0x0(%ebp)
4: GETL %EBX, %ecx [abc---]
26: 8B 4D 0C
movl 0xC(%ebp), %ecx
5: LEA2L 0(%ebx,%ecx,1), %edx [abcd--]
29: 8D 54 0B 00
leal 0x0(%ebx,%ecx,1), %edx
6: LDL (%edx), %edx [abcd--]
33: 8B 12
movl (%edx), %edx
7: ADDL %EDX, %edx [abcd--]
35: 03 55 08
addl 0x8(%ebp), %edx
8: LEA2L 4(%ecx,%ebx,1), %esi [abcdS-]
38: 8D 74 19 04
leal 0x4(%ecx,%ebx,1), %esi
9: LDL (%esi), %esi [abcdS-]
42: 8B 36
movl (%esi), %esi
10: XORL %edx, %esi [abc-S-]
44: 31 D6
xorl %edx, %esi
11: LEA2L 8(%ecx,%ebx,1), %edi [abc-SD]
46: 8D 7C 19 08
leal 0x8(%ecx,%ebx,1), %edi
12: LDL (%edi), %edi [abc-SD]
50: 8B 3F
movl (%edi), %edi
13: SUBL %edi, %esi [abc-S-]
52: 29 FE
subl %edi, %esi
14: LEA2L 12(%ecx,%ebx,1), %edx [abcdS-]
54: 8D 54 19 0C
leal 0xC(%ecx,%ebx,1), %edx
15: LDL (%edx), %edx [abcdS-]
58: 8B 12
movl (%edx), %edx
16: ADDL %esi, %edx [abcd--]
60: 01 F2
addl %esi, %edx
17: LEA2L 16(%ecx,%ebx,1), %esi [abcdS-]
62: 8D 74 19 10
leal 0x10(%ecx,%ebx,1), %esi
18: LDL (%esi), %esi [abcdS-]
66: 8B 36
movl (%esi), %esi
19: ADDL %edx, %esi [abc-S-]
68: 01 D6
addl %edx, %esi
20: LEA2L 20(%ecx,%ebx,1), %edx [abcdS-]
70: 8D 54 19 14
leal 0x14(%ecx,%ebx,1), %edx
21: LDL (%edx), %edx [abcdS-]
74: 8B 12
movl (%edx), %edx
22: XORL %esi, %edx [abcd--]
76: 31 F2
xorl %esi, %edx
23: LEA2L 24(%ecx,%ebx,1), %esi [abcdS-]
78: 8D 74 19 18
leal 0x18(%ecx,%ebx,1), %esi
24: LDL (%esi), %esi [abcdS-]
82: 8B 36
movl (%esi), %esi
25: SUBL %esi, %edx [abcd--]
84: 29 F2
subl %esi, %edx
26: LEA2L 28(%ecx,%ebx,1), %esi [a--dS-]
86: 8D 74 19 1C
leal 0x1C(%ecx,%ebx,1), %esi
27: LDL (%esi), %esi [a--dS-]
90: 8B 36
movl (%esi), %esi
28: ADDL %edx, %esi [a---S-]
92: 01 D6
addl %edx, %esi
29: PUTL %esi, %EDX [a-----]
94: 89 75 08
movl %esi, 0x8(%ebp)
30: ADDL $0x8, %eax [a-----]
97: 83 C0 08
addl $0x8, %eax
31: PUTL %eax, %ECX [a-----]
100: 89 45 04
movl %eax, 0x4(%ebp)
32: SUBL $0x3E7, %eax (-wOSZACP) [------]
103: FF 75 20 9D
pushl 32(%ebp) ; popfl
107: 81 E8 E7 03 00 00
subl $0x3E7, %eax
113: 9C 8F 45 20
pushfl ; popl 32(%ebp)
33: Jleo $0x8048508 (-rOSZACP) [------]
117: FF 75 20 9D
pushl 32(%ebp) ; popfl
121: 7F 0D
jnle-8 %eip+13
123: B8 08 85 04 08
movl $0x8048508, %eax
128: 89 45 24
movl %eax, 0x24(%ebp)
131: 0F 0B 0F 0B 90
ud2; ud2; nop
34: JMPo $0x8048539 ($2) [------]
136: B8 39 85 04 08
movl $0x8048539, %eax
141: 89 45 24
movl %eax, 0x24(%ebp)
144: 0F 0B 0F 0B 90
ud2; ud2; nop
---------------------------------------------------------------------------
Generated x86 code for inner loop, with crud removed, and bb start points
distinguished by a blank line.
decl (0x400DFD0C)
jnz-8 %eip+6
movl $0x1D, %ebp
ret
movl 0x4(%ebp), %eax
xorl %ebx, %ebx
leal 0x0(%ebx,%eax,4), %ebx
movl %ebx, 0x0(%ebp)
movl 0xC(%ebp), %ecx
leal 0x0(%ebx,%ecx,1), %edx
movl (%edx), %edx
addl 0x8(%ebp), %edx
leal 0x4(%ecx,%ebx,1), %esi
movl (%esi), %esi
xorl %edx, %esi
leal 0x8(%ecx,%ebx,1), %edi
movl (%edi), %edi
subl %edi, %esi
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
leal 0x14(%ecx,%ebx,1), %edx
movl (%edx), %edx
xorl %esi, %edx
leal 0x18(%ecx,%ebx,1), %esi
movl (%esi), %esi
subl %esi, %edx
leal 0x1C(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
movl %esi, 0x8(%ebp)
addl $0x8, %eax
movl %eax, 0x4(%ebp)
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
movl $0x8048508, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
movl $0x8048539, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
|