|
From: Julian S. <js...@ac...> - 2002-11-21 00:08:20
|
This is the third of two messages about t-chaining. Get a stiff
whisky before reading further; you'll need it.
Using the attached prog, it's easy to show that on my P3, each
(pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
although I haven't verified that. Considering that the P6 pipe is alleged
to be about 10 stages, I strongly suspect they both cause a pipe
flush. Ie, each is as bad as a mispredicted jump.
#include <rude_words.h>
This is bad news. Or a big opportunity to do better.
I have to go to bed now, but from a mental back-of-the-envelope
calculation I'd guess this easily explains the cycle loss from
the inner loop of the previous message.
J
#include <stdio.h>
int main ( int argc, char** argv )
{
int i;
for (i = 0; i < 1000 * 1000 * 1000; i++) {
asm volatile ("pushfl ; popfl");
asm volatile ("pushfl ; popfl");
}
return 0;
}
|
|
From: Jeremy F. <je...@go...> - 2002-11-21 01:34:09
|
On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> This is the third of two messages about t-chaining. Get a stiff
> whisky before reading further; you'll need it.
>
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although I haven't verified that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
Yes, the P3 optimisation guide says that pushf/popf are complex
instructions (ie, microcoded). They don't stall the pipeline per-se,
but they limit the decode rate and serialise stuff which needn't be
serialised.
Given gems such as:
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
what was the problem with lazy save/restore of the flags again?
Still, is this really enough to slow down the whole block? What about
AGI stalls?
J
|
|
From: Julian S. <js...@ac...> - 2002-11-21 08:43:19
|
On Thursday 21 November 2002 1:34 am, Jeremy Fitzhardinge wrote:
> On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> > This is the third of two messages about t-chaining. Get a stiff
> > whisky before reading further; you'll need it.
> >
> > Using the attached prog, it's easy to show that on my P3, each
> > (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> > although I haven't verified that. Considering that the P6 pipe is alleged
> > to be about 10 stages, I strongly suspect they both cause a pipe
> > flush. Ie, each is as bad as a mispredicted jump.
>
> Yes, the P3 optimisation guide says that pushf/popf are complex
> instructions (ie, microcoded). They don't stall the pipeline per-se,
> but they limit the decode rate and serialise stuff which needn't be
> serialised.
True. Nevertheless, if they are merely vector-decode insns and
don't flush the pipe, they are darned expensive. Perhaps there is
a constant set-up cost of a few cycles for a vector-decode insn?
Does the P3 opt guide say anything about that? I have the P4 one
to hand and have read it, but it's pretty unhelpful for the P3 :)
> Given gems such as:
>
> pushl 32(%ebp)
> popfl
> subl $0x3E7, %eax
> pushfl
> popl 32(%ebp)
> pushl 32(%ebp)
> popfl
> jnle-8 %eip+13
This really is a lemon, isn't it. The good news is that (a) lazy save/
restore of the flags will nuke 4 of the 6 {push,pop}fl ops, and (b)
this is a very common idiom, arising from the original
"do-alu-op-and-set-flags ; conditional jump".
> what was the problem with lazy save/restore of the flags again?
Only really my paranoia about getting it right. Clearly you haven't
spent enough hours chasing weird bugs to do with state leakage from
the real machine to the simulated one :)
> Still, is this really enough to slow down the whole block? What about
> AGI stalls?
Well, that's a good question. Firstly, I don't know the rules on P3 for
AGI stalls, but looking at the main part of the block, ie 8 copies of this:
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
an AGU is only needed by the leal (and the movl?), ie by at most 2 in
every 3 insns, so I reckon there's enough slack there to run these groups at a CPI
of 1.0 or better.
Let's re-consider the numbers. The loop runs 25.0 million times in 1.56
seconds, which is about 16.0 million/second. At 1133 MHz, that's 70.7 cycles
per iteration, give or take. And the loop is about 42 insns long.
Also, there are 6 pushf/popfs ("*fs") in the loop.
In the worst case, those *f insns would occupy 6 * 12 = 72 cycles (12
being roughly what my mini-experiment last night suggested). Clearly
this isn't possible, since the loop only takes 71 cycles and contains
36 other insns too.
However, even if we conservatively assume that each *f insn takes half
what I measured, ie 6 cycles, that leaves 35 cycles for the remaining
36 insns, which is a CPI of about 0.97, close to the CPI of the original
loop. That sounds more realistic. So I'd say it's not implausible to
argue that those 6 *f insns consume half the entire running time.
Proposal in the next message.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 09:21:26
|
On Thu, 21 Nov 2002, Julian Seward wrote:
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although verifying that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
>
> #include <stdio.h>
>
> int main ( int argc, char** argv )
> {
> int i;
> for (i = 0; i < 1000 * 1000 * 1000; i++) {
> asm volatile ("pushfl ; popfl");
> asm volatile ("pushfl ; popfl");
> }
> return 0;
> }
Here's what Rabbit reports for my 1400 MHz Athlon:
Event                                              Events       Events/sec
---------------------------------------- ---------------- ----------------
0x40 64 L1_data_cache_access 1675390681 365358668.02
0x41 65 L1_data_cache_miss 224993 49065.06
0x42 66 L1_data_cache_refill_from_L2 222561 48534.70
0x43 67 L1_data_cache_refill_from_syst 127562 27817.92
0x44 68 L1_data_cache_writeback 227126 49681.16
0x45 69 L1_DTLB_miss_and_L2_DTLB_hit 101621 22228.41
0x46 70 L1_and_L2_DTLB_miss 70174 15349.74
0x47 71 misaligned_data_references 15663 3426.10
0x80 128 L1_instr_cache_fetch 341826476 74679679.25
0x81 129 L1_instr_cache_miss 312554 68284.45
0x84 132 L1_ITLB_miss_and_L2_ITLB_hit 194970 42595.58
0x85 133 L1_and_L2_ITLB_miss 34231 7478.53
0xc0 192 retired_instructions 1514793469 330718191.35
0xc1 193 retired_ops 10899888249 2379724630.07
0xc6 198 retired_far_control_transfers 6116 1335.28
0xc7 199 retired_resync_branches 3630 792.52
0xc2 194 retired_branches 335619550 73423023.82
0xc3 195 retired_branches_mispredicted 743534 162661.90
0xc4 196 retired_taken_branches 334559598 73191139.59
0xc5 197 retired_taken_branches_mispred 689051 150742.73
0xcd 205 interrupts_masked_cycles 12881304 2818557.60
0xce 206 interrupts_masked_while_pendin 1376 301.08
0xcf 207 taken_hardware_interrupts 520 113.78
0x00 0 Mcycles 6419847560 1404726582.63
resource usage:
time = 27.43 sec user, 0.01 sec sys, 27.46 sec real, 99.91% of cpu
page reclaims, faults = 12, 76
Points to note:
- (instrs : ops) ratio = 1 : 7.2
- CPI = 4.24
- CPop = 0.59
The Athlon optimisation docs say that pushf/popf are VectorPath
instructions with a best-case execution latency of 4 cycles.
(DirectPath instructions are decoded into 1 or 2 "MacroOPs". VectorPath
instructions are decoded into 3+ MacroOPs using some on-chip ROM.
Decoding a VectorPath instruction can block decoding of a DirectPath
instruction.)
Really basic ops (eg. add, inc on registers) are DirectPath and take >= 1
cycle.
Interestingly, all the normal register push/pop instructions are
also VectorPath and take >= 4 cycles. But when I replaced the inner loop
in the above program with
asm volatile ("pushl %esi ; popl %edi");
asm volatile ("pushl %edi ; popl %esi");
it executes in 5.28 seconds, giving:
- (instrs : ops) ratio = 1 : 1.2
- CPI = 0.82
- CPop = 0.67
so the implementation of pushf/popf clearly sucks.
N
|