From: Julian S. <js...@ac...> - 2002-11-21 08:43:19
On Thursday 21 November 2002 1:34 am, Jeremy Fitzhardinge wrote:
> On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> > This is the first of two messages about t-chaining. Get a stiff
> > whisky before reading further; you'll need it.
> >
> > Using the attached prog, it's easy to show that on my P3, each
> > (pushf ; popf) pair takes 24 cycles. I assume that's 12 each,
> > although I haven't verified that. Considering that the P6 pipe is alleged
> > to be about 10 stages, I strongly suspect they both cause a pipe
> > flush. Ie, each is as bad as a mispredicted jump.
>
> Yes, the P3 optimisation guide says that pushf/popf are complex
> instructions (ie, microcoded). They don't stall the pipeline per-se,
> but they limit the decode rate and serialise stuff which needn't be
> serialised.
True. Nevertheless, if they are merely vector-decode insns and
don't flush the pipe, they are darned expensive. Perhaps there is
a constant set-up cost of a few cycles for a vector-decode insn?
Does the P3 opt guide say anything about that? I have the P4 one
to hand and read that, but it's pretty unhelpful for the P3 :)
> Given gems such as:
>
> pushl 32(%ebp)
> popfl
> subl $0x3E7, %eax
> pushfl
> popl 32(%ebp)
> pushl 32(%ebp)
> popfl
> jnle-8 %eip+13
This really is a lemon, isn't it. One good thing is that (a) lazy save/
restore of the flags will nuke 4 of the 6 {push,pop}fl ops, and (b)
this is a very common idiom, arising from the original
"do-alu-op-and-set-flags ; conditional jump".
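To make the "4 of the 6" count concrete, here is a rough sketch of one way a lazy scheme could drop the redundant flag moves in that sequence. It is only an illustration, not Valgrind's actual code: the (text, reads_flags, writes_flags) tuples, the LOAD/SAVE names and drop_redundant_flag_moves are all made up for this example, and it assumes the real %eflags need not be valid at the end of the block because the simulated copy at 32(%ebp) is the one that counts.

```python
# Hypothetical model, not Valgrind's implementation.
# Each op is (text, reads_flags, writes_flags) with respect to the REAL %eflags.
LOAD = ("pushl 32(%ebp) ; popfl", False, True)   # simulated flags -> real %eflags
SAVE = ("pushfl ; popl 32(%ebp)", True, False)   # real %eflags -> simulated flags

def drop_redundant_flag_moves(ops):
    # Backward pass: a LOAD is dead if the real flags are overwritten
    # before anything reads them.
    live = False          # assumption: real flags are not needed after the block
    dead = set()
    for i in range(len(ops) - 1, -1, -1):
        text, reads, writes = ops[i]
        if ops[i] == LOAD and not live:
            dead.add(i)
        live = reads or (live and not writes)

    # Forward pass: a LOAD is also redundant if the real flags already
    # match the simulated copy (nothing wrote them since the last SAVE/LOAD).
    in_sync = False
    kept = []
    for i, op in enumerate(ops):
        if i in dead:
            continue
        if op == LOAD:
            if in_sync:
                continue
            in_sync = True
        elif op == SAVE:
            in_sync = True
        elif op[2]:       # any other flag-writing insn makes the real flags newer
            in_sync = False
        kept.append(op)
    return kept

block = [
    LOAD,
    ("subl $0x3E7, %eax", False, True),   # sets all flags, reads none
    SAVE,
    LOAD,
    ("jnle-8 %eip+13", True, False),      # conditional jump reads flags
]
result = drop_redundant_flag_moves(block)
for text, _, _ in result:
    print(text)
```

Under those assumptions both loads go away (the first is overwritten by the subl before anything reads it, the second reloads what is already in %eflags), which leaves only the pushfl ; popl pair: 2 of the 6 push/pop ops survive.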
> what was the problem with lazy save/restore of the flags again?
Only really my paranoia about getting it right. Clearly you haven't
spent enough hours chasing weird bugs to do with state leakage from
the real machine to the simulated one :)
> Still, is this really enough to slow down the whole block? What about
> AGI stalls?
Well, that's a good question. Firstly, I don't know the rules on P3 for
AGI stalls, but looking at the main part of it, 8 copies of this:

    leal 0x10(%ecx,%ebx,1), %esi
    movl (%esi), %esi
    addl %edx, %esi

an AGU is only needed by the leal (and the movl?), but at most 2 x every 3
insns, so I reckon there's enough slack there to run these groups at a CPI
of 1.0 or better.
Let's re-consider the numbers. The loop runs 25.0 million times in 1.56
seconds, which is 16.02 million/second. At 1133 MHz, that's 70.7 cycles
per iteration, give or take. And the loop is about 42 insns long.
Also, there are 6 pushf/popfs ("*fs") in the loop.
In the worst case, those *f would occupy 6 * 12 cycles (12 as from my
mini-experiment last night), = 72. Clearly this isn't possible since
the loop only takes 71 cycles and contains 36 other insns too.
However, even if we conservatively assume that each *f insn takes half
what I measured, ie 6 cycles, that leaves 35 cycles for the remaining
36 insns, which is a CPI of about 0.97, close to the CPI of the original
loop. That sounds more realistic. So I'd say it's not implausible to
argue that those 6 *f insns consume half the entire running time.
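For anyone who wants to re-check the arithmetic above, here it is as a small script. The figures are the ones quoted in this message; the variable names and the remaining_cpi helper are just labels chosen for this sketch.

```python
# Figures quoted in the message above.
clock_hz = 1133e6      # P3 clock
iterations = 25.0e6    # loop iterations
seconds = 1.56         # measured run time
loop_insns = 42        # approximate loop length
flag_insns = 6         # pushf/popf instructions in the loop

cycles_per_iter = clock_hz * seconds / iterations

def remaining_cpi(cycles_per_flag_insn):
    """CPI left for the non-flag insns if each *f insn costs this many cycles."""
    left = cycles_per_iter - flag_insns * cycles_per_flag_insn
    return left / (loop_insns - flag_insns)

share_at_6 = flag_insns * 6 / cycles_per_iter

print(f"cycles per iteration: {cycles_per_iter:.1f}")
print(f"CPI of the rest at 12 cycles per *f: {remaining_cpi(12):.2f}")
print(f"CPI of the rest at  6 cycles per *f: {remaining_cpi(6):.2f}")
print(f"share of time in *f at 6 cycles each: {share_at_6:.0%}")
```

At 12 cycles per *f the remaining CPI comes out negative, which is the "clearly this isn't possible" case. At 6 cycles it lands a little under 1.0 (the exact value shifts slightly depending on whether 70.7 is rounded up to 71), and the *f insns account for roughly half of each iteration.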
Proposal in the next message.
J