From: Nicholas N. <nj...@ca...> - 2002-11-21 09:21:26
On Thu, 21 Nov 2002, Julian Seward wrote:
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although I haven't verified that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
>
> #include <stdio.h>
>
> int main ( int argc, char** argv )
> {
>    int i;
>    for (i = 0; i < 1000 * 1000 * 1000; i++) {
>       asm volatile ("pushfl ; popfl");
>       asm volatile ("pushfl ; popfl");
>    }
>    return 0;
> }
Here's what Rabbit reports for my 1400 MHz Athlon:
Event                                              Events       Events/sec
---------------------------------------- ---------------- ----------------
0x40 64 L1_data_cache_access 1675390681 365358668.02
0x41 65 L1_data_cache_miss 224993 49065.06
0x42 66 L1_data_cache_refill_from_L2 222561 48534.70
0x43 67 L1_data_cache_refill_from_syst 127562 27817.92
0x44 68 L1_data_cache_writeback 227126 49681.16
0x45 69 L1_DTLB_miss_and_L2_DTLB_hit 101621 22228.41
0x46 70 L1_and_L2_DTLB_miss 70174 15349.74
0x47 71 misaligned_data_references 15663 3426.10
0x80 128 L1_instr_cache_fetch 341826476 74679679.25
0x81 129 L1_instr_cache_miss 312554 68284.45
0x84 132 L1_ITLB_miss_and_L2_ITLB_hit 194970 42595.58
0x85 133 L1_and_L2_ITLB_miss 34231 7478.53
0xc0 192 retired_instructions 1514793469 330718191.35
0xc1 193 retired_ops 10899888249 2379724630.07
0xc6 198 retired_far_control_transfers 6116 1335.28
0xc7 199 retired_resync_branches 3630 792.52
0xc2 194 retired_branches 335619550 73423023.82
0xc3 195 retired_branches_mispredicted 743534 162661.90
0xc4 196 retired_taken_branches 334559598 73191139.59
0xc5 197 retired_taken_branches_mispred 689051 150742.73
0xcd 205 interrupts_masked_cycles 12881304 2818557.60
0xce 206 interrupts_masked_while_pendin 1376 301.08
0xcf 207 taken_hardware_interrupts 520 113.78
0x00 0 Mcycles 6419847560 1404726582.63
resource usage:
time = 27.43 sec user, 0.01 sec sys, 27.46 sec real, 99.91% of cpu
page reclaims, faults = 12, 76
Points to note:
- (instrs : ops) ratio = 1 : 7.2
- CPI = 4.24
- CPop = 0.59
The Athlon optimisation docs say that pushf/popf are VectorPath
instructions with a best-case execution latency of 4 cycles.
(DirectPath instructions are decoded into 1 or 2 "MacroOPs". VectorPath
instructions are decoded into 3+ MacroOPs using some on-chip ROM.
Decoding a VectorPath instruction can block decoding of a DirectPath
instruction.)
Really basic ops (eg. add, inc on registers) are DirectPath and take >= 1
cycle.
Interestingly, all the normal register push/pop instructions are
also VectorPath and take >= 4 cycles. But when I replaced the inner loop
in the above program with
asm volatile ("pushl %esi ; popl %edi");
asm volatile ("pushl %edi ; popl %esi");
it executes in 5.28 seconds, with:
- (instrs : ops) ratio = 1 : 1.2
- CPI = 0.82
- CPop = 0.67
so the implementation of pushf/popf clearly sucks.
N