From: Julian S. <js...@ac...> - 2002-11-21 23:26:07
> > > A question: What's the deal on the LRU mechanism when
> > > chaining is going, now?
> >
> > I did notice the problem and made a small attempt to address it:
> > patch_me updates the epoch field in the same way as the dispatch loop.
> > It doesn't actually help much, but it looks like I paid attention to it.
> >
> > I was thinking of a scheme like an OS VM scheme: have a loop go through
> > the TTEs at some low rate and unchaining blocks; if on the 2nd pass the
> > block is still unchained and still hasn't updated the LRU entry, then
> > consider it old.

Ha, that sounds like it might work well enough.  I sometimes wondered
how OSs implement the pseudo-LRU stuff for VMs.  Is there a standard
description of this, that you know of?

> It might be worth re-mentioning the approach used by Dynamo, Walkabout,
> Strata and seemingly all the other dynamic translators for handling this
> situation -- they just flush the code cache.
>
> But they have smaller translations and smaller code caches, and spend less
> time on translations.  And I think Julian's LRU mechanism is extremely
> fast.  So maybe just ignore this.

Yes, that is the simplest thing, but I fear it would be very expensive.

A less radical similar scheme is to randomly discard some fraction of the
translations.  Then you get into schemes where you randomly move some
fraction of the translations into a "victimisation cache" (in hardware
speak).  If they turn out to be needed again soon, they can be moved back
from the V-cache into the main set of active translations.  If the
V-cache gets full _it_ then is flushed.  This scheme somewhat mitigates
the situation where a translation is randomly selected for demotion and
then used again soon afterwards.  But it's all a bit complex.

J
From: Jeremy F. <je...@go...> - 2002-11-21 17:42:28
On Thu, 2002-11-21 at 01:08, Julian Seward wrote:
> It's clear we need to do something to fix this. This is a proposal
> with two variants; a less ambitious variant (first) and a more ambitious
> one (second).
>
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> which tell you whether or not a uinstr (more correctly, the code
> generated for that uinstr) **might** (1) alter the real machine's
> %eflags, (2) alter the real machine's FPU state, and (3) require
> to see the %EIP of the simulated machine.
>
> Getting a safe-if-conservative result for (1) and (2) is critical
> for translation correctness. For (3) it just affects the accuracy
> of stack traces. So the safe thing to return is True for all functions,
> yet we want to return False as often as we safely can. Classical
> compiler-analysis conservative-estimate stuff.
>
> Further, for all skins which extend ucode, we add suitable functions
> to the core/skin iface so that the same questions can be asked of the
> extended ucodes.
>
> With fn (3), the redundant INCEIP removal phase can be reinstated
> for all skins, and should work correctly. The reason it's commented
> out is precisely because at present it can't reliably ask this question.
>
> Fn (2) would be used to formally and cleanly support Jeremy's lazy FPU
> state save/restore, which I think is rolled into the code generation
> loop. I would like to be able to ship this optimisation with confidence
> that I understand the ramifications.
>
> Fn (1) would dually be used to support lazy %EFLAGS save/restore in the
> same manner as the FPU.
>
> None of this would be hard to implement, and they would support a useful
> bunch of optimisations.
>
>
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create two new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
>
> We do the lazy FPU and EFLAGS save/restore optimisation exactly as
> above.
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free.
Only when using the dispatch loop. When using t-chaining, I currently
manually update EIP to the target EIP just before each chained jump. I
haven't measured how much this costs (because removing it completely
breaks context switches and makes things crash), but it would be nice to
get rid of it. It would need a separate eip->EIP translation table for
the start of basic blocks, then a second process to find the offset
within a basic block.  If we don't do it too often, then simply
scanning the TTE table would be a reasonable way of doing the mapping
without extra data structures.
> Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip. Using the block-start %EIP we can look in the translation
> table (TT) to find info about this block: start address of its translation,
> and presumably whereabouts its %eip->%EIP mapping table is.
>
> Knowing where the translation starts (%eip for the start of the
> translated block) and knowing the current %eip, we can figure out how
> far inside the translation we are. That offset can then be used to
> index (somehow) the mapping table for this block, to give the corresponding
> %EIP offset, which, when added to the block-start %EIP, gives the current
> %EIP.
>
> Makes sense?
Yes. The interesting question is how far to go. If we want the ideal
of never having generated code update EIP, then everywhere where we use
%EIP we'd either need to use %eip or convert %eip to %EIP.
Since we already have a VG_(get_EIP) function, making that the clearing
house for all eip->EIP conversions should be pretty easy.
The interesting question is representing eip and EIP in the ThreadState
table. Unless we want to do the mapping on every context switch, the
ThreadStates would need to hold eip and compute EIP on demand. Of
course, since the eips can be invalidated when the translation cache is
cleaned up, we'd need to map all the eips to EIPs. Then, when
scheduling a new thread, we need to start it running at %eip (if valid)
or else %EIP.
Then there's the tricky case of the dispatch loop working in terms of
EIP (as it has to, given it's the first person to see code that's never
been seen before) and everything else working in eip. Have to be sure
to get all those transitions right.
Maybe we should at least keep EIP up to date at the basic block level,
so we only need to deal with the intra-BB case.
> So the only question is how to compactly encode the mapping table, and
> perhaps where to store them, but that's not a big deal.
I was thinking of a table with one entry per original instruction, which
contains the length of the original instruction and the length of the
corresponding generated code. It may be possible to pack this into 2
nibbles per instruction, with an escape mechanism to deal with rare
cases (ie, if length == 0xF then get the details from the following
byte). That could be built as the code is generated then tacked onto
the end of the generated code (the TTE would contain the generated code
length plus the length of the whole translation, including the
instruction mapping table). That way doing the mapping is a simple
linear scan:
    cur_eip = tte->trans_addr;
    cur_EIP = tte->orig_addr;
    entry = 0;
    while (eip > cur_eip) {
        cur_eip += tab[entry].trans_len;
        cur_EIP += tab[entry].orig_len;
        entry++;
    }
> Comments?
>
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
Dropping EIP updates is a big win. Bigger than chaining indirect jumps,
according to some initial measurements (though I haven't tested anything
which I'd expect to have lots of indirect jumps).
J
From: Nicholas N. <nj...@ca...> - 2002-11-21 15:43:27
|
Hi,
I started having a go at the INCEIP table lookup thing, but I got stuck...
On Thu, 21 Nov 2002, Julian Seward wrote:
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free. Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip.
How do we know what the current %eip is?
Consider an example. When a 'value' error occurs in the middle of a
BB under MemCheck:
gen code calls the asm routine MC_(helper_value_check4_fail)
which calls the C function MC_(helperc_value_check4_fail)()
which calls MC_(record_value_error)()
which eventually calls VG_(maybe_record_error)()
which eventually calls VG_(construct_error)()
which (as tst==NULL) records VG_(baseBlock)[VGOFF_(m_eip)]
If we remove the INCEIPs, VG_(baseBlock)[VGOFF_(m_eip)] will be out of
date -- it will point to the start of the original BB rather than the
exact problem instruction.
So how do we find out %eip in order to look up the snazzy new eip-->EIP
table? Well, it's buried deep down in the stack as the return address
(minus one instruction) for MC_(helper_value_check4_fail). But that's not
a good way to find it out.
Alternatively, during translation, the original %eip is known for each
original instruction and so it could be passed as an arg to
MC_(helper_value_check4_fail). But that would bloat code for TESTV (would
need to push the arg and then clear it after the call) and it doesn't feel
nice. But maybe it has to be done this way?
Or am I missing something obvious here?
N
From: Nicholas N. <nj...@ca...> - 2002-11-21 10:26:01
On Thu, 21 Nov 2002, Julian Seward wrote:
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> Fn (1) would dually be used to support lazy %EFLAG save restore in the
> same manner as the FPU.
Can I just confirm my understanding of the lazy %EFLAG optimisation?
Currently, AIUI, a lot of basic blocks (pretty much all those that end in
a conditional jump) end like this:
37: ANDL %eax, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %eax, %ebx
pushfl ; popl 32(%ebp) ***
38: INCEIPo $2
addl $2, 36(%ebp)
39: Jzo $0x402047AC (-rOSZACP)
pushl 32(%ebp) ; popfl ***
jnz-8 %eip+6
movl $0x402047AC, %eax
ret
If-and-only-if the INCEIP addl is removed, then the 4 instructions in the
two lines marked *** can be trivially removed.
Looking through --skin=none code, the only place where this seems
applicable is for conditional jumps -- arithmetic ops in the middle of
basic blocks don't look like they can have their %EFLAGs fiddling avoided.
One downer: I'm not sure if this will work for MemCheck. Consider the
instrumented code in the same situation:
30: ANDL %edx, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %edx, %ebx
pushfl ; popl 32(%ebp)
31: INCEIPo $2
addl $2, 104(%ebp)
32: GETVFo %eax
movl 0x44(%ebp), %eax
33: TESTVo %eax
testb $0x1, %al
jz-8 %eip+3
call * 76(%ebp)
34: Jzo $0x4000D2C4 (-rOSZACP)
pushl 32(%ebp) ; popfl
jnz-8 %eip+6
movl $0x4000D2C4, %eax
ret
Ignoring the INCEIP, the arithmetic op and the conditional jump are no
longer adjacent.
Here's the equivalent code for Cachegrind:
9: ANDL %edi, %edx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %edi, %edx
pushfl ; popl 32(%ebp)
10: MOVL $0x40269DC0, %eax
movl $0x40269DC0, %eax
11: CCALLo 0x4001C170(%eax)
call * 36(%ebp)
12: INCEIPo $2
addl $2, 60(%ebp)
13: MOVL $0x40269DE0, %eax
movl $0x40269DE0, %eax
14: CCALLo 0x4001C170(%eax)
call * 36(%ebp)
15: Jnzo $0x4000D190 (-rOSZACP)
pushl 32(%ebp) ; popfl
jz-8 %eip+6
movl $0x4000D190, %eax
ret
Again the optimisation is foiled.
It would work for AddrCheck and Helgrind though.
It seems that we would be getting into a situation where the
instrumentation you add could affect what optimisations can occur; in one
way this is reasonable but in another way it kind of sucks.
Or have I misunderstood something?
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
> [...]
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
I support this. The complexity will be localised pretty well, which is
good.
BTW, Am I right in thinking that the INCEIP instruction will stay in
UCode, but that it will just be removed during the codegen phase?
Because skins (eg. Cachegrind) use it as a way of determining the
boundaries of the original x86 instructions.
N
From: Nicholas N. <nj...@ca...> - 2002-11-21 10:03:28
On 20 Nov 2002, Jeremy Fitzhardinge wrote:
> > A question: What's the deal on the LRU mechanism when
> > chaining is going, now?
>
> I did notice the problem and made a small attempt to address it:
> patch_me updates the epoch field in the same way as the dispatch loop.
> It doesn't actually help much, but it looks like I paid attention to it.
>
> I was thinking of a scheme like an OS VM scheme: have a loop go through
> the TTEs at some low rate and unchaining blocks; if on the 2nd pass the
> block is still unchained and still hasn't updated the LRU entry, then
> consider it old.

It might be worth re-mentioning the approach used by Dynamo, Walkabout,
Strata and seemingly all the other dynamic translators for handling this
situation -- they just flush the code cache.

But they have smaller translations and smaller code caches, and spend
less time on translations.  And I think Julian's LRU mechanism is
extremely fast.  So maybe just ignore this.

N
From: Nicholas N. <nj...@ca...> - 2002-11-21 09:21:26
On Thu, 21 Nov 2002, Julian Seward wrote:
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although verifying that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
>
> #include <stdio.h>
>
> int main ( int argc, char** argv )
> {
> int i;
> for (i = 0; i < 1000 * 1000 * 1000; i++) {
> asm volatile ("pushfl ; popfl");
> asm volatile ("pushfl ; popfl");
> }
> return 0;
> }
Here's what Rabbit reports for my 1400 MHz Athlon:
Event                                              Events       Events/sec
---------------------------------------- ---------------- ----------------
0x40 64 L1_data_cache_access 1675390681 365358668.02
0x41 65 L1_data_cache_miss 224993 49065.06
0x42 66 L1_data_cache_refill_from_L2 222561 48534.70
0x43 67 L1_data_cache_refill_from_syst 127562 27817.92
0x44 68 L1_data_cache_writeback 227126 49681.16
0x45 69 L1_DTLB_miss_and_L2_DTLB_hit 101621 22228.41
0x46 70 L1_and_L2_DTLB_miss 70174 15349.74
0x47 71 misaligned_data_references 15663 3426.10
0x80 128 L1_instr_cache_fetch 341826476 74679679.25
0x81 129 L1_instr_cache_miss 312554 68284.45
0x84 132 L1_ITLB_miss_and_L2_ITLB_hit 194970 42595.58
0x85 133 L1_and_L2_ITLB_miss 34231 7478.53
0xc0 192 retired_instructions 1514793469 330718191.35
0xc1 193 retired_ops 10899888249 2379724630.07
0xc6 198 retired_far_control_transfers 6116 1335.28
0xc7 199 retired_resync_branches 3630 792.52
0xc2 194 retired_branches 335619550 73423023.82
0xc3 195 retired_branches_mispredicted 743534 162661.90
0xc4 196 retired_taken_branches 334559598 73191139.59
0xc5 197 retired_taken_branches_mispred 689051 150742.73
0xcd 205 interrupts_masked_cycles 12881304 2818557.60
0xce 206 interrupts_masked_while_pendin 1376 301.08
0xcf 207 taken_hardware_interrupts 520 113.78
0x00 0 Mcycles 6419847560 1404726582.63
resource usage:
time = 27.43 sec user, 0.01 sec sys, 27.46 sec real, 99.91% of cpu
page reclaims, faults = 12, 76
Points to note:
- (instrs : ops) ratio = 1 : 7.2
- CPI = 4.24
- CPop = 0.59
The Athlon optimisation docs say that pushf/popf are VectorPath
instructions with a best-case execution latency of 4 cycles.
(DirectPath instructions are decoded into 1 or 2 "MacroOPs". VectorPath
instructions are decoded into 3+ MacroOPs using some on-chip ROM.
Decoding a VectorPath instruction can block decoding of a DirectPath
instruction.)
Really basic ops (eg. add, inc on registers) are DirectPath and take >= 1
cycle.
Interestingly, all the normal register push/pop instructions are
also VectorPath and take >= 4 cycles. But when I replaced the inner loop
in the above program with
asm volatile ("pushl %esi ; popl %edi");
asm volatile ("pushl %edi ; popl %esi");
it executes in 5.28 seconds with CPI=0.82 and CPop=0.6
- (instrs : ops) ratio = 1 : 1.2
- CPI = 0.82
- CPop = 0.67
so the implementation of pushf/popf clearly sucks.
N
From: Julian S. <js...@ac...> - 2002-11-21 09:01:36
It's clear we need to do something to fix this.  This is a proposal
with two variants; a less ambitious variant (first) and a more
ambitious one (second).

Variant 1 (less ambitious)
~~~~~~~~~~~~~~~~~~~~~~~~~~

We create three new functions, all of them predicates on UInstrs:

   1  Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
   2  Bool uinstr_maybe_trashes_realFPU ( UInstr* );
   3  Bool uinstr_maybe_reads_simdEIP ( UInstr* );

which tell you whether or not a uinstr (more correctly, the code
generated for that uinstr) **might** (1) alter the real machine's
%eflags, (2) alter the real machine's FPU state, and (3) require
to see the %EIP of the simulated machine.

Getting a safe-if-conservative result for (1) and (2) is critical
for translation correctness.  For (3) it just affects the accuracy
of stack traces.  So the safe thing to return is True for all
functions, yet we want to return False as often as we safely can.
Classical compiler-analysis conservative-estimate stuff.

Further, for all skins which extend ucode, we add suitable functions
to the core/skin iface so that the same questions can be asked of the
extended ucodes.

With fn (3), the redundant INCEIP removal phase can be reinstated
for all skins, and should work correctly.  The reason it's commented
out is precisely because at present it can't reliably ask this
question.

Fn (2) would be used to formally and cleanly support Jeremy's lazy
FPU state save/restore, which I think is rolled into the code
generation loop.  I would like to be able to ship this optimisation
with confidence that I understand the ramifications.

Fn (1) would dually be used to support lazy %EFLAGS save/restore in
the same manner as the FPU.

None of this would be hard to implement, and they would support a
useful bunch of optimisations.

Variant 2 (more ambitious)
~~~~~~~~~~~~~~~~~~~~~~~~~~

We create two new functions, all of them predicates on UInstrs:

   1  Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
   2  Bool uinstr_maybe_trashes_realFPU ( UInstr* );

We do the lazy FPU and EFLAGS save/restore optimisation exactly as
above.

Difference here is we do table-based EIP reconstruction as needed
and completely nuke INCEIP (yay!)

So, precisely how to reconstruct %EIP from %eip?  Like this.  Firstly
%EIP must be made up-to-date at the start of each bb, but it is
anyway, so that's free.  Now suppose we are in the middle of a bb and
want to know the current %EIP.  We actually know the block-start %EIP
and the current %eip.  Using the block-start %EIP we can look in the
translation table (TT) to find info about this block: start address
of its translation, and presumably whereabouts its %eip->%EIP mapping
table is.

Knowing where the translation starts (%eip for the start of the
translated block) and knowing the current %eip, we can figure out how
far inside the translation we are.  That offset can then be used to
index (somehow) the mapping table for this block, to give the
corresponding %EIP offset, which, when added to the block-start %EIP,
gives the current %EIP.

Makes sense?

So the only question is how to compactly encode the mapping table,
and perhaps where to store them, but that's not a big deal.

Comments?

If we could do variant 2 rather than variant 1 that would be cool,
considering it would give a bigger speedup overall.

J
From: Julian S. <js...@ac...> - 2002-11-21 08:43:19
On Thursday 21 November 2002 1:34 am, Jeremy Fitzhardinge wrote:
> On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> > This is the third of two messages about t-chaining. Get a stiff
> > whisky before reading further; you'll need it.
> >
> > Using the attached prog, it's easy to show that on my P3, each
> > (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> > although verifying that. Considering that the P6 pipe is alleged
> > to be about 10 stages, I strongly suspect they both cause a pipe
> > flush. Ie, each is as bad as a mispredicted jump.
>
> Yes, the P3 optimisation guide says that pushf/popf are complex
> instructions (ie, microcoded). They don't stall the pipeline per-se,
> but they limit the decode rate and serialise stuff which needn't be
> serialised.
True. Nevertheless, if they are merely vector-decode insns and
don't flush the pipe, they are darned expensive. Perhaps there is
a constant set-up cost of a few cycles for a vector-decode insn?
Does the P3 opt guide say anything about that? I have the P4 one
to hand and read that, but it's pretty unhelpful for the P3 :)
> Given gems such as:
>
> pushl 32(%ebp)
> popfl
> subl $0x3E7, %eax
> pushfl
> popl 32(%ebp)
> pushl 32(%ebp)
> popfl
> jnle-8 %eip+13
This really is a lemon, insn't it. One good thing is that (a) lazy save/
restore of the flags will nuke 4 of the 6 {push,pop}fl ops, and (b)
this is a very common idiom, arising from the original
"do-alu-op-and-set-flags ; conditional jump".
> what was the problem with lazy save/restore of the flags again?
Only really my paranoia about getting it right. Clearly you haven't
spent enough hours chasing weird bugs to do with state leakage from
the real machine to the simulated one :)
> Still, is this really enough to slow down the whole block? What about
> AGI stalls?
Well, that's a good question. Firstly, I don't know the rules on P3 for
AGI stalls, but looking at the main part of it, 8 copies of this:
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
an AGU is only needed by the leal (and the movl?), but at most 2 x every 3
insns, so I reckon there's enough slack there to run these groups at a CPI
of 1.0 or better.
Let's re-consider the numbers. The loop runs 25.0 million times in 1.56
seconds, which is 16.02 million/second. At 1133 MHz, that's 70.7 cycles
per iteration, give or take. And the loop is about 42 insns long.
Also, there are 6 pushf/popfs ("*fs") in the loop.
In the worst case, those *f would occupy 6 * 12 cycles (12 as from my
mini-experiment last night), = 72. Clearly this isn't possible since
the loop only takes 71 cycles and contains 36 other insns too.
However, even if we conservatively assume that each *f insn takes half
what I measured, ie 6 cycles, that leaves 35 cycles for the remaining
36 insns, which is a CPI of about 0.97, close to the CPI of the original
loop. That sounds more realistic. So I'd say it's not implausible to
argue that those 6 *f insns consume half the entire running time.
Proposal in the next message.
J
From: Jeremy F. <je...@go...> - 2002-11-21 01:34:09
On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> This is the third of two messages about t-chaining. Get a stiff
> whisky before reading further; you'll need it.
>
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although verifying that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
Yes, the P3 optimisation guide says that pushf/popf are complex
instructions (ie, microcoded). They don't stall the pipeline per-se,
but they limit the decode rate and serialise stuff which needn't be
serialised.
Given gems such as:
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
what was the problem with lazy save/restore of the flags again?
Still, is this really enough to slow down the whole block? What about
AGI stalls?
J
From: Jeremy F. <je...@go...> - 2002-11-21 00:48:29
On Wed, 2002-11-20 at 14:20, Julian Seward wrote:
> A question: What's the deal on the LRU mechanism when
> chaining is going, now? Do chained translations get marked when they
> get used, now? Not afaics ... but this is going to present something
> of a problem when LRUing because chained ones will appear to not have
> been used, and so will be much more likely to be dumped. Perhaps we can
> do something clever based on the idea that if a T has been chained
> it must have been used at least once since the last LRU pass.
I did notice the problem and made a small attempt to address it:
patch_me updates the epoch field in the same way as the dispatch loop.
It doesn't actually help much, but it looks like I paid attention to it.
I was thinking of a scheme like an OS VM scheme: have a loop go through
the TTEs at some low rate and unchaining blocks; if on the 2nd pass the
block is still unchained and still hasn't updated the LRU entry, then
consider it old.
J
From: Julian S. <js...@ac...> - 2002-11-21 00:08:20
This is the third of two messages about t-chaining. Get a stiff
whisky before reading further; you'll need it.
Using the attached prog, it's easy to show that on my P3, each
(pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
although I haven't verified that. Considering that the P6 pipe is alleged
to be about 10 stages, I strongly suspect they both cause a pipe
flush. Ie, each is as bad as a mispredicted jump.
#include <rude_words.h>
This is bad news. Or a big opportunity to do better.
I have to go to bed now, but from a mental back-of-the-envelope
calculation I'd guess this easily explains the cycle loss from
the inner loop of the previous message.
J
#include <stdio.h>

int main ( int argc, char** argv )
{
   int i;
   for (i = 0; i < 1000 * 1000 * 1000; i++) {
      asm volatile ("pushfl ; popfl");
      asm volatile ("pushfl ; popfl");
   }
   return 0;
}