From: Jeremy F. <je...@go...> - 2002-11-22 16:06:39
On Thu, 2002-11-21 at 23:58, Julian Seward wrote:
> Sorry to be stupid. I'm still confused. In that case, wouldn't it be simpler
> just to set %EIP to the correct value before the call, and completely avoid
> all the hassle of establishing %eip and converting it to %EIP ? I think
> I must be missing something.
No, you're right, I'm being silly.
J
From: Nicholas N. <nj...@ca...> - 2002-11-22 12:37:23
On Fri, 22 Nov 2002, Julian Seward wrote:
> [S2: suggestion of smarter code for Jz et al]
>
> Another thing which would surely help is to generate smarter code for
> the Jz / Jnz UINSTRS. Observation to be made here is that at least for
> the Zero flag, we might as well directly test the bit in %EFLAGS in
> memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
> %eflags: [note; this illustration is without t-chaining, but is equally
> valid in t-chaining's presence]
>
> Jzo $addr
> -->
> testl $(1<<N), 32(%ebp) -- 32(%ebp) is %EFLAGS
> -- and N is the offset of the Z flag in
> -- %EFLAGS
> jz-8 %eip+6
> movl $addr, %eax
> ret
> %eip+6:
>
> in contrast to the current scheme
>
> pushl 32(%ebp)
> popfl
> jnz-8 %eip+6
> movl $addr, %eax
> ret
> %eip+6:
>
> In fact the new scheme not only avoids the horrible popfl; it also
> saves an insn!
Nice idea. But doesn't it clash with the lazy EFLAGs idea? viz:
37: ANDL %eax, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %eax, %ebx
pushfl ; popl 32(%ebp) ***
39: Jzo $0x402047AC (-rOSZACP)
pushl 32(%ebp) ; popfl ***
jnz-8 %eip+6
movl $0x402047AC, %eax
ret
If Jzo doesn't have the "pushl 32(%ebp) ; popfl" then you can't remove the
"pushfl ; popl 32(%ebp)" from ANDL -- instead of cutting 4 instructions,
it only cuts 2.
Another thing: will the ANDL's "pushfl ; popl 32(%ebp)" always be
necessary? Ie. do the condition codes have to be saved in EFLAGs before
the end of the BB? If so, then lazy EFLAG updating can only save two
instructions, not four, in which case the new Jz/Jnz idea seems better
because it saves another instruction anyway...
> Some complicated tests (not-below, etc) would still have to be done the
> old way, but we could cover the tests associated with single flags
> (zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
> catch most _uses_ of the condition codes.
FYI, from "valgrind --skin=none --trace-codegen=00001 true":
all jumps 1908
JMPo 916
Jzo 407
Jnzo 150
JMPo-c 177
JMPo-r 121
Jle 20
Jnbo 25
Jbo 13
Jnbeo 12
Jns 16
Jnleo 9
JMPo-sys 7
sum of those listed: 1873 (35 not mentioned)
Of the 726 Jccs, 77% are Jz/Jnz.
N
From: Nicholas N. <nj...@ca...> - 2002-11-22 12:03:05
On Fri, 22 Nov 2002, Julian Seward wrote:
> Second part of the trick is that we can probably constrain ourselves to
> scanning about a dozen or so stack locations; the RA must be in that area.
> Consider what the stack looks like immediately after doing the call
> insn which leads to the helper (stack grows down):
>
> RA for VG_(run_innerloop)
> saved ebx, ecx, edx, esi, edi, ebp pushed at start of VG_(run_innerloop)
> RA for where VG_(run_innerloop) calls translation
> /* stuff below is created by translation and helpers it calls */
> RA for call to helper ("MEMEME") (since all args are passed in regs)
> /* stuff pushed by stack of helper functions */
> /* top-of-stack; %esp points here (or one word above) */
>
> Imagine now we grab the value of %esp at entry to VG_(run_innerloop) and
> park it in some global variable. If my picture of the stack is right, we
> can find the relevant RA ("MEMEME") by scanning downwards from that value
> for 8 or 9 words, looking for a word satisfying the constraints mentioned
> above. If we have to go even a few words further, we've probably missed it.
>
> The combination of value constraint and location constraint should make
> this fairly good at finding the correct RA. The main danger AFAICS is
> that the 6 regs saved at the start of VG_(run_innerloop) might hold
> a spuriously matching value. Even that can be avoided by snapshotting
> %eip after those are pushed, in which case we're even more strongly
> location-constrained: there can be only about 3 or 4 words to search
> before declaring that we've missed it somehow.
Do you mean "%esp after those are pushed"?
Actually, won't we know exactly which stack location it should be in, ie.
immediately after the RA of VG_(run_innerloop)? Or can extra stuff get
put between the two RAs?
> How does that sound? Does it make sense? Is it utterly horrible?
> I'm not sure if something this fragile is really a good idea, but still
> it might be worth trying ...
Another thing is that we have to subtract 6 (or whatever) from the RA to
get the right %eip before doing the eip-to-EIP lookup, but that's fine.
I'm starting to come around to this idea now...
N
From: Nicholas N. <nj...@ca...> - 2002-11-22 11:33:21
On Fri, 22 Nov 2002, Julian Seward wrote:
> I have thought of a way to get hold of %eip in helper fns, skin-independently
> and without slowing down the normal I-don't-want-to-know-%eip case, but it has
> a high coefficient of horribleness [although it is easy to implement].
>
> Basic idea is that we can probably find the return address for the call to the
> helper, lying round on the stack, without outside assistance.
>
> Suppose (as is indeed the case) that %EIP is brought up to date at the
> start of each bb. Now the bb calls a helper fn and you want to know
> what %eip is at the point of the call.
>
> So we're looking for a word on the stack (the RA) that points back into
> the translation. Since we have %EIP for the bb, we can look up in TT
> the location and size of the translation. The word we're looking for
> must have a value in the range: location .. location + size - 1.
> Furthermore, 6 (?) bytes before the word must also be a valid location in the
> translation -- the call insn -- and it must contain the opcode for "call".
> [...]
> Actually I'd guess it's pretty robust, if the above analysis is correct.
Wow. I think it's the sort of idea that if I'd thought of it I would
think it was really cool and worth doing, but because someone else thought
of it I take two steps back and think hmm...
----
Ideally, we want these characteristics:
1a. simple idea
1b. simple effect on existing code (eg. affects only one place)
2. entirely skin-independent
3. maximally efficient, ie. only updates EIP when EIP is actually
recorded (I'm assuming we update EIP between BBs, though)
4. no extra space required
Your suggestion, assuming it works effectively, meets 2 and 3. It fails
1a; as for 1b, the finding-%eip complexity would all be in the one spot,
although it still needs the eip-to-EIP table.
----
Before you mentioned this finding-a-needle-in-a-call-stack idea, I was
leaning back towards the update-EIP-only-when-necessary approach.
%EIP only needs to be up-to-date for a small number of UInstrs:
a. CCALL/CALLM, if the called code records EIP
b. PUT %ESP, if stack change/post_mem_write events are tracked and the
called function records EIP
c. Any extended UInstr that records EIP
There are only two ways skin code can record EIP:
1. Call VG_(get_ExeContext)()
2. Call VG_(get_EIP)()
VG_(get_ExeContext)() could be changed so that it calls VG_(get_EIP)(),
which would essentially reduce the problem to a single point of code.
Using the latest ideas:
a. add a 1-bit field `records_eip' to CCALL and CALLM instructions that
a skin must turn on if the called code can record EIP
b. add a bool argument to the functions VG_(track_new_mem_stack)() etc,
which the skin makes `True' if the called code can record EIP
c. add another function that must be implemented by skins using extended
UCode, SK_(UInstr_record_EIP)()
a,b,c in this list correspond directly to a,b,c the list above.
(Actually, c could be handled by a's mechanism too.)
Any UInstr which could cause an EIP-record would have to be preceded with
a PUTEIP instruction (I agree that sounds better than INCEIP). Actually,
we could keep INCEIP instructions (necessary for determining x86
instruction boundaries), generate no code for them, and the codegen phase
could effectively add the PUTEIP code for UInstrs when the relevant
boolean is set without actually needing an explicit PUTEIP instruction.
This satisfies 1a and 4, isn't too bad for 1b, but fails the others. It
also makes the VG_(track_*) functions non-uniform, which is unfortunate.
----
So, I've criticised the flaws in the current proposals, but haven't come
up with anything better. Harumph.
----
A couple of other random things that occurred to me:
i. If we do use an eip-to-EIP table, it doesn't need to record entries
for every instruction, just those that can cause an EIP-recording
(CCALL/CALLM/PUT ESP/extended UInstrs).
ii. None of this affects Cachegrind, since it records %eip for every x86
instruction in its cost-centre anyway. At one point I took out all
INCEIPs for Cachegrind, but then put them back in because of the
weird-signals-sometimes issue.
N
From: Julian S. <js...@ac...> - 2002-11-22 09:01:44
> Maybe we should at least keep EIP up to date at the basic block level,
> so we only need to deal with the intra-BB case.
As per recent msg to Nick, doing so at least makes it easy to find the
%eip (and hence %EIP) by scanning the stack for helper-function
return addresses. One PUTEIP per bb doesn't seem too onerous.
> I was thinking of a table with one entry per original instruction, which
> contains the length of the original instruction and the length of the
> corresponding generated code. It may be possible to pack this into 2
> nibbles per instruction, with an escape mechanism to deal with rare
> cases (ie, if length == 0xF then get the details from the following
> byte). That could be built as the code is generated then tacked onto
> the end of the generated code (the TTE would contain the generated code
> length plus the length of the whole translation, including the
> instruction mapping table). That way doing the mapping is a simple
> linear scan:
>
> cur_eip = tte->trans_addr;
> cur_EIP = tte->orig_addr;
> entry = 0;
> while(eip > cur_eip) {
> cur_eip += tab[entry].trans_len;
> cur_EIP += tab[entry].orig_len;
> entry++;
> }
I was thinking of the same thing. We could be simpler and just use a pair
of bytes (orig_insn_len, trans_insn_len) for each orig insn; this gives an
overhead of 2 bytes per orig insn, which is not too bad. (About 2 mbyte for
a run of Mozilla).
We can easily do a bit better with some kind of cleverer
encoding scheme as you mention, if that's needed [probably an array of
12-bit entities, 4 bits for orig len (they are all in the range 1 .. 17)
and 8 for translated length, with perhaps (N, 0xFF) indicating that the
entire next 12-bit group is a translated length, in the (appalling) case
that one orig insn generates more than 254 bytes of translation].
One comment is that it might be worth having in each TT entry a small number
of such slots, say 8. That way most basic blocks can store their table in
the TT entry. For the occasional large bb, TT will have to contain a ptr to
a table allocated in VG_(malloc)ville. This seems like a good idea because
there's a considerable space overhead using VG's malloc stuff -- about 6 words
per alloc -- so doing a large number of small malloc's is quite wasteful of
space.
Adding 8 slots to each TT covers most bbs, and would take 4 words
with the (byte,byte) encoding scheme and 3 words with the (4,8)
scheme.
J
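[Editorial note: the linear scan quoted in this message can be written out as a self-contained C sketch. The `Mapping` struct, field names, and `eip_to_EIP` are invented for illustration; the real TT entry layout would differ, and this reproduces exactly the loop from the quoted pseudocode, with the (orig_len, trans_len) byte-pair encoding Julian proposes.]

```c
#include <assert.h>
#include <stddef.h>

/* One (orig_len, trans_len) byte pair per original instruction:
   2 bytes of table overhead per orig insn, as discussed above. */
typedef struct {
    unsigned char orig_len;   /* length of the original x86 insn */
    unsigned char trans_len;  /* length of its generated code    */
} Mapping;

/* Walk the table, advancing the translated-code cursor until it
   reaches `eip`, moving the original-code cursor in lockstep. */
unsigned long eip_to_EIP(unsigned long eip,
                         unsigned long trans_addr,  /* tte->trans_addr */
                         unsigned long orig_addr,   /* tte->orig_addr  */
                         const Mapping *tab, size_t n)
{
    unsigned long cur_eip = trans_addr;
    unsigned long cur_EIP = orig_addr;
    for (size_t i = 0; i < n && eip > cur_eip; i++) {
        cur_eip += tab[i].trans_len;
        cur_EIP += tab[i].orig_len;
    }
    return cur_EIP;
}
```

Note that, as in the quoted loop, an `eip` falling strictly inside an instruction's translation maps to the *next* original address; that is what you want when `eip` is a return address pointing just past a call.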
From: Julian S. <js...@ac...> - 2002-11-22 08:24:08
Nick,
I have thought of a way to get hold of %eip in helper fns, skin-independently
and without slowing down the normal I-don't-want-to-know-%eip case, but it has
a high coefficient of horribleness [although it is easy to implement].
Basic idea is that we can probably find the return address for the call to the
helper, lying round on the stack, without outside assistance.
Suppose (as is indeed the case) that %EIP is brought up to date at the
start of each bb. Now the bb calls a helper fn and you want to know
what %eip is at the point of the call.
So we're looking for a word on the stack (the RA) that points back into
the translation. Since we have %EIP for the bb, we can look up in TT
the location and size of the translation. The word we're looking for
must have a value in the range: location .. location + size - 1.
Furthermore, 6 (?) bytes before the word must also be a valid location in the
translation -- the call insn -- and it must contain the opcode for "call".
So you're heavily constrained as to the value of the word being searched
for.
Second part of the trick is that we can probably constrain ourselves to
scanning about a dozen or so stack locations; the RA must be in that area.
Consider what the stack looks like immediately after doing the call
insn which leads to the helper (stack grows down):
RA for VG_(run_innerloop)
saved ebx, ecx, edx, esi, edi, ebp pushed at start of VG_(run_innerloop)
RA for where VG_(run_innerloop) calls translation
/* stuff below is created by translation and helpers it calls */
RA for call to helper ("MEMEME") (since all args are passed in regs)
/* stuff pushed by stack of helper functions */
/* top-of-stack; %esp points here (or one word above) */
Imagine now we grab the value of %esp at entry to VG_(run_innerloop) and
park it in some global variable. If my picture of the stack is right, we
can find the relevant RA ("MEMEME") by scanning downwards from that value
for 8 or 9 words, looking for a word satisfying the constraints mentioned
above. If we have to go even a few words further, we've probably missed it.
The combination of value constraint and location constraint should make
this fairly good at finding the correct RA. The main danger AFAICS is
that the 6 regs saved at the start of VG_(run_innerloop) might hold
a spuriously matching value. Even that can be avoided by snapshotting
%eip after those are pushed, in which case we're even more strongly
location-constrained: there can be only about 3 or 4 words to search
before declaring that we've missed it somehow.
How does that sound? Does it make sense? Is it utterly horrible?
I'm not sure if something this fragile is really a good idea, but still
it might be worth trying ...
Actually I'd guess it's pretty robust, if the above analysis is correct.
And even if it calculates a wrong %EIP once in 10000 calls, do we care?
[perhaps cachegrind does? I don't know]. The calculated %EIP can be
sanity checked against the %EIP-saved-at-bb-start and the bb's known
original length as extracted from TT. If the %EIP calculated via this
method points outside that range, something's clearly wrong and so we
can ignore it and fall back to the %EIP-saved-at-bb-start.
J
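[Editorial note: the value-plus-location constrained scan described in this message could look roughly like the C sketch below. Everything here is illustrative: the stack is simulated as an array of words below the snapshotted %esp, `loc`/`size` stand for the translation's location and size from TT, and the call length is a parameter because the message itself hedges it as "6 (?)" — a direct near call (opcode 0xE8) is 5 bytes, but the generated call might be encoded differently.]

```c
#include <assert.h>
#include <stddef.h>

/* Scan up to `nwords` stack words (stack[0] = word nearest the
   snapshotted %esp) for a return address pointing back into the
   translation, i.e. a word w with loc <= w < loc+size whose
   preceding `call_len` bytes in the translation start with the
   call opcode 0xE8.  Returns the RA, or 0 if the scan misses. */
unsigned long find_helper_RA(const unsigned long *stack, size_t nwords,
                             unsigned long loc, unsigned long size,
                             const unsigned char *code,  /* the translation */
                             unsigned call_len)
{
    for (size_t i = 0; i < nwords; i++) {
        unsigned long w = stack[i];
        /* value constraint: RA must point into the translation,
           and far enough in that a call insn fits before it */
        if (w < loc + call_len || w >= loc + size) continue;
        /* opcode constraint: byte call_len before the RA is "call" */
        if (code[w - loc - call_len] == 0xE8) return w;
    }
    return 0;
}
```

A miss (return 0) corresponds to the fallback Julian mentions: ignore the scan and use the %EIP saved at bb start.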
From: Julian S. <js...@ac...> - 2002-11-22 07:51:04
On Friday 22 November 2002 7:08 am, Jeremy Fitzhardinge wrote:
> On Thu, 2002-11-21 at 23:07, Julian Seward wrote:
> > Um, I'm confused. I don't understand how to solve the following problem:
> > memcheck calls a helper function to do a check, concludes there's an
> > error and needs %EIP. How exactly do you propose to generate it if you
> > don't know %eip at the point where the helper was called?
>
> Oh, sorry. I was assuming you'd store that in a baseblock field before
> the call. Hm, I guess that would need a trampoline to do it properly:
>
>     movl %ccall-func, %eax
>     call ccall-tramp
>     [...]
>
> ccall-tramp:
>     movl (%esp), %ebx
>     movl %ebx, XX(%ebp)
>     jmp *%eax
>
> Or perhaps do it completely inline. Maybe it's all a bit clumsy.
Sorry to be stupid. I'm still confused. In that case, wouldn't it be simpler
just to set %EIP to the correct value before the call, and completely avoid
all the hassle of establishing %eip and converting it to %EIP? I think
I must be missing something.
J
From: Jeremy F. <je...@go...> - 2002-11-22 07:08:50
On Thu, 2002-11-21 at 23:07, Julian Seward wrote:
> Um, I'm confused. I don't understand how to solve the following problem:
> memcheck calls a helper function to do a check, concludes there's an error
> and needs %EIP. How exactly do you propose to generate it if you don't
> know %eip at the point where the helper was called?
Oh, sorry. I was assuming you'd store that in a baseblock field before
the call. Hm, I guess that would need a trampoline to do it properly:
movl %ccall-func, %eax
call ccall-tramp
[...]
ccall-tramp:
movl (%esp), %ebx
movl %ebx, XX(%ebp)
jmp *%eax
Or perhaps do it completely inline. Maybe it's all a bit clumsy.
J
From: Julian S. <js...@ac...> - 2002-11-22 07:00:18
> > The problem appears to be how to get hold of %eip (the real one) when
> > a skin calls a helper function, in a way which is not skin-specific.
> > I thought about this this evening for some considerable time, but I
> > can't think of a clean solution which doesn't involve some amount of
> > performance loss. The least-worst thing I thought of was to associate
> > with CCALL uinstrs a boolean flag, which when set automagically causes
> > the current %eip to get passed to the called function as an extra
> > parameter. This at least makes it skin-independent.
>
> But we have a VG_(get_EIP)() function. Why can't skins just use that,
> and we hide all the complexity in that function? Why does a skin need
> to know about the hardware %eip?
Um, I'm confused. I don't understand how to solve the following problem:
memcheck calls a helper function to do a check, concludes there's an error
and needs %EIP. How exactly do you propose to generate it if you don't
know %eip at the point where the helper was called?
> That's nice and simple to implement. It's a good start. Has anyone
> looked to see what the proportions of the different comparisons are? I
> would expect Z and NZ would be way up there...
I bet Hennessy and Patterson has this info. I'll look.
J
From: Julian S. <js...@ac...> - 2002-11-22 05:07:01
So, I've read all today's email messages, I think.
Nick, it looks like you've picked up some good issues in today's hackery.
Let me try and re-draw the picture in light of them.
We're trying to fix 3 performance-sapping problems:
1. too many INCEIPs
2. too many eflags save/restores
3. too many fpu save/restores
3 is pretty much solved already by Jeremy's work. 1 and 2 are still open.
1. INCEIPs
~~~~~~~~~~~
The best way to handle the INCEIP expense is replace it all with a
table lookup. As Jeremy suggests, it's probably simpler to have one
big table than one table per bb, and I'm probably inclined in favour
of that.
The problem appears to be how to get hold of %eip (the real one) when
a skin calls a helper function, in a way which is not skin-specific.
I thought about this this evening for some considerable time, but I
can't think of a clean solution which doesn't involve some amount of
performance loss. The least-worst thing I thought of was to associate
with CCALL uinstrs a boolean flag, which when set automagically causes
the current %eip to get passed to the called function as an extra
parameter. This at least makes it skin-independent.
Unfortunately, it adds extra expense for almost all of the calls made
by cachegrind, memcheck and addrcheck, since most of those can potentially
report an error and so need %eip. This is not good. So I don't like
this idea.
2. eflags save/restores
~~~~~~~~~~~~~~~~~~~~~~~~
The second thing Nick has unearthed is that the lazy eflag optimisation
will often not apply, due to intervening flag-mangling operations.
[Nick: yes, your understanding of how this is supposed to work appears
to be correct]. This is bad; but I'd like to do better, because we're
really getting clobbered by these pushf/popf insns.
So here's ...
Yet another proposal
~~~~~~~~~~~~~~~~~~~~
If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
should back off from the %EIP-by-tables story; in effect fall back to
Variant (1) in this morning's proposal. That means retaining INCEIP, at
least when we can't get rid of redundant updates.
I've realised that perhaps INCEIP isn't a clever semantics. Problem is
that (1) it turns into (eg) addl $2, 36(%ebp), which means a load, ALU op,
and store, and (2) it trashes the flags.
[S1: suggestion of PUTEIP]
How about if we had instead PUTEIP, which simply sets the value of %EIP
to some literal value? So, we just emit PUTEIPs with absolute addresses
where currently we create INCEIPs with deltas. PUTEIP then becomes
something like
movl $0x8048517, 36(%ebp)
which is a longer insn (7 bytes?), but is still a fastpath decode at least
on Athlon. It has the advantage of not requiring a load or an ALU op, so is
less resource-hungry. And best of all, it is flag-unaffecting, so that we
can now do the lazy eflags stuff without getting messed up by them.
[S2: suggestion of smarter code for Jz et al]
Another thing which would surely help is to generate smarter code for
the Jz / Jnz UINSTRS. Observation to be made here is that at least for
the Zero flag, we might as well directly test the bit in %EFLAGS in
memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
%eflags: [note; this illustration is without t-chaining, but is equally
valid in t-chaining's presence]
Jzo $addr
-->
testl $(1<<N), 32(%ebp) -- 32(%ebp) is %EFLAGS
-- and N is the offset of the Z flag in
-- %EFLAGS
jz-8 %eip+6
movl $addr, %eax
ret
%eip+6:
in contrast to the current scheme
pushl 32(%ebp)
popfl
jnz-8 %eip+6
movl $addr, %eax
ret
%eip+6:
In fact the new scheme not only avoids the horrible popfl; it also
saves an insn!
Some complicated tests (not-below, etc) would still have to be done the
old way, but we could cover the tests associated with single flags
(zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
catch most _uses_ of the condition codes.
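[Editorial note: concretely, these single-flag tests reduce to bit masks on the saved %EFLAGS word — the architected layout puts CF at bit 0, ZF at bit 6, SF at bit 7, so the `N` in the `testl` above would be 6 for the Zero flag. A minimal C model of the taken-test; function names are invented:]

```c
#include <assert.h>

/* Architected x86 EFLAGS bit positions for the single-flag tests */
#define EFLAGS_CF (1u << 0)   /* carry */
#define EFLAGS_ZF (1u << 6)   /* zero  */
#define EFLAGS_SF (1u << 7)   /* sign  */

/* Jzo takes the branch iff ZF is set in the saved %EFLAGS word;
   Jnzo iff it is clear.  This is exactly what the generated
   "testl $(1<<6), 32(%ebp)" sequence decides, with no popfl. */
int jz_taken(unsigned saved_EFLAGS)  { return (saved_EFLAGS & EFLAGS_ZF) != 0; }
int jnz_taken(unsigned saved_EFLAGS) { return (saved_EFLAGS & EFLAGS_ZF) == 0; }
int js_taken(unsigned saved_EFLAGS)  { return (saved_EFLAGS & EFLAGS_SF) != 0; }
int jb_taken(unsigned saved_EFLAGS)  { return (saved_EFLAGS & EFLAGS_CF) != 0; }
```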
[S3: suggestion of uinstr reordering]
Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
created by memcheck and cachegrind will often get in the way of the lazy
eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
specifically) to generate instrumentation a bit more cleverly so as to try
and keep condition code generators and users together more often. Dunno if
this will help.
J
From: Jeremy F. <je...@go...> - 2002-11-22 02:53:21
On Thu, 2002-11-21 at 15:31, Julian Seward wrote:
> Ha, that sounds like it might work well enough. I sometimes wondered
> how OSs implement the pseudo-LRU stuff for VMs. Is there a standard
> description of this, that you know of?
This is called the one-hand clock algorithm. The two-hand variant is
used when the scan over the whole of memory would take too long, and you
want to test before a complete pass has been made. I don't know what
the standard reference is, but I think pretty much all OS books talk
about it (Tanenbaum, for example).
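[Editorial note: a minimal model of the one-hand clock described above, applied to translation-table entries. `TTEntry` and its fields are invented: the hand clears each entry's "used" bit as it passes (standing in for unchaining the block), and an entry found still clear on the next pass has not been touched in between and is declared old.]

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int valid;   /* slot holds a live translation */
    int used;    /* touched (re-chained) since the last pass */
} TTEntry;

/* One full pass of the clock hand over the table.  Returns how
   many entries were found old (eviction candidates). */
int clock_sweep(TTEntry *tt, size_t n)
{
    int old = 0;
    for (size_t i = 0; i < n; i++) {
        if (!tt[i].valid) continue;
        if (tt[i].used)
            tt[i].used = 0;   /* second chance: clear and move on */
        else
            old++;            /* still clear since last pass: old */
    }
    return old;
}
```

Running the sweep "at some low rate", as Jeremy suggests, gives an approximate LRU without any per-dispatch bookkeeping.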
> Yes, that is the simplest thing, but I fear it would be very expensive.
> A less radical similar scheme is to randomly discard some fraction of
> the translations. Then you get into schemes where you randomly move
> some fraction of the translations into a "victimisation cache" (in
> hardware speak). If they turn out to be needed again soon, they can
> be moved back from the cache back into the main set of active translations.
> If the V-cache gets full _it_ then is flushed. This scheme has the
> effect of somewhat mitigating the situation where a translation is
> randomly selected for demotion and then used again soon afterwards.
> But it's all a bit complex.
Dynamo doesn't quite dump everything when the cache gets full. It keeps
track of the rate of creation of new basic blocks, and dumps the cache
when this starts getting high, on the grounds that the program is
entering a new working set and the cache is stale anyway. You could
define "too high" in terms of how full the cache is in proportion to the
size you'd like it to be, since you'd obviously like a high cache fill
rate when the cache is empty.
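[Editorial note: the Dynamo-style trigger sketched above could be modelled as below. The threshold constant and the exact way the creation rate is weighted by cache fill are invented for illustration; the only idea taken from the text is that a high creation rate should be tolerated while the cache is empty but should trigger a flush as the cache approaches its target size.]

```c
#include <assert.h>
#include <stddef.h>

/* Flush when the new-translation rate, scaled by how full the
   cache already is, exceeds a (made-up) threshold.  An empty
   cache tolerates any rate; a nearly full one tolerates little. */
int should_flush(double new_bbs_per_sec,
                 size_t cache_used, size_t cache_target)
{
    double fill = (double)cache_used / (double)cache_target;
    return new_bbs_per_sec * fill > 100.0;
}
```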
A generational caching scheme sounds good, so long as it doesn't
actually involve physically moving the code.
J
From: Jeremy F. <je...@go...> - 2002-11-22 02:41:14
On Thu, 2002-11-21 at 16:08, Julian Seward wrote:
> 1. INCEIPs
> ~~~~~~~~~~~
> The best way to handle the INCEIP expense is replace it all with a
> table lookup. As Jeremy suggests, it's probably simpler to have one
> big table than one table per bb, and I'm probably inclined in favour
> of that.
Eh? No I think one table per BB is easiest to handle, particularly if
we have EIP at least up to date to the current basic block.
> The problem appears to be how to get hold of %eip (the real one) when
> a skin calls a helper function, in a way which is not skin-specific.
> I thought about this this evening for some considerable time, but I
> can't think of a clean solution which doesn't involve some amount of
> performance loss. The least-worst thing I thought of was to associate
> with CCALL uinstrs a boolean flag, which when set automagically causes
> the current %eip to get passed to the called function as an extra
> parameter. This at least makes it skin-independent.
But we have a VG_(get_EIP)() function. Why can't skins just use that,
and we hide all the complexity in that function? Why does a skin need
to know about the hardware %eip?
> Unfortunately, it adds extra expense for almost all of the calls made
> by cachegrind, memcheck and addrcheck, since most of those can potentially
> report an error and so need %eip. This is not good. So I don't like
> this idea.
Memcheck only needs %EIP when actually reporting an error though.
> 2. eflags save/restores
> ~~~~~~~~~~~~~~~~~~~~~~~~
> The second thing Nick has unearthed is that the lazy eflag optimisation
> will often not apply, due to intervening flag-mangling operations.
> [Nick: yes, your understanding of how this is supposed to work appears
> to be correct]. This is bad; but I'd like to do better, because we're
> really getting clobbered by these pushf/popf insns.
I had an idea about restoring eflags without pushf/popf: just rerun the
setting instruction. Since (almost?) all the instructions which
generate eflags are otherwise side-effect free and don't depend on
memory, why not just rerun them on the same values just before the
jump? (Or perhaps move it altogether, but that might be trickier than
just copying the instruction.)
I haven't looked at this in detail, but it's an amusing idea...
Hm, I guess making sure the flags are correct for the end of the BB
still needs them to be explicitly saved. Which I suppose we could do in
the form of saving a piece of code which has the right effect, but
that's probably going a bit far...
> If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
> should back off from the %EIP-by-tables story; in effect fall back to
> Variant (1) in this morning's proposal. That means retaining INCEIP, at
> least when we can't get rid of redundant updates.
I still don't think it's hard to efficiently implement an EIP table
lookup.
> How about if we had instead PUTEIP, which simply sets the value of %EIP
> to some literal value? So, we just emit PUTEIPs with absolute addresses
> where currently we create INCEIPs with deltas. PUTEIP then becomes
> something like[...]
I do this in the t-chaining patch. Just before each jump, it generates:
movl $target-addr, %eax
movl %eax, 36(%ebp)
It means that if the chaining patch fails, it can still just ret into
the dispatch loop, and %eax is set up properly, as well as having %EIP
up to date in case there's a context switch.
I currently generate this inline, but I was thinking of adding a SETEIP
with these semantics. This would also be useful to distinguish basic
blocks in an extended basic block when doing trace caching.
> [S2: suggestion of smarter code for Jz et al]
>
> Another thing which would surely help is to generate smarter code for
> the Jz / Jnz UINSTRS. Observation to be made here is that at least for
> the Zero flag, we might as well directly test the bit in %EFLAGS in
> memory [...]
> Some complicated tests (not-below, etc) would still have to be done the
> old way, but we could cover the tests associated with single flags
> (zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
> catch most _uses_ of the condition codes.
That's nice and simple to implement. It's a good start. Has anyone
looked to see what the proportions of the different comparisons are? I
would expect Z and NZ would be way up there...
Hm, looking at the definitions of the tests in the manual, it seems that
it would be pretty easy to generate open-coded equivalents for most of
them in much less than the 12 cycles to do popf...
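To illustrate the single-flag idea: the bit positions come straight from the IA-32 manual (CF = bit 0, ZF = bit 6, SF = bit 7, OF = bit 11), so the Zero test can be a single `testb $0x40, 32(%ebp)` against the saved flags word, with no popf. A hedged C sketch of the condition predicates (function names invented for illustration):

```c
#include <assert.h>

/* x86 EFLAGS bit positions (IA-32 manual, vol. 1) */
#define EFL_CF (1u << 0)   /* carry    */
#define EFL_ZF (1u << 6)   /* zero     */
#define EFL_SF (1u << 7)   /* sign     */
#define EFL_OF (1u << 11)  /* overflow */

/* Single-flag conditions read one bit of the saved flags word, so
   e.g. Jz can become "testb $0x40, 32(%ebp); jnz taken". */
static int cond_Z(unsigned eflags) { return (eflags & EFL_ZF) != 0; }
static int cond_S(unsigned eflags) { return (eflags & EFL_SF) != 0; }
static int cond_C(unsigned eflags) { return (eflags & EFL_CF) != 0; }

/* A compound condition ("not below or equal": CF=0 and ZF=0) needs
   two bits, but is still a cheap open-coded mask test. */
static int cond_NBE(unsigned eflags)
{
    return (eflags & (EFL_CF | EFL_ZF)) == 0;
}
```

The signed comparisons (less, greater) also need SF xor OF, which is why they are the awkward cases mentioned above.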
> Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
> created by memcheck and cachegrind will often get in the way of the lazy
> eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
> specifically) to generate instrumentation a bit more cleverly so as to try
> and keep condition code generators and users together more often. Dunno if
> this will help.
This is along the lines of my suggestion above of duplicating the
flags-setting instruction to just before the test. It's a little bit
like Shade's use of carefully chosen instructions to set the flags to
the desired value.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-21 23:26:07
|
> > > A question: What's the deal on the LRU mechanism when
> > > chaining is going, now?
> >
> > I did notice the problem and made a small attempt to address it:
> > patch_me updates the epoch field in the same way as the dispatch
> > loop. It doesn't actually help much, but it looks like I paid
> > attention to it.
> >
> > I was thinking of a scheme like an OS VM scheme: have a loop go
> > through the TTEs at some low rate and unchaining blocks; if on the
> > 2nd pass the block is still unchained and still hasn't updated the
> > LRU entry, then consider it old.
Ha, that sounds like it might work well enough. I sometimes wondered
how OSs implement the pseudo-LRU stuff for VMs. Is there a standard
description of this, that you know of?
> It might be worth re-mentioning the approach used by Dynamo,
> Walkabout, Strata and seemingly all the other dynamic translators for
> handling this situation -- they just flush the code cache.
>
> But they have smaller translations and smaller code caches, and spend
> less time on translations. And I think Julian's LRU mechanism is
> extremely fast. So maybe just ignore this.
Yes, that is the simplest thing, but I fear it would be very expensive.
A less radical similar scheme is to randomly discard some fraction of
the translations. Then you get into schemes where you randomly move
some fraction of the translations into a "victimisation cache" (in
hardware speak). If they turn out to be needed again soon, they can be
moved back from the V-cache into the main set of active translations.
If the V-cache gets full _it_ then is flushed. This scheme has the
effect of somewhat mitigating the situation where a translation is
randomly selected for demotion and then used again soon afterwards.
But it's all a bit complex.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-21 17:42:28
|
On Thu, 2002-11-21 at 01:08, Julian Seward wrote:
> It's clear we need to do something to fix this. This is a proposal
> with two variants; a less ambitious variant (first) and a more ambitious
> one (second).
>
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> which tell you whether or not a uinstr (more correctly, the code
> generated for that uinstr) **might** (1) alter the real machine's
> %eflags, (2) alter the real machine's FPU state, and (3) require
> to see the %EIP of the simulated machine.
>
> Getting a safe-if-conservative result for (1) and (2) is critical
> for translation correctness. For (3) it just affects the accuracy
> of stack traces. So the safe thing to return is True for all functions,
> yet we want to return False as often as we safely can. Classical
> compiler-analysis conservative-estimate stuff.
>
> Further, for all skins which extend ucode, we add suitable functions
> to the core/skin iface so that the same questions can be asked of the
> extended ucodes.
>
> With fn (3), the redundant INCEIP removal phase can be reinstated
> for all skins, and should work correctly. The reason it's commented
> out is precisely because at present it can't reliably ask this question.
>
> Fn (2) would be used to formally and cleanly support Jeremy's lazy FPU
> state save/restore, which I think is rolled into the code generation
> loop. I would like to be able to ship this optimisation with confidence
> that I understand the ramifications.
>
> Fn (1) would dually be used to support lazy %EFLAGS save/restore in the
> same manner as the FPU.
>
> None of this would be hard to implement, and they would support a useful
> bunch of optimisations.
>
>
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create two new functions, both of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
>
> We do the lazy FPU and EFLAGS save/restore optimisation exactly as
> above.
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free.
Only when using the dispatch loop. When using t-chaining, I currently
manually update EIP to the target EIP just before each chained jump. I
haven't measured how much this costs (because removing it completely
breaks context switches and makes things crash), but it would be nice to
get rid of it. It would need a separate eip->EIP translation table for
the start of basic blocks, then a second process to find the offset
within a basic block. If we don't do it too often, then simply
scanning the TTE table would be a reasonable way of doing the mapping
without extra data structures.
> Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip. Using the block-start %EIP we can look in the translation
> table (TT) to find info about this block: start address of its translation,
> and presumably whereabouts its %eip->%EIP mapping table is.
>
> Knowing where the translation starts (%eip for the start of the
> translated block) and knowing the current %eip, we can figure out how
> far inside the translation we are. That offset can then be used to
> index (somehow) the mapping table for this block, to give the corresponding
> %EIP offset, which, when added to the block-start %EIP, gives the current
> %EIP.
>
> Makes sense?
Yes. The interesting question is how far to go. If we want the ideal
of never having generated code update EIP, then everywhere where we use
%EIP we'd either need to use %eip or convert %eip to %EIP.
Since we already have a VG_(get_EIP) function, making that the clearing
house for all eip->EIP conversions should be pretty easy.
Another interesting question is how to represent eip and EIP in the ThreadState
table. Unless we want to do the mapping on every context switch, the
ThreadStates would need to hold eip and compute EIP on demand. Of
course, since the eips can be invalidated when the translation cache is
cleaned up, we'd need to map all the eips to EIPs. Then, when
scheduling a new thread, we need to start it running at %eip (if valid)
or else %EIP.
Then there's the tricky case of the dispatch loop working in terms of
EIP (as it has to, given it's the first person to see code that's never
been seen before) and everything else working in eip. Have to be sure
to get all those transitions right.
Maybe we should at least keep EIP up to date at the basic block level,
so we only need to deal with the intra-BB case.
> So the only question is how to compactly encode the mapping table, and
> perhaps where to store them, but that's not a big deal.
I was thinking of a table with one entry per original instruction, which
contains the length of the original instruction and the length of the
corresponding generated code. It may be possible to pack this into 2
nibbles per instruction, with an escape mechanism to deal with rare
cases (ie, if length == 0xF then get the details from the following
byte). That could be built as the code is generated then tacked onto
the end of the generated code (the TTE would contain the generated code
length plus the length of the whole translation, including the
instruction mapping table). That way doing the mapping is a simple
linear scan:
cur_eip = tte->trans_addr;
cur_EIP = tte->orig_addr;
entry = 0;
while (eip > cur_eip) {
    cur_eip += tab[entry].trans_len;
    cur_EIP += tab[entry].orig_len;
    entry++;
}
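A compilable sketch of the two-nibble packing plus the linear scan above (names are hypothetical; the 0xF escape-byte case is deliberately not handled here):

```c
#include <assert.h>

typedef unsigned char UChar;

/* One table entry per original insn, two nibbles: low nibble = original
   insn length, high nibble = generated-code length.  Lengths of 0xF or
   more would need the escape byte described above; not sketched. */
static UChar pack_entry(int orig_len, int trans_len)
{
    assert(orig_len < 0xF && trans_len < 0xF);   /* escape not handled */
    return (UChar)((trans_len << 4) | orig_len);
}

/* The linear scan above, reading lengths out of the packed table:
   advance both cursors one original insn at a time until the
   translated-code cursor reaches the queried %eip. */
static unsigned map_eip_to_EIP(unsigned eip, unsigned trans_addr,
                               unsigned orig_addr, const UChar *tab)
{
    unsigned cur_eip = trans_addr, cur_EIP = orig_addr;
    int entry = 0;
    while (eip > cur_eip) {
        cur_eip += tab[entry] >> 4;    /* trans_len */
        cur_EIP += tab[entry] & 0xF;   /* orig_len  */
        entry++;
    }
    return cur_EIP;
}
```

For a block of three insns with (orig_len, trans_len) pairs (2,5), (3,7), (1,4) starting at translation address 1000, querying the start of the second translated insn (1005) recovers the second original address.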
> Comments?
>
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
Dropping EIP updates is a big win. Bigger than chaining indirect jumps,
according to some initial measurements (though I haven't tested anything
which I'd expect to have lots of indirect jumps).
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 15:43:27
|
Hi,
I started having a go at the INCEIP table lookup thing, but I got stuck...
On Thu, 21 Nov 2002, Julian Seward wrote:
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free. Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip.
How do we know what the current %eip is?
Consider an example. When a 'value' error occurs in the middle of a
BB under MemCheck:
gen code calls the asm routine MC_(helper_value_check4_fail)
which calls the C function MC_(helperc_value_check4_fail)()
which calls MC_(record_value_error)()
which eventually calls VG_(maybe_record_error)()
which eventually calls VG_(construct_error)()
which (as tst==NULL) records VG_(baseBlock)[VGOFF_(m_eip)]
If we remove the INCEIPs, VG_(baseBlock)[VGOFF_(m_eip)] will be out of
date -- it will point to the start of the original BB rather than the
exact problem instruction.
So how do we find out %eip in order to look up the snazzy new eip-->EIP
table? Well, it's buried deep down in the stack as the return value
(minus one instruction) for MC_(helper_value_check4_fail). But that's not
a good way to find it out.
Alternatively, during translation, the original %eip is known for each
original instruction and so it could be passed as an arg to
MC_(helper_value_check4_fail). But that would bloat code for TESTV (would
need to push the arg and then clear it after the call) and it doesn't feel
nice. But maybe it has to be done this way?
Or am I missing something obvious here?
N
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 10:26:01
|
On Thu, 21 Nov 2002, Julian Seward wrote:
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> Fn (1) would dually be used to support lazy %EFLAGS save/restore in the
> same manner as the FPU.
Can I just confirm my understanding of the lazy %EFLAG optimisation?
Currently, AIUI, a lot of basic blocks (pretty much all those that end in
a conditional jump) end like this:
37: ANDL %eax, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %eax, %ebx
pushfl ; popl 32(%ebp) ***
38: INCEIPo $2
addl $2, 36(%ebp)
39: Jzo $0x402047AC (-rOSZACP)
pushl 32(%ebp) ; popfl ***
jnz-8 %eip+6
movl $0x402047AC, %eax
ret
If-and-only-if the INCEIP addl is removed, then the 4 instructions in the
two lines marked *** can be trivially removed.
Looking through --skin=none code, the only place where this seems
applicable is for conditional jumps -- arithmetic ops in the middle of
basic blocks don't look like they can have their %EFLAGS fiddling avoided.
One downer: I'm not sure if this will work for MemCheck. Consider the
instrumented code in the same situation:
30: ANDL %edx, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %edx, %ebx
pushfl ; popl 32(%ebp)
31: INCEIPo $2
addl $2, 104(%ebp)
32: GETVFo %eax
movl 0x44(%ebp), %eax
33: TESTVo %eax
testb $0x1, %al
jz-8 %eip+3
call * 76(%ebp)
34: Jzo $0x4000D2C4 (-rOSZACP)
pushl 32(%ebp) ; popfl
jnz-8 %eip+6
movl $0x4000D2C4, %eax
ret
Ignoring the INCEIP, the arithmetic op and the conditional jump are no
longer adjacent.
Here's the equivalent code for Cachegrind:
9: ANDL %edi, %edx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %edi, %edx
pushfl ; popl 32(%ebp)
10: MOVL $0x40269DC0, %eax
movl $0x40269DC0, %eax
11: CCALLo 0x4001C170(%eax)
call * 36(%ebp)
12: INCEIPo $2
addl $2, 60(%ebp)
13: MOVL $0x40269DE0, %eax
movl $0x40269DE0, %eax
14: CCALLo 0x4001C170(%eax)
call * 36(%ebp)
15: Jnzo $0x4000D190 (-rOSZACP)
pushl 32(%ebp) ; popfl
jz-8 %eip+6
movl $0x4000D190, %eax
ret
Again the optimisation is foiled.
It would work for AddrCheck and Helgrind though.
It seems that we would be getting into a situation where the
instrumentation you add could affect what optimisations can occur; in one
way this is reasonable but in another way it kind of sucks.
Or have I misunderstood something?
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
> [...]
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
I support this. The complexity will be localised pretty well, which is
good.
BTW, am I right in thinking that the INCEIP instruction will stay in
UCode, but that it will just be removed during the codegen phase?
Because skins (eg. Cachegrind) use it as a way of determining the
boundaries of the original x86 instructions.
N
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 10:03:28
|
On 20 Nov 2002, Jeremy Fitzhardinge wrote:
> > A question: What's the deal on the LRU mechanism when
> > chaining is going, now?
>
> I did notice the problem and made a small attempt to address it:
> patch_me updates the epoch field in the same way as the dispatch
> loop. It doesn't actually help much, but it looks like I paid
> attention to it.
>
> I was thinking of a scheme like an OS VM scheme: have a loop go
> through the TTEs at some low rate and unchaining blocks; if on the
> 2nd pass the block is still unchained and still hasn't updated the
> LRU entry, then consider it old.
It might be worth re-mentioning the approach used by Dynamo, Walkabout,
Strata and seemingly all the other dynamic translators for handling
this situation -- they just flush the code cache.
But they have smaller translations and smaller code caches, and spend
less time on translations. And I think Julian's LRU mechanism is
extremely fast. So maybe just ignore this.
N
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-21 09:21:26
|
On Thu, 21 Nov 2002, Julian Seward wrote:
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although I haven't verified that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
>
> #include <stdio.h>
>
> int main ( int argc, char** argv )
> {
> int i;
> for (i = 0; i < 1000 * 1000 * 1000; i++) {
> asm volatile ("pushfl ; popfl");
> asm volatile ("pushfl ; popfl");
> }
> return 0;
> }
Here's what Rabbit reports for my 1400 MHz Athlon:
Event Events
Events/sec
---------------------------------------- ---------------- ----------------
0x40 64 L1_data_cache_access 1675390681 365358668.02
0x41 65 L1_data_cache_miss 224993 49065.06
0x42 66 L1_data_cache_refill_from_L2 222561 48534.70
0x43 67 L1_data_cache_refill_from_syst 127562 27817.92
0x44 68 L1_data_cache_writeback 227126 49681.16
0x45 69 L1_DTLB_miss_and_L2_DTLB_hit 101621 22228.41
0x46 70 L1_and_L2_DTLB_miss 70174 15349.74
0x47 71 misaligned_data_references 15663 3426.10
0x80 128 L1_instr_cache_fetch 341826476 74679679.25
0x81 129 L1_instr_cache_miss 312554 68284.45
0x84 132 L1_ITLB_miss_and_L2_ITLB_hit 194970 42595.58
0x85 133 L1_and_L2_ITLB_miss 34231 7478.53
0xc0 192 retired_instructions 1514793469 330718191.35
0xc1 193 retired_ops 10899888249 2379724630.07
0xc6 198 retired_far_control_transfers 6116 1335.28
0xc7 199 retired_resync_branches 3630 792.52
0xc2 194 retired_branches 335619550 73423023.82
0xc3 195 retired_branches_mispredicted 743534 162661.90
0xc4 196 retired_taken_branches 334559598 73191139.59
0xc5 197 retired_taken_branches_mispred 689051 150742.73
0xcd 205 interrupts_masked_cycles 12881304 2818557.60
0xce 206 interrupts_masked_while_pendin 1376 301.08
0xcf 207 taken_hardware_interrupts 520 113.78
0x00 0 Mcycles 6419847560 1404726582.63
resource usage:
time = 27.43 sec user, 0.01 sec sys, 27.46 sec real, 99.91% of cpu
page reclaims, faults = 12, 76
Points to note:
- (instrs : ops) ratio = 1 : 7.2
- CPI = 4.24
- CPop = 0.59
The Athlon optimisation docs say that pushf/popf are VectorPath
instructions with a best-case execution latency of 4 cycles.
(DirectPath instructions are decoded into 1 or 2 "MacroOPs". VectorPath
instructions are decoded into 3+ MacroOPs using some on-chip ROM.
Decoding a VectorPath instruction can block decoding of a DirectPath
instruction.)
Really basic ops (eg. add, inc on registers) are DirectPath and take >= 1
cycle.
Interestingly, all the normal register push/pop instructions are
also VectorPath and take >= 4 cycles. But when I replaced the inner loop
in the above program with
asm volatile ("pushl %esi ; popl %edi");
asm volatile ("pushl %edi ; popl %esi");
it executes in 5.28 seconds with CPI=0.82 and CPop=0.6
- (instrs : ops) ratio = 1 : 1.2
- CPI = 0.82
- CPop = 0.67
so the implementation of pushf/popf clearly sucks.
N
|
|
From: Julian S. <js...@ac...> - 2002-11-21 09:01:36
|
It's clear we need to do something to fix this. This is a proposal
with two variants; a less ambitious variant (first) and a more
ambitious one (second).
Variant 1 (less ambitious)
~~~~~~~~~~~~~~~~~~~~~~~~~~
We create three new functions, all of them predicates on UInstrs:
1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
which tell you whether or not a uinstr (more correctly, the code
generated for that uinstr) **might** (1) alter the real machine's
%eflags, (2) alter the real machine's FPU state, and (3) require
to see the %EIP of the simulated machine.
Getting a safe-if-conservative result for (1) and (2) is critical
for translation correctness. For (3) it just affects the accuracy
of stack traces. So the safe thing to return is True for all
functions, yet we want to return False as often as we safely can.
Classical compiler-analysis conservative-estimate stuff.
Further, for all skins which extend ucode, we add suitable functions
to the core/skin iface so that the same questions can be asked of the
extended ucodes.
With fn (3), the redundant INCEIP removal phase can be reinstated
for all skins, and should work correctly. The reason it's commented
out is precisely because at present it can't reliably ask this
question.
Fn (2) would be used to formally and cleanly support Jeremy's lazy
FPU state save/restore, which I think is rolled into the code
generation loop. I would like to be able to ship this optimisation
with confidence that I understand the ramifications.
Fn (1) would dually be used to support lazy %EFLAGS save/restore in
the same manner as the FPU.
None of this would be hard to implement, and they would support a
useful bunch of optimisations.
Variant 2 (more ambitious)
~~~~~~~~~~~~~~~~~~~~~~~~~~
We create two new functions, both of them predicates on UInstrs:
1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
We do the lazy FPU and EFLAGS save/restore optimisation exactly as
above.
The difference here is that we do table-based EIP reconstruction as
needed and completely nuke INCEIP (yay!)
So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
%EIP must be made up-to-date at the start of each bb, but it is
anyway, so that's free. Now suppose we are in the middle of a bb and
want to know the current %EIP. We actually know the block-start %EIP
and the current %eip. Using the block-start %EIP we can look in the
translation table (TT) to find info about this block: the start
address of its translation, and presumably whereabouts its
%eip->%EIP mapping table is.
Knowing where the translation starts (%eip for the start of the
translated block) and knowing the current %eip, we can figure out how
far inside the translation we are. That offset can then be used to
index (somehow) the mapping table for this block, to give the
corresponding %EIP offset, which, when added to the block-start %EIP,
gives the current %EIP.
Makes sense?
So the only question is how to compactly encode the mapping table,
and perhaps where to store them, but that's not a big deal.
Comments?
If we could do variant 2 rather than variant 1 that would be cool,
considering it would give a bigger speedup overall.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-21 08:43:19
|
On Thursday 21 November 2002 1:34 am, Jeremy Fitzhardinge wrote:
> On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> > This is the third of two messages about t-chaining. Get a stiff
> > whisky before reading further; you'll need it.
> >
> > Using the attached prog, it's easy to show that on my P3, each
> > (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> > although I haven't verified that. Considering that the P6 pipe is alleged
> > to be about 10 stages, I strongly suspect they both cause a pipe
> > flush. Ie, each is as bad as a mispredicted jump.
>
> Yes, the P3 optimisation guide says that pushf/popf are complex
> instructions (ie, microcoded). They don't stall the pipeline per-se,
> but they limit the decode rate and serialise stuff which needn't be
> serialised.
True. Nevertheless, if they are merely vector-decode insns and
don't flush the pipe, they are darned expensive. Perhaps there is
a constant set-up cost of a few cycles for a vector-decode insn?
Does the P3 opt guide say anything about that? I have the P4 one
to hand and read that, but it's pretty unhelpful for the P3 :)
> Given gems such as:
>
> pushl 32(%ebp)
> popfl
> subl $0x3E7, %eax
> pushfl
> popl 32(%ebp)
> pushl 32(%ebp)
> popfl
> jnle-8 %eip+13
This really is a lemon, isn't it? One good thing is that (a) lazy save/
restore of the flags will nuke 4 of the 6 {push,pop}fl ops, and (b)
this is a very common idiom, arising from the original
"do-alu-op-and-set-flags ; conditional jump".
> what was the problem with lazy save/restore of the flags again?
Only really my paranoia about getting it right. Clearly you haven't
spent enough hours chasing weird bugs to do with state leakage from
the real machine to the simulated one :)
> Still, is this really enough to slow down the whole block? What about
> AGI stalls?
Well, that's a good question. Firstly, I don't know the rules on P3 for
AGI stalls, but looking at the main part of it, 8 copies of this:
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
an AGU is only needed by the leal (and the movl?), but at most 2 x every 3
insns, so I reckon there's enough slack there to run these groups at a CPI
of 1.0 or better.
Let's re-consider the numbers. The loop runs 25.0 million times in 1.56
seconds, which is 16.02 million/second. At 1133 MHz, that's 70.7 cycles
per iteration, give or take. And the loop is about 42 insns long.
Also, there are 6 pushf/popfs ("*fs") in the loop.
In the worst case, those *f would occupy 6 * 12 cycles (12 as from my
mini-experiment last night), = 72. Clearly this isn't possible since
the loop only takes 71 cycles and contains 36 other insns too.
However, even if we conservatively assume that each *f insn takes half
what I measured, ie 6 cycles, that leaves 35 cycles for the remaining
36 insns, which is a CPI of about 0.97, close to the CPI of the original
loop. That sounds more realistic. So I'd say it's not implausible to
argue that those 6 *f insns consume half the entire running time.
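The arithmetic above can be checked mechanically; this sketch reproduces the ~70.7 cycles/iteration and the ~0.96-0.97 residual CPI from the quoted measurements (25.0 million iterations, 1.56 s, 1133 MHz, 6 flag insns at an assumed 6 cycles each, 36 remaining insns):

```c
#include <assert.h>
#include <math.h>

/* Cycles per loop iteration from trip count, wall time and clock. */
static double cycles_per_iter(double iters, double secs, double mhz)
{
    return (mhz * 1e6) / (iters / secs);
}

/* CPI of the remaining insns, assuming each of n_f pushf/popf-style
   insns costs f_cycles out of the per-iteration cycle budget. */
static double rest_cpi(double cyc, int n_f, double f_cycles, int n_rest)
{
    return (cyc - n_f * f_cycles) / n_rest;
}
```

Even under the conservative 6-cycle assumption, the six flag insns account for roughly half the 71-cycle iteration.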
Proposal in the next message.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-21 01:34:09
|
On Wed, 2002-11-20 at 16:15, Julian Seward wrote:
> This is the third of two messages about t-chaining. Get a stiff
> whisky before reading further; you'll need it.
>
> Using the attached prog, it's easy to show that on my P3, each
> (pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
> although I haven't verified that. Considering that the P6 pipe is alleged
> to be about 10 stages, I strongly suspect they both cause a pipe
> flush. Ie, each is as bad as a mispredicted jump.
Yes, the P3 optimisation guide says that pushf/popf are complex
instructions (ie, microcoded). They don't stall the pipeline per-se,
but they limit the decode rate and serialise stuff which needn't be
serialised.
Given gems such as:
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
what was the problem with lazy save/restore of the flags again?
Still, is this really enough to slow down the whole block? What about
AGI stalls?
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-21 00:48:29
|
On Wed, 2002-11-20 at 14:20, Julian Seward wrote:
> A question: What's the deal on the LRU mechanism when
> chaining is going, now? Do chained translations get marked when they
> get used, now? Not afaics ... but this is going to present something
> of a problem when LRUing because chained ones will appear to not have
> been used, and so will be much more likely to be dumped. Perhaps we can
> do something clever based on the idea that if a T has been chained
> it must have been used at least once since the last LRU pass.
I did notice the problem and made a small attempt to address it:
patch_me updates the epoch field in the same way as the dispatch loop.
It doesn't actually help much, but it looks like I paid attention to it.
I was thinking of a scheme like an OS VM scheme: have a loop go through
the TTEs at some low rate and unchaining blocks; if on the 2nd pass the
block is still unchained and still hasn't updated the LRU entry, then
consider it old.
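This is essentially the classic second-chance ("clock") algorithm from OS page replacement, with the chained/referenced bits standing in for the hardware reference bit. A minimal sketch over a hypothetical TTE layout (field names invented for illustration, and the actual unchaining is stubbed out):

```c
#include <assert.h>

/* Hypothetical TTE fields, invented for this sketch. */
typedef struct {
    int valid;
    int chained;     /* currently chained to by other blocks          */
    int referenced;  /* touched by dispatch loop / patch_me since the
                        last sweep                                    */
    int old;         /* eviction candidate                            */
} TTE;

/* One sweep of the low-rate scan: a block that is still unchained and
   unreferenced from the previous sweep is declared old; everything
   else is unchained and given a second chance. */
static void clock_sweep(TTE *ttes, int n)
{
    for (int i = 0; i < n; i++) {
        if (!ttes[i].valid)
            continue;
        if (!ttes[i].chained && !ttes[i].referenced)
            ttes[i].old = 1;      /* idle for a whole sweep interval */
        ttes[i].chained = 0;      /* real code would patch the jump  */
        ttes[i].referenced = 0;
    }
}
```

A block that gets re-chained or re-dispatched between sweeps never becomes old, which is the property the two-pass scheme is after.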
J
|
|
From: Julian S. <js...@ac...> - 2002-11-21 00:08:20
|
This is the third of two messages about t-chaining. Get a stiff
whisky before reading further; you'll need it.
Using the attached prog, it's easy to show that on my P3, each
(pushf ; popf) pair takes 22 cycles. I assume that's 11 each,
although I haven't verified that. Considering that the P6 pipe is alleged
to be about 10 stages, I strongly suspect they both cause a pipe
flush. Ie, each is as bad as a mispredicted jump.
#include <rude_words.h>
This is bad news. Or a big opportunity to do better.
I have to go to bed now, but from a mental back-of-the-envelope
calculation I'd guess this easily explains the cycle loss from
the inner loop of the previous message.
J
#include <stdio.h>
int main ( int argc, char** argv )
{
int i;
for (i = 0; i < 1000 * 1000 * 1000; i++) {
asm volatile ("pushfl ; popfl");
asm volatile ("pushfl ; popfl");
}
return 0;
}
|
|
From: Julian S. <js...@ac...> - 2002-11-20 23:45:05
|
This is the second of two messages about t-chaining. Get a coffee
before reading further; you'll need it.
I found it pretty surprising that bzip2 runs at a 10x slowdown
on nulgrind, even with t-chaining engaged. I know that program
pretty well :) and it's completely compute-bound, no I/O, no malloc,
no weird stuff.
So I've just started messing with the attached program, designed
to compute, have an easily-located inner loop :) and to minimise
absolutely all other effects.
On my PIII-1.13 GHz box, compiled -O -g, it takes 0.250 s natively
and 1.56 seconds on nulgrind, with bb chaining on and INCEIPs
not generated (i.e., our max-performance settings so far).
It runs for 25.2 million bbs, btw.
[Attached below are: the C code, the original, Ucode, and generated
code for the inner loop, and the same with intervening junk removed
for easier reading].
That's a lossage factor of 6.2:1. This _seems_ quite good, until
we start looking at the translation of the inner loop. The original
is 12 (action-packed) x86 insns. The expected-case through the
translation is 44 x86 insns + ud2; ud2; nop, possibly about 42
in total once the patching has been done.
Now I'm at a bit of a loss. Although insn counts are notoriously
crappy for predicting execution times, it seems a little weird that
increasing the number of insns by a factor of 3.5 increases the
cycle count by 6.2 times. The program should run, both natively
and on V, with zero misses in any caches. And, once chained, this
bb is jumping back to itself directly, so nothing else is going on.
So where the @*%£&?! are my missing cycles?
Particularly in view of the fact that the generated code is very
much in x86 RISC subset, and so should generate fewer uops/insn
than the original. For example, it contains several "movl (%esi), %esi".
Let's recap.
- There are no cache misses.
- There are no unpredictable branches.
- Each insn [in the resulting translation] needs either one
Address-Generation-Unit (leal), one Load-Store-Unit (movl) or one
ALU (add, sub, xor). So the bulk of it is unlikely to be stalling
due to a shortage of functional units.
- Perhaps stalls due to result forwarding? But the problem with
that is that (translation)
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and original
addl 0xC(%reg1,%reg2),%reg3
I'd guess require the same functional units chained in the same order
(AGU -> LSU -> ALU) ... who knows?
- Decode limitations? PIII (P6 core) has this weird 4-1-1 rule, I don't
know if that's a problem. Surely, there are 8 groups of
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and each would fit in a 4-1-1 uop pattern?
- So perhaps there are some rare insns which completely serialise the
machine? We've got two pushfls and two popfls (partially redundant, too)
and they are pretty rare, hence might get executed out of ucode ROM.
I'm mystified. Someone with a cycle-accurate PIII model needs to run this
and see where it stalls.
Nick, what sort of time ratios do you get on your Athlonic device for
this?
[useless numbers: 25.0 million trips thru this loop.
Natively, this is 12 orig insns/iteration, = 300 million
insns. In 0.250 seconds, this is 1200 MIPS; at 1133 MHz,
that's a CPI of 0.94, which is pretty much Not Bad.
Translated, this is approx 42 insns/iteration, = 1050
million insns. In 1.56 seconds, this is 673 MIPS; at 1133 MHz,
that's a CPI of 1.68. This is significantly worse for no
apparent reason.
]
I reckon there's a large heavy object lurking in the undergrowth
here. Now we just need to find it. Anyone fancy lifting out
the translated loop, giving it a suitable context and running it
on VTune, to find out? Is oprofile able to tell us exactly
which insn is stalling?
Congratulations to anyone who bothered to read this far.
J
---------------------------------------------------------------------------
Original C code:
#include <stdio.h>
int i_fit_comfortably_in_D1[1000];
int timewaster ( void )
{
int i, j, sum;
for (i = 0; i < 1000; i++) {
i_fit_comfortably_in_D1[i] = i * 654321;
}
sum = 0;
for (j = 0; j < 200000; j++) {
for (i = 0; i < 1000; i += 8) {
sum += i_fit_comfortably_in_D1[i];
sum ^= i_fit_comfortably_in_D1[i+1];
sum -= i_fit_comfortably_in_D1[i+2];
sum += i_fit_comfortably_in_D1[i+3];
sum += i_fit_comfortably_in_D1[i+4];
sum ^= i_fit_comfortably_in_D1[i+5];
sum -= i_fit_comfortably_in_D1[i+6];
sum += i_fit_comfortably_in_D1[i+7];
}
}
return sum;
}
int main ( void )
{
printf ("the answer, you will be fascinated to know, is ... ");
fflush(stdout);
printf("0x%x\n", timewaster() );
return 0;
}
---------------------------------------------------------------------------
Original x86 code for inner loop:
0x8048508: leal 0x0(,%ecx,4), %eax
0x804850F: addl (%eax,%ebx,1),%edx
0x8048512: xorl 4(%ebx,%eax,1),%edx
0x8048516: subl 8(%ebx,%eax,1),%edx
0x804851A: addl 12(%ebx,%eax,1),%edx
0x804851E: addl 16(%ebx,%eax,1),%edx
0x8048522: xorl 20(%ebx,%eax,1),%edx
0x8048526: subl 24(%ebx,%eax,1),%edx
0x804852A: addl 28(%ebx,%eax,1),%edx
0x804852E: addl $0x8, %ecx
0x8048531: cmpl $0x3E7, %ecx
0x8048537: jle-8 0x8048508
---------------------------------------------------------------------------
Cleaned-up ucode for inner loop:
0: GETL %ECX, t2
1: MOVL $0x0, t0
2: LEA2L 0(t0,t2,4), t0
3: PUTL t0, %EAX
4: GETL %EBX, t6
6: LEA2L 0(t0,t6,1), t4
7: LDL (t4), t4
8: ADDL %EDX, t4
12: LEA2L 4(t6,t0,1), t10
13: LDL (t10), t10
14: XORL t4, t10
18: LEA2L 8(t6,t0,1), t16
19: LDL (t16), t16
21: SUBL t16, t10
25: LEA2L 12(t6,t0,1), t24
26: LDL (t24), t24
27: ADDL t10, t24
31: LEA2L 16(t6,t0,1), t30
32: LDL (t30), t30
33: ADDL t24, t30
37: LEA2L 20(t6,t0,1), t36
38: LDL (t36), t36
39: XORL t30, t36
43: LEA2L 24(t6,t0,1), t42
44: LDL (t42), t42
46: SUBL t42, t36
50: LEA2L 28(t6,t0,1), t50
51: LDL (t50), t50
52: ADDL t36, t50
53: PUTL t50, %EDX
55: ADDL $0x8, t2
56: PUTL t2, %ECX
58: SUBL $0x3E7, t2 (-wOSZACP)
59: Jleo $0x8048508 (-rOSZACP)
60: JMPo $0x8048539 ($2)
---------------------------------------------------------------------------
Generated x86 code for inner loop:
0: FF 0D 0C FD 0D 40
decl (0x400DFD0C)
6: 75 06
jnz-8 %eip+6
8: BD 1D 00 00 00
movl $0x1D, %ebp
13: C3
ret
0: GETL %ECX, %eax [a-----]
14: 8B 45 04
movl 0x4(%ebp), %eax
1: MOVL $0x0, %ebx [ab----]
17: 31 DB
xorl %ebx, %ebx
2: LEA2L 0(%ebx,%eax,4), %ebx [ab----]
19: 8D 5C 83 00
leal 0x0(%ebx,%eax,4), %ebx
3: PUTL %ebx, %EAX [ab----]
23: 89 5D 00
movl %ebx, 0x0(%ebp)
4: GETL %EBX, %ecx [abc---]
26: 8B 4D 0C
movl 0xC(%ebp), %ecx
5: LEA2L 0(%ebx,%ecx,1), %edx [abcd--]
29: 8D 54 0B 00
leal 0x0(%ebx,%ecx,1), %edx
6: LDL (%edx), %edx [abcd--]
33: 8B 12
movl (%edx), %edx
7: ADDL %EDX, %edx [abcd--]
35: 03 55 08
addl 0x8(%ebp), %edx
8: LEA2L 4(%ecx,%ebx,1), %esi [abcdS-]
38: 8D 74 19 04
leal 0x4(%ecx,%ebx,1), %esi
9: LDL (%esi), %esi [abcdS-]
42: 8B 36
movl (%esi), %esi
10: XORL %edx, %esi [abc-S-]
44: 31 D6
xorl %edx, %esi
11: LEA2L 8(%ecx,%ebx,1), %edi [abc-SD]
46: 8D 7C 19 08
leal 0x8(%ecx,%ebx,1), %edi
12: LDL (%edi), %edi [abc-SD]
50: 8B 3F
movl (%edi), %edi
13: SUBL %edi, %esi [abc-S-]
52: 29 FE
subl %edi, %esi
14: LEA2L 12(%ecx,%ebx,1), %edx [abcdS-]
54: 8D 54 19 0C
leal 0xC(%ecx,%ebx,1), %edx
15: LDL (%edx), %edx [abcdS-]
58: 8B 12
movl (%edx), %edx
16: ADDL %esi, %edx [abcd--]
60: 01 F2
addl %esi, %edx
17: LEA2L 16(%ecx,%ebx,1), %esi [abcdS-]
62: 8D 74 19 10
leal 0x10(%ecx,%ebx,1), %esi
18: LDL (%esi), %esi [abcdS-]
66: 8B 36
movl (%esi), %esi
19: ADDL %edx, %esi [abc-S-]
68: 01 D6
addl %edx, %esi
20: LEA2L 20(%ecx,%ebx,1), %edx [abcdS-]
70: 8D 54 19 14
leal 0x14(%ecx,%ebx,1), %edx
21: LDL (%edx), %edx [abcdS-]
74: 8B 12
movl (%edx), %edx
22: XORL %esi, %edx [abcd--]
76: 31 F2
xorl %esi, %edx
23: LEA2L 24(%ecx,%ebx,1), %esi [abcdS-]
78: 8D 74 19 18
leal 0x18(%ecx,%ebx,1), %esi
24: LDL (%esi), %esi [abcdS-]
82: 8B 36
movl (%esi), %esi
25: SUBL %esi, %edx [abcd--]
84: 29 F2
subl %esi, %edx
26: LEA2L 28(%ecx,%ebx,1), %esi [a--dS-]
86: 8D 74 19 1C
leal 0x1C(%ecx,%ebx,1), %esi
27: LDL (%esi), %esi [a--dS-]
90: 8B 36
movl (%esi), %esi
28: ADDL %edx, %esi [a---S-]
92: 01 D6
addl %edx, %esi
29: PUTL %esi, %EDX [a-----]
94: 89 75 08
movl %esi, 0x8(%ebp)
30: ADDL $0x8, %eax [a-----]
97: 83 C0 08
addl $0x8, %eax
31: PUTL %eax, %ECX [a-----]
100: 89 45 04
movl %eax, 0x4(%ebp)
32: SUBL $0x3E7, %eax (-wOSZACP) [------]
103: FF 75 20 9D
pushl 32(%ebp) ; popfl
107: 81 E8 E7 03 00 00
subl $0x3E7, %eax
113: 9C 8F 45 20
pushfl ; popl 32(%ebp)
33: Jleo $0x8048508 (-rOSZACP) [------]
117: FF 75 20 9D
pushl 32(%ebp) ; popfl
121: 7F 0D
jnle-8 %eip+13
123: B8 08 85 04 08
movl $0x8048508, %eax
128: 89 45 24
movl %eax, 0x24(%ebp)
131: 0F 0B 0F 0B 90
ud2; ud2; nop
34: JMPo $0x8048539 ($2) [------]
136: B8 39 85 04 08
movl $0x8048539, %eax
141: 89 45 24
movl %eax, 0x24(%ebp)
144: 0F 0B 0F 0B 90
ud2; ud2; nop
---------------------------------------------------------------------------
Generated x86 code for inner loop, with crud removed, and bb start points
distinguished by a blank line.
decl (0x400DFD0C)
jnz-8 %eip+6
movl $0x1D, %ebp
ret
movl 0x4(%ebp), %eax
xorl %ebx, %ebx
leal 0x0(%ebx,%eax,4), %ebx
movl %ebx, 0x0(%ebp)
movl 0xC(%ebp), %ecx
leal 0x0(%ebx,%ecx,1), %edx
movl (%edx), %edx
addl 0x8(%ebp), %edx
leal 0x4(%ecx,%ebx,1), %esi
movl (%esi), %esi
xorl %edx, %esi
leal 0x8(%ecx,%ebx,1), %edi
movl (%edi), %edi
subl %edi, %esi
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
leal 0x14(%ecx,%ebx,1), %edx
movl (%edx), %edx
xorl %esi, %edx
leal 0x18(%ecx,%ebx,1), %esi
movl (%esi), %esi
subl %esi, %edx
leal 0x1C(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
movl %esi, 0x8(%ebp)
addl $0x8, %eax
movl %eax, 0x4(%ebp)
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
movl $0x8048508, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
movl $0x8048539, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
From: Julian S. <js...@ac...> - 2002-11-20 22:14:02
[I was going to put two topics in here, but decided it's simpler to put each in its own msg.]

You'll have to forgive me for not quite keeping up with you young 'uns. I see the stuff on extended bbs but haven't grokked it properly yet. Besides, I spent half the day in meetings with CPU architects, and if that isn't enough to put one's head in a spin, I don't know what is.

Jeremy: I've built it with the 47-chained-bb patch and it works. A question: what's the deal on the LRU mechanism now that chaining is in? Do chained translations get marked when they get used? Not afaics ... but this is going to present something of a problem when LRUing, because chained ones will appear not to have been used, and so will be much more likely to be dumped. Perhaps we can do something clever based on the idea that if a T has been chained, it must have been used at least once since the last LRU pass. Not critical, but something we need to have a plausible story on.

J