From: Jeremy F. <je...@go...> - 2002-11-22 18:05:22
On Fri, 2002-11-22 at 09:37, Nicholas Nethercote wrote:
> That should work so long as you do it in the skin function called
> immediately from generated code. Currently, for example, MemCheck calls
> an assembly routine which calls a C function which causes a chain of calls
> before VG_(get_ExeContext)() is called.
Well, if there's a helper assembly routine, it would be trivial for it
to:
movl (%esp), %reg
movl %reg, NN(%ebp)
so it can be easily found by whoever needs it (and if we ran all helpers
through a wrapper like this, there wouldn't be a problem).
> I think that if we go down the eip-to-EIP route, the
> needle-in-a-call-stack approach is worth it so that it's all completely
> transparent to skins. If the skin has to remember to call
> __builtin_return_address before recording EIP that's a bit nasty.
Well, it isn't too bad, since there'll be this argument you need to pass
a value to, and it has to come from somewhere. But if you consider how
much work it is to avoid a single "movl $EIP, VGOFF_(m_eip)(%ebp)", it
doesn't seem like such a good idea.
J
From: Nicholas N. <nj...@ca...> - 2002-11-22 17:37:23
On 22 Nov 2002, Jeremy Fitzhardinge wrote:
> If the helper/skin function wants it, wouldn't using
> __builtin_return_address(0) near the top of the helper be a good enough
> solution? Is it that you don't want to require helpers/skin callbacks
> to do this? What am I missing?
That should work so long as you do it in the skin function called
immediately from generated code. Currently, for example, MemCheck calls
an assembly routine which calls a C function which causes a chain of
calls before VG_(get_ExeContext)() is called. If it was done with
__builtin_return_address(0), it would have to change so a C function was
called (not so hard I think) and then the %eip would have to be passed
through the calls, which is a bit of a pain...
I think that if we go down the eip-to-EIP route, the
needle-in-a-call-stack approach is worth it so that it's all completely
transparent to skins. If the skin has to remember to call
__builtin_return_address before recording EIP that's a bit nasty.
> So, I guess I'm veering away from mapping tables and complexity again.
Hmm... yeah, maybe. I'm completely unsure now.
N
From: Jeremy F. <je...@go...> - 2002-11-22 17:07:48
On Fri, 2002-11-22 at 00:28, Julian Seward wrote:
> I have thought of a way to get hold of %eip in helper fns, skin independently
> and without slowing down the normal I-don't-want-to-know-%eip case, but it has
> a high coefficient of horribleness [although it is easy to implement].
>
> Basic idea is that we can probably find the return address for the call to the
> helper, lying round on the stack, without outside assistance.
Um, isn't the return address into the generated code exactly and
precisely at (%esp)? If the helper/skin function wants it, wouldn't
using __builtin_return_address(0) near the top of the helper be a good
enough solution? Is it that you don't want to require helpers/skin
callbacks to do this? What am I missing?
> How does that sound? Does it make sense? Is it utterly horrible?
> I'm not sure if something this fragile is really a good idea, but still
> it might be worth trying ...
It seems a bit over-complex to me. If the problem is that we want
things which need %EIP can get to the up to date value, then I think we
can probably do it very simply while still avoiding the expense of
INCEIP:
1. Have a bit in each UInstr which means "update EIP" (as Nick
suggested)
2. When generating code for a UInstr with that bit set, emit "movl
$EIP, XX(%ebp)" before generating the rest of the code
Done. No rummaging, no tables.
It seems to me that the overhead of INCEIP is most noticeable with
simple skins which don't do much in the way of per-instruction work. As
soon as the skin starts doing reasonably complex things, then the
overhead of INCEIP (or its replacement) becomes less of an issue.
Therefore, I don't think it's all that important to try and absolutely
squash any EIP manipulation in generated code, because it doesn't matter
that much.
The trouble with a search-around-for-EIP scheme is that it works OK for
the cases where there are very few calls out to the skin, but if you
don't have many skin calls, then you don't need to update the EIP that
much anyway. If you have lots of skin calls which need EIP, then doing
lots of searches is expensive, whereas the generated update is still
cheap.
I suppose the common case for something like memcheck is that there are
lots of skin helper calls, but only those which are actually reporting
errors care about EIP. My suspicion is that the work that memcheck does
anyway would overwhelm the effect of an extra EIP store per call.
So, I guess I'm veering away from mapping tables and complexity again.
BTW, even if we kill INCEIP, we still need to have a UInstr which marks
the boundaries between x86 instructions, even if it is functionally a
NOP.
J
From: Jeremy F. <je...@go...> - 2002-11-22 17:04:29
On Fri, 2002-11-22 at 01:08, Julian Seward wrote:
> One comment is that it might be worth having in each TT entry a small number
> of such slots, say 8. That way most basic blocks can store their table in
> the TT entry. For the occasional large bb, TT will have to contain a ptr to
> a table allocated in VG_(malloc)ville. This seems like a good idea because
> there's a considerable space overhead using VG's malloc stuff -- about 6 words
> per alloc -- so doing a large number of small malloc's is quite wasteful of
> space.
I was thinking about tacking the table onto the end of the generated
code, so it just lives in the translation cache. Unfortunately that
would break a nice property the current cache has: many chained jumps
are short, and their target is within the same cache line (because the
translation cache tends to lay basic blocks out in the order they were
executed). Still, I don't think that's a huge loss (definitely not
compared to removing INCEIP).
J
From: Jeremy F. <je...@go...> - 2002-11-22 16:06:39
On Thu, 2002-11-21 at 23:58, Julian Seward wrote:
> Sorry to be stupid. I'm still confused. In that case, wouldn't it be
> simpler just to set %EIP to the correct value before the call, and
> completely avoid all the hassle of establishing %eip and converting it
> to %EIP? I think I must be missing something.
No, you're right, I'm being silly.
J
From: Nicholas N. <nj...@ca...> - 2002-11-22 12:37:23
On Fri, 22 Nov 2002, Julian Seward wrote:
> [S2: suggestion of smarter code for Jz et al]
>
> Another thing which would surely help is to generate smarter code for
> the Jz / Jnz UINSTRS. Observation to be made here is that at least for
> the Zero flag, we might as well directly test the bit in %EFLAGS in
> memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
> %eflags: [note; this illustration is without t-chaining, but is equally
> valid in t-chaining's presence]
>
> Jzo $addr
> -->
> testl $(1<<N), 32(%ebp) -- 32(%ebp) is %EFLAGS
> -- and N is the offset of the Z flag in
> -- %EFLAGS
> jz-8 %eip+6
> movl $addr, %eax
> ret
> %eip+6:
>
> in contrast to the current scheme
>
> pushl 32(%ebp)
> popfl
> jnz-8 %eip+6
> movl $addr, %eax
> ret
> %eip+6:
>
> In fact the new scheme not only avoids the horrible popfl; it also
> saves an insn!
Nice idea. But doesn't it clash with the lazy EFLAGs idea? viz:
37: ANDL %eax, %ebx (-wOSZACP)
pushl 32(%ebp) ; popfl
andl %eax, %ebx
pushfl ; popl 32(%ebp) ***
39: Jzo $0x402047AC (-rOSZACP)
pushl 32(%ebp) ; popfl ***
jnz-8 %eip+6
movl $0x402047AC, %eax
ret
If Jzo doesn't have the "pushl 32(%ebp) ; popfl" then you can't remove the
"pushfl ; popl 32(%ebp)" from ANDL -- instead of cutting 4 instructions,
it only cuts 2.
Another thing: will the ANDL's "pushfl ; popl 32(%ebp)" always be
necessary? Ie. do the condition codes have to be saved in EFLAGs before
the end of the BB? If so, then lazy EFLAG updating can only save two
instructions, not four, in which case the new Jz/Jnz idea seems better
because it saves another instruction anyway...
> Some complicated tests (not-below, etc) would still have to be done the
> old way, but we could cover the tests associated with single flags
> (zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
> catch most _uses_ of the condition codes.
FYI, from "valgrind --skin=none --trace-codegen=00001 true":
all jumps 1908
JMPo 916
Jzo 407
Jnzo 150
JMPo-c 177
JMPo-r 121
Jle 20
Jnbo 25
Jbo 13
Jnbeo 12
Jns 16
Jnleo 9
JMPo-sys 7
sum of those listed: 1873 (35 not mentioned)
Of the 726 Jccs, 77% are Jz/Jnz.
N
From: Nicholas N. <nj...@ca...> - 2002-11-22 12:03:05
On Fri, 22 Nov 2002, Julian Seward wrote:
> Second part of the trick is that we can probably constrain ourselves to
> scanning about a dozen or so stack locations; the RA must be in that area.
> Consider what the stack looks like immediately after doing the call
> insn which leads to the helper (stack grows down):
>
> RA for VG_(run_innerloop)
> saved ebx, ecx, edx, esi, edi, ebp pushed at start of VG_(run_innerloop)
> RA for where VG_(run_innerloop) calls translation
> /* stuff below is created by translation and helpers it calls */
> RA for call to helper ("MEMEME") (since all args are passed in regs)
> /* stuff pushed by stack of helper functions */
> /* top-of-stack; %esp points here (or one word above) */
>
> Imagine now we grab the value of %esp at entry to VG_(run_innerloop) and
> park it in some global variable. If my picture of the stack is right, we
> can find the relevant RA ("MEMEME") by scanning downwards from that value
> for 8 or 9 words, looking for a word satisfying the constraints mentioned
> above. If we have to go even a few words further, we've probably missed it.
>
> The combination of value constraint and location constraint should make
> this fairly good at finding the correct RA. The main danger AFAICS is
> that the 6 regs saved at the start of VG_(run_innerloop) might hold
> a spuriously matching value. Even that can be avoided by snapshotting
> %eip after those are pushed, in which case we're even more strongly
> location-constrained: there can be only about 3 or 4 words to search
> before declaring that we've missed it somehow.
Do you mean "%esp after those are pushed"?
Actually, won't we know exactly which stack location it should be in, ie.
immediately after the RA of VG_(run_innerloop)? Or can extra stuff get
put between the two RAs?
> How does that sound? Does it make sense? Is it utterly horrible?
> I'm not sure if something this fragile is really a good idea, but still
> it might be worth trying ...
Another thing is that we have to subtract 6 (or whatever) from the RA to
get the right %eip before doing the eip-to-EIP lookup, but that's fine.
I'm starting to come around to this idea now...
N
From: Nicholas N. <nj...@ca...> - 2002-11-22 11:33:21
On Fri, 22 Nov 2002, Julian Seward wrote:
> I have thought of a way to get hold of %eip in helper fns, skin independently
> and without slowing down the normal I-don't-want-to-know-%eip case, but it has
> a high coefficient of horribleness [although it is easy to implement].
>
> Basic idea is that we can probably find the return address for the call to the
> helper, lying round on the stack, without outside assistance.
>
> Suppose (as is indeed the case) that %EIP is brought up to date at the
> start of each bb. Now the bb calls a helper fn and you want to know
> what %eip is at the point of the call.
>
> So we're looking for a word on the stack (the RA) that points back into
> the translation. Since we have %EIP for the bb, we can look up in TT
> the location and size of the translation. The word we're looking for
> must have a value in the range: location .. location + size - 1.
> Furthermore, 6 (?) bytes before the word must also be a valid location in the
> translation -- the call insn -- and it must contain the opcode for "call".
> [...]
> Actually I'd guess it's pretty robust, if the above analysis is correct.
Wow. I think it's the sort of idea that if I'd thought of it I would
think it was really cool and worth doing, but because someone else thought
of it I take two steps back and think hmm...
----
Ideally, we want these characteristics:
1a. simple idea
1b. simple effect on existing code (eg. affects only one place)
2. entirely skin-independent
3. maximally efficient, ie. only updates EIP when EIP is actually
recorded (I'm assuming we update EIP between BBs, though)
4. no extra space required
Your suggestion, assuming it works effectively, meets 2 and 3. It fails
1a; as for 1b, the finding-%eip complexity would all be in the one spot,
although it still needs the eip-to-EIP table.
----
Before you mentioned this finding-a-needle-in-a-call-stack idea, I was
leaning back towards the update-EIP-only-when-necessary approach.
%EIP only needs to be up-to-date for a small number of UInstrs:
a. CCALL/CALLM, if the called code records EIP
b. PUT %ESP, if stack change/post_mem_write events are tracked and the
called function records EIP
c. Any extended UInstr that records EIP
There are only two ways skin code can record EIP:
1. Call VG_(get_ExeContext)()
2. Call VG_(get_EIP)()
VG_(get_ExeContext)() could be changed so that it calls VG_(get_EIP)(),
which would essentially reduce the problem to a single point of code.
Using the latest ideas:
a. add a 1-bit field `records_eip' to CCALL and CALLM instructions that
a skin must turn on if the called code can record EIP
b. add a bool argument to the functions VG_(track_new_mem_stack)() etc,
which the skin makes `True' if the called code can record EIP
c. add another function that must be implemented by skins using extended
UCode, SK_(UInstr_record_EIP)()
a,b,c in this list correspond directly to a,b,c the list above.
(Actually, c could be handled by a's mechanism too.)
Any UInstr which could cause an EIP-record would have to be preceded with
a PUTEIP instruction (I agree that sounds better than INCEIP). Actually,
we could keep INCEIP instructions (necessary for determining x86
instruction boundaries), generate no code for them, and the codegen phase
could effectively add the PUTEIP code for UInstrs when the relevant
boolean is set without actually needing an explicit PUTEIP instruction.
This satisfies 1a and 4, isn't too bad for 1b, but fails the others. It
also makes the VG_(track_*) functions non-uniform, which is unfortunate.
----
So, I've criticised the flaws in the current proposals, but haven't come
up with anything better. Harumph.
----
A couple of other random things that occurred to me:
i. If we do use an eip-to-EIP table, it doesn't need to record entries
for every instruction, just those that can cause an EIP-recording
(CCALL/CALLM/PUT ESP/extended UInstrs).
ii. None of this affects Cachegrind, since it records %eip for every x86
instruction in its cost-centre anyway. At one point I took out all
INCEIPs for Cachegrind, but then put them back in because of the
weird-signals-sometimes issue.
N
From: Julian S. <js...@ac...> - 2002-11-22 09:01:44
> Maybe we should at least keep EIP up to date at the basic block level,
> so we only need to deal with the intra-BB case.
As per recent msg to Nick, doing so at least makes it easy to find the
%eip (and hence %EIP) by scanning the stack for helper-function
return addresses. One PUTEIP per bb doesn't seem too onerous.
> I was thinking of a table with one entry per original instruction, which
> contains the length of the original instruction and the length of the
> corresponding generated code. It may be possible to pack this into 2
> nibbles per instruction, with an escape mechanism to deal with rare
> cases (ie, if length == 0xF then get the details from the following
> byte). That could be built as the code is generated then tacked onto
> the end of the generated code (the TTE would contain the generated code
> length plus the length of the whole translation, including the
> instruction mapping table). That way doing the mapping is a simple
> linear scan:
>
> cur_eip = tte->trans_addr;
> cur_EIP = tte->orig_addr;
> entry = 0;
> while(eip > cur_eip) {
> cur_eip += tab[entry].trans_len;
> cur_EIP += tab[entry].orig_len;
> entry++;
> }
I was thinking of the same thing. We could be simpler and just use a pair
of bytes (orig_insn_len, trans_insn_len) for each orig insn; this gives an
overhead of 2 bytes per orig insn, which is not too bad. (About 2 mbyte for
a run of Mozilla).
We can easily do a bit better with some kind of cleverer
encoding scheme as you mention, if that's needed [probably an array of
12-bit entities, 4 bits for orig len (they are all in the range 1 .. 17)
and 8 for translated length, with perhaps (N, 0xFF) indicating that the
entire next 12-bit group is a translated length, in the (appalling) case
that one orig insn generates more than 254 bytes of translation.]
One comment is that it might be worth having in each TT entry a small number
of such slots, say 8. That way most basic blocks can store their table in
the TT entry. For the occasional large bb, TT will have to contain a ptr to
a table allocated in VG_(malloc)ville. This seems like a good idea because
there's a considerable space overhead using VG's malloc stuff -- about 6 words
per alloc -- so doing a large number of small malloc's is quite wasteful of
space.
Adding 8 slots to each TT covers most bbs, and would take 4 words
with the (byte,byte) encoding scheme and 3 words with the (4,8)
scheme.
J
From: Julian S. <js...@ac...> - 2002-11-22 08:24:08
Nick,
I have thought of a way to get hold of %eip in helper fns, skin independently
and without slowing down the normal I-don't-want-to-know-%eip case, but it has
a high coefficient of horribleness [although it is easy to implement].
Basic idea is that we can probably find the return address for the call to the
helper, lying round on the stack, without outside assistance.
Suppose (as is indeed the case) that %EIP is brought up to date at the
start of each bb. Now the bb calls a helper fn and you want to know
what %eip is at the point of the call.
So we're looking for a word on the stack (the RA) that points back into
the translation. Since we have %EIP for the bb, we can look up in TT
the location and size of the translation. The word we're looking for
must have a value in the range: location .. location + size - 1.
Furthermore, 6 (?) bytes before the word must also be a valid location in the
translation -- the call insn -- and it must contain the opcode for "call".
So you're heavily constrained as to the value of the word being searched
for.
Second part of the trick is that we can probably constrain ourselves to
scanning about a dozen or so stack locations; the RA must be in that area.
Consider what the stack looks like immediately after doing the call
insn which leads to the helper (stack grows down):
RA for VG_(run_innerloop)
saved ebx, ecx, edx, esi, edi, ebp pushed at start of VG_(run_innerloop)
RA for where VG_(run_innerloop) calls translation
/* stuff below is created by translation and helpers it calls */
RA for call to helper ("MEMEME") (since all args are passed in regs)
/* stuff pushed by stack of helper functions */
/* top-of-stack; %esp points here (or one word above) */
Imagine now we grab the value of %esp at entry to VG_(run_innerloop) and
park it in some global variable. If my picture of the stack is right, we
can find the relevant RA ("MEMEME") by scanning downwards from that value
for 8 or 9 words, looking for a word satisfying the constraints mentioned
above. If we have to go even a few words further, we've probably missed it.
The combination of value constraint and location constraint should make
this fairly good at finding the correct RA. The main danger AFAICS is
that the 6 regs saved at the start of VG_(run_innerloop) might hold
a spuriously matching value. Even that can be avoided by snapshotting
%eip after those are pushed, in which case we're even more strongly
location-constrained: there can be only about 3 or 4 words to search
before declaring that we've missed it somehow.
How does that sound? Does it make sense? Is it utterly horrible?
I'm not sure if something this fragile is really a good idea, but still
it might be worth trying ...
Actually I'd guess it's pretty robust, if the above analysis is correct.
And even if it calculates a wrong %EIP once in 10000 calls, do we care?
[perhaps cachegrind does? I don't know]. The calculated %EIP can be
sanity checked against the %EIP-saved-at-bb-start and the bb's known
original length as extracted from TT. If the %EIP calculated via this
method points outside that range, something's clearly wrong and so we
can ignore it and fall back to the %EIP-saved-at-bb-start.
J
From: Julian S. <js...@ac...> - 2002-11-22 07:51:04
On Friday 22 November 2002 7:08 am, Jeremy Fitzhardinge wrote:
> On Thu, 2002-11-21 at 23:07, Julian Seward wrote:
> > Um, I'm confused. I don't understand how to solve the following
> > problem: memcheck calls a helper function to do a check, concludes
> > there's an error and needs %EIP. How exactly do you propose to
> > generate it if you don't know %eip at the point where the helper was
> > called?
> Oh, sorry. I was assuming you'd store that in a baseblock field before
> the call. Hm, I guess that would need a trampoline to do it properly:
>
> movl %ccall-func, %eax
> call ccall-tramp
> [...]
>
> ccall-tramp:
> movl (%esp), %ebx
> movl %ebx, XX(%ebp)
> jmp *%eax
>
> Or perhaps do it completely inline. Maybe it's all a bit clumsy.
Sorry to be stupid. I'm still confused. In that case, wouldn't it be
simpler just to set %EIP to the correct value before the call, and
completely avoid all the hassle of establishing %eip and converting it
to %EIP? I think I must be missing something.
J
From: Jeremy F. <je...@go...> - 2002-11-22 07:08:50
On Thu, 2002-11-21 at 23:07, Julian Seward wrote:
> Um, I'm confused. I don't understand how to solve the following problem:
> memcheck calls a helper function to do a check, concludes there's an error
> and needs %EIP. How exactly do you propose to generate it if you don't
> know %eip at the point where the helper was called?
Oh, sorry. I was assuming you'd store that in a baseblock field before
the call. Hm, I guess that would need a trampoline to do it properly:
movl %ccall-func, %eax
call ccall-tramp
[...]
ccall-tramp:
movl (%esp), %ebx
movl %ebx, XX(%ebp)
jmp *%eax
Or perhaps do it completely inline. Maybe it's all a bit clumsy.
J
From: Julian S. <js...@ac...> - 2002-11-22 07:00:18
> > The problem appears to be how to get hold of %eip (the real one) when
> > a skin calls a helper function, in a way which is not skin-specific.
> > I thought about this this evening for some considerable time, but I
> > can't think of a clean solution which doesn't involve some amount of
> > performance loss. The least-worst thing I thought of was to associate
> > with CCALL uinstrs a boolean flag, which when set automagically causes
> > the current %eip to get passed to the called function as an extra
> > parameter. This at least makes it skin-independent.
> But we have a VG_(get_EIP)() function. Why can't skins just use that,
> and we hide all the complexity in that function? Why does a skin need
> to know about the hardware %eip?
Um, I'm confused. I don't understand how to solve the following problem:
memcheck calls a helper function to do a check, concludes there's an
error and needs %EIP. How exactly do you propose to generate it if you
don't know %eip at the point where the helper was called?
> That's nice and simple to implement. It's a good start. Has anyone
> looked to see what the proportions of the different comparisons are? I
> would expect Z and NZ would be way up there...
I bet Hennessy and Patterson has this info. I'll look.
J
From: Julian S. <js...@ac...> - 2002-11-22 05:07:01
So, I've read all today's email messages, I think.
Nick, it looks like you've picked up some good issues in today's hackery.
Let me try and re-draw the picture in light of them.
We're trying to fix 3 performance-sapping problems:
1. too many INCEIPs
2. too many eflags save/restores
3. too many fpu save/restores
3 is pretty much solved already by Jeremy's work. 1 and 2 are still open.
1. INCEIPs
~~~~~~~~~~~
The best way to handle the INCEIP expense is replace it all with a
table lookup. As Jeremy suggests, it's probably simpler to have one
big table than one table per bb, and I'm probably inclined in favour
of that.
The problem appears to be how to get hold of %eip (the real one) when
a skin calls a helper function, in a way which is not skin-specific.
I thought about this this evening for some considerable time, but I
can't think of a clean solution which doesn't involve some amount of
performance loss. The least-worst thing I thought of was to associate
with CCALL uinstrs a boolean flag, which when set automagically causes
the current %eip to get passed to the called function as an extra
parameter. This at least makes it skin-independent.
Unfortunately, it adds extra expense for almost all of the calls made
by cachegrind, memcheck and addrcheck, since most of those can potentially
report an error and so need %eip. This is not good. So I don't like
this idea.
2. eflags save/restores
~~~~~~~~~~~~~~~~~~~~~~~~
The second thing Nick has unearthed is that the lazy eflag optimisation
will often not apply, due to intervening flag-mangling operations.
[Nick: yes, your understanding of how this is supposed to work appears
to be correct]. This is bad; but I'd like to do better, because we're
really getting clobbered by these pushf/popf insns.
So here's ...
Yet another proposal
~~~~~~~~~~~~~~~~~~~~
If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
should back off from the %EIP-by-tables story; in effect fall back to
Variant (1) in this morning's proposal. That means retaining INCEIP, at
least when we can't get rid of redundant updates.
I've realised that perhaps INCEIP isn't a clever semantics. Problem is
that (1) it turns into (eg) addl $2, 36(%ebp), which means a load, ALU op,
and store, and (2) it trashes the flags.
[S1: suggestion of PUTEIP]
How about if we had instead PUTEIP, which simply sets the value of %EIP
to some literal value? So, we just emit PUTEIPs with absolute addresses
where currently we create INCEIPs with deltas. PUTEIP then becomes
something like
movl $0x8048517, 36(%ebp)
which is a longer insn (7 bytes?), but is still a fastpath decode at least
on Athlon. It has the advantage of not requiring a load or an ALU op, so is
less resource-hungry. And best of all, it is flag-unaffecting, so that we
can now do the lazy eflags stuff without getting messed up by them.
[S2: suggestion of smarter code for Jz et al]
Another thing which would surely help is to generate smarter code for
the Jz / Jnz UINSTRS. Observation to be made here is that at least for
the Zero flag, we might as well directly test the bit in %EFLAGS in
memory -- in fact this would be a lot cheaper than hauling %EFLAGS into
%eflags: [note; this illustration is without t-chaining, but is equally
valid in t-chaining's presence]
Jzo $addr
-->
testl $(1<<N), 32(%ebp) -- 32(%ebp) is %EFLAGS
-- and N is the offset of the Z flag in
-- %EFLAGS
jz-8 %eip+6
movl $addr, %eax
ret
%eip+6:
in contrast to the current scheme
pushl 32(%ebp)
popfl
jnz-8 %eip+6
movl $addr, %eax
ret
%eip+6:
In fact the new scheme not only avoids the horrible popfl; it also
saves an insn!
Some complicated tests (not-below, etc) would still have to be done the
old way, but we could cover the tests associated with single flags
(zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
catch most _uses_ of the condition codes.
[S3: suggestion of uinstr reordering]
Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
created by memcheck and cachegrind will often get in the way of the lazy
eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
specifically) to generate instrumentation a bit more cleverly so as to try
and keep condition code generators and users together more often. Dunno if
this will help.
J
From: Jeremy F. <je...@go...> - 2002-11-22 02:53:21
On Thu, 2002-11-21 at 15:31, Julian Seward wrote:
> Ha, that sounds like it might work well enough. I sometimes wondered
> how OSs implement the pseudo-LRU stuff for VMs. Is there a standard
> description of this, that you know of?
This is called the one-hand clock algorithm. The two-hand variant is
used when the scan over the whole of memory would take too long, and you
want to test before a complete pass has been made. I don't know what
the standard reference is, but I think pretty much all OS books talk
about it (Tanenbaum, for example).
> Yes, that is the simplest thing, but I fear it would be very expensive.
> A less radical similar scheme is to randomly discard some fraction of
> the translations. Then you get into schemes where you randomly move
> some fraction of the translations into a "victimisation cache" (in
> hardware speak). If they turn out to be needed again soon, they can
> be moved back from the cache back into the main set of active translations.
> If the V-cache gets full _it_ then is flushed. This scheme has the
> effect of somewhat mitigating the situation where a translation is
> randomly selected for demotion and then used again soon afterwards.
> But it's all a bit complex.
Dynamo doesn't quite dump everything when the cache gets full. It keeps
track of the rate of creation of new basic blocks, and dumps the cache
when this starts getting high, on the grounds that the program is
entering a new working set and the cache is stale anyway. You could
define "too high" in terms of how full the cache is in proportion to the
size you'd like it to be, since you'd obviously like a high cache fill
rate when the cache is empty.
A generational caching scheme sounds good, so long as it doesn't
actually involve physically moving the code.
J
From: Jeremy F. <je...@go...> - 2002-11-22 02:41:14
On Thu, 2002-11-21 at 16:08, Julian Seward wrote:
> 1. INCEIPs
> ~~~~~~~~~~~
> The best way to handle the INCEIP expense is replace it all with a
> table lookup. As Jeremy suggests, it's probably simpler to have one
> big table than one table per bb, and I'm probably inclined in favour
> of that.
Eh? No, I think one table per BB is easiest to handle, particularly if
we keep EIP up to date at least to the current basic block.
> The problem appears to be how to get hold of %eip (the real one) when
> a skin calls a helper function, in a way which is not skin-specific.
> I thought about this this evening for some considerable time, but I
> can't think of a clean solution which doesn't involve some amount of
> performance loss. The least-worst thing I thought of was to associate
> with CCALL uinstrs a boolean flag, which when set automagically causes
> the current %eip to get passed to the called function as an extra
> parameter. This at least makes it skin-independent.
But we have a VG_(get_EIP)() function. Why can't skins just use that,
and we hide all the complexity in that function? Why does a skin need
to know about the hardware %eip?
> Unfortunately, it adds extra expense for almost all of the calls made
> by cachegrind, memcheck and addrcheck, since most of those can potentially
> report an error and so need %eip. This is not good. So I don't like
> this idea.
Memcheck only needs %EIP when actually reporting an error though.
> 2. eflags save/restores
> ~~~~~~~~~~~~~~~~~~~~~~~~
> The second thing Nick has unearthed is that the lazy eflag optimisation
> will often not apply, due to intervening flag-mangling operations.
> [Nick: yes, your understanding of how this is supposed to work appears
> to be correct]. This is bad; but I'd like to do better, because we're
> really getting clobbered by these pushf/popf insns.
I had an idea about restoring eflags without pushf/popf: just rerun the
setting instruction. Since (almost?) all the instructions which
generate eflags are otherwise side-effect free and don't depend on
memory, why not just rerun them on the same values just before the
jump? (Or perhaps move it altogether, but that might be trickier than
just copying the instruction.)
I haven't looked at this in detail, but it's an amusing idea...
Hm, I guess making sure the flags are correct for the end of the BB
still needs them to be explicitly saved. Which I suppose we could do in
the form of saving a piece of code which has the right effect, but
that's probably going a bit far...
> If we can't cleanly solve problem (1) [getting hold of %eip], perhaps we
> should back off from the %EIP-by-tables story; in effect fall back to
> Variant (1) in this morning's proposal. That means retaining INCEIP, at
> least when we can't get rid of redundant updates.
I still don't think it's hard to efficiently implement an EIP table
lookup.
> How about if we had instead PUTEIP, which simply sets the value of %EIP
> to some literal value? So, we just emit PUTEIPs with absolute addresses
> where currently we create INCEIPs with deltas. PUTEIP then becomes
> something like[...]
I do this in the t-chaining patch. Just before each jump, it generates:
movl $target-addr, %eax
movl %eax, 36(%ebp)
It means that if the chaining patch fails, it can still just ret into
the dispatch loop, and %eax is set up properly, as well as having %EIP
up to date in case there's a context switch.
I currently generate this inline, but I was thinking of adding a SETEIP
with these semantics. This would also be useful to distinguish basic
blocks in an extended basic block when doing trace caching.
> [S2: suggestion of smarter code for Jz et al]
>
> Another thing which would surely help is to generate smarter code for
> the Jz / Jnz UINSTRS. Observation to be made here is that at least for
> the Zero flag, we might as well directly test the bit in %EFLAGS in
> memory [...]
> Some complicated tests (not-below, etc) would still have to be done the
> old way, but we could cover the tests associated with single flags
> (zero/nonzero, sign/nonsign, carry/nocarry) and I think that would
> catch most _uses_ of the condition codes.
That's nice and simple to implement. It's a good start. Has anyone
looked to see what the proportions of the different comparisons are? I
would expect Z and NZ would be way up there...
Hm, looking at the definitions of the tests in the manual, it seems that
it would be pretty easy to generate open-coded equivalents for most of
them in much less than the 12 cycles to do popf...
> Even with the PUTEIP hack, I can see that, as Nick says, the instrumentation
> created by memcheck and cachegrind will often get in the way of the lazy
> eflags stuff. Now I'm wondering if it is possible for the skins (memcheck
> specifically) to generate instrumentation a bit more cleverly so as to try
> and keep condition code generators and users together more often. Dunno if
> this will help.
This is along the lines of my suggestion above of duplicating the
flags-setting instruction to just before the test. It's a little bit
like Shade's use of carefully chosen instructions to set the flags to
the desired value.
J