|
From: Josef W. <Jos...@gm...> - 2007-02-15 19:34:43
|
Hi,
we recently talked about VEX possibly inverting conditional jumps,
which can confuse tools' instrumentation.
I just came across a simple example of this using lackey:
==============================================
int main()
{
int i, sum=0;
for(i=0;i<1000000;i++)
sum += i;
return sum;
}
==============================================
Relevant assembler:
==============================================
...
80484ac: 03 d0 add %eax,%edx
80484ae: 83 c0 01 add $0x1,%eax
80484b1: 3d 40 42 0f 00 cmp $0xf4240,%eax
80484b6: 7c f4 jl 80484ac <main+0x18>
...
==============================================
Obviously, the conditional branch is taken 1000000 times.
Yet, "valgrind --tool=lackey ./loop" gives
==============================================
...
==24183== Jccs:
==24183== total: 1,036,628
==24183== taken: 14,348 ( 1%)
...
==============================================
Which is way off.
Using --trace-flags=11100000, one can see the inversion in the VEX code
(only the relevant parts are shown):
==============================================
==== BB 1435 main+24(0x80484AC) BBs exec'd 39078 ====
------------------------ Front end ------------------------
...
0x80484B6: jl-8 0x80484AC
------ IMark(0x80484B6, 2) ------
PUT(60) = 0x80484B6:I32
if (32to1(x86g_calculate_condition[mcx=0x13]{0x380a4ef0}(0xC:I32,GET:I32(32),GET:I32(36),GET:I32(40),GET:I32(
44)):I32)) goto {Boring} 0x80484AC:I32
goto {Boring} 0x80484B8:I32
------------------------ After pre-instr IR optimisation ------------------------
...
------ IMark(0x80484B6, 2) ------
PUT(60) = 0x80484B6:I32
t70 = CmpLT32S(t57,0xF4240:I32)
t81 = 1Uto32(t70)
t82 = 32to1(t81)
t83 = Not1(t82)
if (t83) goto {Boring} 0x80484B8:I32
goto {Boring} 0x80484AC:I32
}
...
==============================================
The "pre-instr IR optimisation" phase inverts the condition,
so that the original "taken" case becomes a fall-through, which
lackey counts as "not taken".
The solution is not to rely on the VEX code when deciding the "taken"
case, but to check whether the exit's guest-code target address is
non-sequential.
However, lackey is meant as an example to show how to write tools; it
is really bad to give wrong examples. So we could
(1) Get rid of JCC counting in lackey
(2) Correct it and show how to do it the right way
Solution (2) complicates lackey (a little?), but if it is also
a tutorial about branch counting, it should be correct.
So perhaps a third solution would be:
(3) Not invert the meaning of conditions in the "pre-instr IR optimisation"
phase
Comments?
Josef
|
|
From: Julian S. <js...@ac...> - 2007-02-15 19:46:13
|
> (2) Correct it and show how to do it the right way
>
> Solution (2) complicates lackey (a little?), but if it is also
> a tutorial about branch counting, it should be correct.
I agree. Sounds like a good solution to me.
> So perhaps a third solution would be:
> (3) Not invert the meaning of conditions in the "pre-instr IR optimisation"
> phase
That is hard to avoid given that IR optimisation does some simple
loop unrolling for single-bb loops. eg, it will unroll your
top:
add %eax,%edx
add $0x1,%eax
cmp $0xf4240,%eax
jl top
producing IR as if you had written
top:
add %eax,%edx
add $0x1,%eax
cmp $0xf4240,%eax
jnl after
add %eax,%edx
add $0x1,%eax
cmp $0xf4240,%eax
jl top
after:
See, the middle exit is now inverted.
Also, if we ever do more aggressive block formation which speculatively
chases across conditional branches ("trace formation"), then some branch
conditions will have to be inverted - it is unavoidable.
J
|
|
From: Josef W. <Jos...@gm...> - 2007-02-15 21:57:39
|
On Thursday 15 February 2007, Julian Seward wrote:
>
> > (2) Correct it and show how to do it the right way
> >
> > Solution (2) complicates lackey (a little?), but if it is also
> > a tutorial about branch counting, it should be correct.
>
> I agree. Sounds like a good solution to me.
Here is a proposed patch:
=====================================================================
--- a/lackey/lk_main.c
+++ b/lackey/lk_main.c
@@ -240,6 +240,8 @@ static ULong n_IRStmts = 0;
static ULong n_guest_instrs = 0;
static ULong n_Jccs = 0;
static ULong n_Jccs_untaken = 0;
+static ULong n_IJccs = 0;
+static ULong n_IJccs_untaken = 0;
static void add_one_func_call(void)
{
@@ -276,6 +278,16 @@ static void add_one_Jcc_untaken(void)
n_Jccs_untaken++;
}
+static void add_one_inverted_Jcc(void)
+{
+ n_IJccs++;
+}
+
+static void add_one_inverted_Jcc_untaken(void)
+{
+ n_IJccs_untaken++;
+}
+
/*------------------------------------------------------------*/
/*--- Stuff for --detailed-counts ---*/
/*------------------------------------------------------------*/
@@ -600,6 +612,9 @@ IRSB* lk_instrument ( VgCallbackClosure* closure,
Char fnname[100];
IRType type;
IRTypeEnv* tyenv = sbIn->tyenv;
+ Addr iaddr = 0, dst;
+ UInt ilen = 0;
+ Bool condition_inverted = False;
if (gWordTy != hWordTy) {
/* We don't currently support this case. */
@@ -661,6 +676,10 @@ IRSB* lk_instrument ( VgCallbackClosure* closure,
case Ist_IMark:
if (clo_basic_counts) {
+ /* Needed to be able to check for inverted condition in Ist_Exit */
+ iaddr = st->Ist.IMark.addr;
+ ilen = st->Ist.IMark.len;
+
/* Count guest instruction. */
di = unsafeIRDirty_0_N( 0, "add_one_guest_instr",
VG_(fnptr_to_fnentry)( &add_one_guest_instr ),
@@ -770,10 +789,23 @@ IRSB* lk_instrument ( VgCallbackClosure* closure,
case Ist_Exit:
if (clo_basic_counts) {
+ // The condition of a branch was inverted by VEX if a taken
+ // branch is in fact a fall-through according to the client address
+ tl_assert(iaddr != 0);
+ dst = (sizeof(Addr) == 4) ? st->Ist.Exit.dst->Ico.U32 :
+ st->Ist.Exit.dst->Ico.U64;
+ condition_inverted = (dst == iaddr + ilen);
+
/* Count Jcc */
- di = unsafeIRDirty_0_N( 0, "add_one_Jcc",
+ if (!condition_inverted)
+ di = unsafeIRDirty_0_N( 0, "add_one_Jcc",
VG_(fnptr_to_fnentry)( &add_one_Jcc ),
mkIRExprVec_0() );
+ else
+ di = unsafeIRDirty_0_N( 0, "add_one_inverted_Jcc",
+ VG_(fnptr_to_fnentry)( &add_one_inverted_Jcc ),
+ mkIRExprVec_0() );
+
addStmtToIRSB( sbOut, IRStmt_Dirty(di) );
}
if (clo_trace_mem) {
@@ -784,10 +816,17 @@ IRSB* lk_instrument ( VgCallbackClosure* closure,
if (clo_basic_counts) {
/* Count non-taken Jcc */
- di = unsafeIRDirty_0_N( 0, "add_one_Jcc_untaken",
+ if (!condition_inverted)
+ di = unsafeIRDirty_0_N( 0, "add_one_Jcc_untaken",
VG_(fnptr_to_fnentry)(
&add_one_Jcc_untaken ),
mkIRExprVec_0() );
+ else
+ di = unsafeIRDirty_0_N( 0, "add_one_inverted_Jcc_untaken",
+ VG_(fnptr_to_fnentry)(
+ &add_one_inverted_Jcc_untaken ),
+ mkIRExprVec_0() );
+
addStmtToIRSB( sbOut, IRStmt_Dirty(di) );
}
break;
@@ -823,16 +862,19 @@ static void lk_fini(Int exitcode)
tl_assert(clo_fnname[0]);
if (clo_basic_counts) {
+ ULong total_Jccs = n_Jccs + n_IJccs;
+ ULong taken_Jccs = (n_Jccs - n_Jccs_untaken) + n_IJccs_untaken;
+
VG_(message)(Vg_UserMsg,
"Counted %,llu calls to %s()", n_func_calls, clo_fnname);
VG_(message)(Vg_UserMsg, "");
VG_(message)(Vg_UserMsg, "Jccs:");
- VG_(message)(Vg_UserMsg, " total: %,llu", n_Jccs);
- VG_(percentify)((n_Jccs - n_Jccs_untaken), (n_Jccs ? n_Jccs : 1),
+ VG_(message)(Vg_UserMsg, " total: %,llu", total_Jccs);
+ VG_(percentify)(taken_Jccs, (total_Jccs ? total_Jccs : 1),
percentify_decs, percentify_size, percentify_buf);
VG_(message)(Vg_UserMsg, " taken: %,llu (%s)",
- (n_Jccs - n_Jccs_untaken), percentify_buf);
+ taken_Jccs, percentify_buf);
VG_(message)(Vg_UserMsg, "");
VG_(message)(Vg_UserMsg, "Executed:");
=====================================================================
I checked the instrumentation, and in fact VEX does 4x unrolling.
It seems to do the right thing now.
> Also, if we ever do more aggressive block formation which speculatively
> chases across conditional branches ("trace formation"), then some branch
> conditions will have to be inverted - it is unavoidable.
Is such speculative chasing useful? I assume you take the
outcome of the first execution, which triggers the translation. It is
not clear to me why that first execution should remain a good
prediction for the rest of the program's run.
Josef
|
|
From: Julian S. <js...@ac...> - 2007-02-16 08:41:01
|
On Thursday 15 February 2007 21:56, Josef Weidendorfer wrote:
> On Thursday 15 February 2007, Julian Seward wrote:
> > > (2) Correct it and show how to do it the right way
> > >
> > > Solution (2) complicates lackey (a little?), but if it is also
> > > a tutorial about branch counting, it should be correct.
> >
> > I agree. Sounds like a good solution to me.
>
> Here is a proposed patch:
Looks good to me.
> I checked the instrumentation; and in fact, VEX does 4x unrolling.
> It seems to do the right thing now.
Yes, up to 8x for the IR corresponding to 'rep movsb' etc.
J
|
|
From: Nicholas N. <nj...@cs...> - 2007-02-16 00:22:11
|
On Thu, 15 Feb 2007, Josef Weidendorfer wrote:
>> Also, if we ever do more aggressive block formation which speculatively
>> chases across conditional branches ("trace formation"), then some branch
>> conditions will have to be inverted - it is unavoidable.
>
> Is such a speculative chasing useful? I assume that you take the
> outcome of the first execution, which does the translation. It is not
> clear to me why the first execution will stay a good speculation in
> the run of a program.
Systems that do this kind of thing normally do so after doing some run-time
profiling for a while -- eg. optimise the hot blocks after they've run
enough times.
Nick
|
|
From: Julian S. <js...@ac...> - 2007-02-16 08:34:14
|
On Friday 16 February 2007 00:22, Nicholas Nethercote wrote:
> On Thu, 15 Feb 2007, Josef Weidendorfer wrote:
> >> Also, if we ever do more aggressive block formation which speculatively
> >> chases across conditional branches ("trace formation"), then some branch
> >> conditions will have to be inverted - it is unavoidable.
> >
> > Is such a speculative chasing useful? I assume that you take the
> > outcome of the first execution, which does the translation. It is not
> > clear to me why the first execution will stay a good speculation in
> > the run of a program.
>
> Systems that do this kind of thing normally do so after doing some run-time
> profiling for a while -- eg. optimise the hot blocks after they've run
> enough times.
That sounds like the right solution, but it adds a lot of complexity to
code cache management, which is already a bit hairy, what with
redirection, wrapping, and discarding.
A couple of months back I did some experiments with trace selection
using static branch "prediction", predicting backwards branches taken
and forwards ones not taken. IIRC that gives about a 68% correct rate
on SPEC CPU2000 'test'. Because, by default, VEX ends basic blocks
at a conditional branch, this is better than it sounds: even a 50%
correct prediction rate could in principle halve the number of block-to-
block transitions through the scheduler.
Some programs ran significantly faster (gzip) and many ran a couple
of percent faster. But it tends to slow down program startup because
the JIT produces (even) more code. Overall it didn't seem worth it.
J
|