From: Julian S. <js...@ac...> - 2011-12-09 14:28:11
Unfortunately I can't say anything much useful, but anyway here's some
background:

> I want to share some results I get from VEX for direct
> instrumentation on amd64, while trying to improve cachegrind.
> Perhaps there is some low-hanging fruit regarding some
> VEX code gen improvements.
> (I suppose this is mainly for Julian. Perhaps you can give
> me some pointers if improvements seem possible?)
>
> Here is an example where I tried to do a counter increment in
> instrumentation (adding 0xB for 11 previous Ir events).
> After the instrumentation pass:
>
> ...
> t35 = 0x402066B80:I64
> t36 = LDle:I64(t35)
> t37 = Add64(t36,0xB:I64)
> STle(t35) = t37
> ...
>
> This is the result for amd64:
>
> ...
> movabsq $0x402066B80,%r12
> movabsq $0x402066B80,%r13
> movq (%r13),%r14
> addq $0xB,%r14
> movq %r14,(%r12)

The code makes more sense if you compare it against the IR that goes
into the instruction selector (after tree-building, that is). You can
get this with trace flag 00001000, I believe. For example, I guess it
would show

  STle(0x402066B80:I64) = Add64(LDle:I64(0x402066B80:I64),0xB:I64)

One problem is that the IR optimiser (ir_opt.c) is applied to the IR
after instrumentation (producing the "tree-built" IR). One action of
the IR optimiser is to aggressively propagate constant values as far
as possible. So your careful CSEing of t35 in the example above is
completely destroyed, resulting in the IR line above with the
duplicated constant, and hence in the two copies of the constant in
the generated code.

This is one of the many difficulties of writing an optimiser: how
aggressive should constant propagation be? Generally, "as aggressive
as possible" is a win, because it creates the maximum opportunity for
constant folding and dead code removal .. but in this case it is not
a win.

One kludgey workaround is to add a special case to iselStmt in
host_amd64_isel.c, to detect the special case

  STle(atom1) = Add64(LDle:I64(atom2), expr)

and, in the case that atom1 == atom2, emit code more like what you
expect.
  .. compute expr into %rExpr
  .. compute atom1 into %rAddr (really, into an AMD64AMode)
  addq %rExpr, (%rAddr)

but that's fragile (you would need a new rule for the 32 bit (addl)
case, for example) and it would need to be re-done for every back end
(for optimum performance). Right now it's probably your least bad
option. Once you figure out how to hack on the instruction selector,
it's actually very easy to do. (+ quite fun) If expr is a constant
that fits in 32 bits then you can special-case it even more, to
generate exactly the 2-insn sequence you want.

------------------------------------------------------------

> To reduce code blowup for constant parameters for dirty helpers, I
> tried to load them in tempregs, knowing that they are used for
> multiple dirty helpers in a row: [...]
>
> movabsq $0x402066A28,%rbx
> movabsq $0x1003C,%r10
> movabsq $0x4020020C0,%r9
> movabsq $0x4020020C0,%r8
> movq (%r8),%rdi
> cmpq $0x1003C,%rdi
> movq %rbx,%rdi
> movq %r10,%rsi
> movq %r9,%rdx
> callnz[3] 0x38021130
> movabsq $0x402066A40,%rbx
> movabsq $0x1003C,%r10
> movabsq $0x4020020C0,%r9
> movabsq $0x4020020C0,%r8
> movq (%r8),%rdi
> cmpq $0x1003C,%rdi
> movq %rbx,%rdi
> movq %r10,%rsi
> movq %r9,%rdx
> callnz[3] 0x38021130
>
> Again, this probably is expected, as VEX implements guarded helper
> calls as special instructions such as "callnz". But this means that
> all the setup for the dirty helper call is done all the time, such
> as loading constant parameters. They are even loaded twice: once
> for the check and once for preparing the parameters.
>
> This makes the guarded dirty helpers slower than delaying the check
> into an always-called dirty helper.

Yeah, this is really bad. What we really need in IR -- and it would
solve this problem properly, and make the IR generally much more
flexible -- is control flow diamonds: if-then-else constructs. Then
you can put the dirty call in an else part and the fast-case code in
the then part (obviously), or vice versa.
Unfortunately, doing if-then-else control flow makes the IR
optimisation and register allocation passes much more complex,
because of the need to unify the register/optimisation state from the
two branches at the merge point (for forwards analysis/optimisation
passes) or at the start of the construct (for backwards
analysis/optimisation passes). Optimisation and regalloc in the
presence of loops (full control flow) is even more complex, but I
don't see much use for supporting loops, fortunately.

This is the real reason for the callnz kludge -- to avoid such
problems. If I had to do all this over again, I would support
if-then-else in IR and the back ends properly. If there are any
enthusiastic compiler hackers out there who want to try this, speak
up now.

J