From: Josef W. <Jos...@gm...> - 2011-11-29 00:28:00
Hi,
I want to share some results I get from VEX for direct
instrumentation on amd64, while trying to improve cachegrind.
Perhaps there is some low-hanging fruit regarding some
VEX code gen improvements.
(I suppose this is mainly for Julian. Perhaps you can give
me some pointers if improvements seem possible?)
Here is an example where I tried to do a counter increment in
instrumentation (adding 0xB for 11 previous Ir events).
After the instrumentation pass:
...
t35 = 0x402066B80:I64
t36 = LDle:I64(t35)
t37 = Add64(t36,0xB:I64)
STle(t35) = t37
...
This is the result for amd64:
...
movabsq $0x402066B80,%r12
movabsq $0x402066B80,%r13
movq (%r13),%r14
addq $0xB,%r14
movq %r14,(%r12)
...
Even though I load the constant address into a tempreg (t35) to use
for both the load and the store, the resulting code uses two registers
for it, and materialises the same 64-bit address twice.
The ideal result probably would be:
movabsq $0x402066B80,%r12
addq $0xB,(%r12)
To reduce code blowup from constant parameters to dirty helpers, I
tried to load them into tempregs, knowing that they are used by
multiple dirty helpers in a row:
t31 = 0x1003C:I64
t32 = 0x4020020C0:I64
DIRTY 1:I1 :::
log_1I_0D_cache_access[rp=3]{0x38021130}(0x402066A28:I64,t31,t32)
DIRTY 1:I1 :::
log_1I_0D_cache_access[rp=3]{0x38021130}(0x402066A40:I64,t31,t32)
Result:
movabsq $0x402066A28,%rdi
movq $0x1003C,%rsi
movabsq $0x4020020C0,%rdx
call[3] 0x38021130
movabsq $0x402066A40,%rdi
movq $0x1003C,%rsi
movabsq $0x4020020C0,%rdx
call[3] 0x38021130
Probably this is expected: instead of spilling a register to the stack
and restoring it after the dirty helper, it is better to just re-load
the 64-bit value. However, the consequence is that values known to be
constant at instrumentation time should probably not be passed as
parameters, as this blows up the generated code. It is better to put
them into a struct and pass just the one address of this struct to the
helper. Accessing this struct once brings a full line into the cache,
so accessing further values in the struct is cheap.
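As a plain-C illustration of the struct approach (the names and field
layout here are hypothetical, not actual cachegrind code): instead of
passing several 64-bit constants to the helper, each needing a movabsq
at every call site, the constants are grouped into one struct and only
its address is passed.

```c
#include <stdint.h>

/* Hypothetical per-instruction cost-center data, grouped so the
   helper receives a single pointer instead of several 64-bit
   immediates. */
typedef struct {
    uint64_t instr_addr;   /* guest instruction address */
    uint64_t instr_info;   /* packed event info (the 0x1003C above) */
    uint64_t line_slot;    /* address of the cached-line slot */
} CostCenter;

static uint64_t events_seen = 0;

/* The helper reads everything through the one pointer; the first
   access brings the struct's cache line in, so the remaining field
   reads are cheap. */
static void log_access(const CostCenter* cc)
{
    events_seen += cc->instr_info & 0xFF;   /* e.g. an event count */
}
```

The call site then only has to materialise one address, so the setup
code per dirty call shrinks from three movabsq instructions to one.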
The last two parameters above are used directly for a fast check of
whether the simulation has to be called at all, or whether the cache
state stays the same. If the check is done in the instrumentation, it
looks like:
t31 = 0x1003C:I64
t32 = 0x4020020C0:I64
t33 = LDle:I64(t32)
t34 = CmpNE64(t33,t31)
DIRTY t34 :::
log_1I_0D_cache_access[rp=3]{0x38021130}(0x402066A28:I64,t31,t32)
t35 = LDle:I64(t32)
t36 = CmpNE64(t35,t31)
DIRTY t36 :::
log_1I_0D_cache_access[rp=3]{0x38021130}(0x402066A40:I64,t31,t32)
It looks quite compact, as the check reuses data in tempregs, which
can even be used as parameters for the dirty helpers. However:
movabsq $0x402066A28,%rbx
movabsq $0x1003C,%r10
movabsq $0x4020020C0,%r9
movabsq $0x4020020C0,%r8
movq (%r8),%rdi
cmpq $0x1003C,%rdi
movq %rbx,%rdi
movq %r10,%rsi
movq %r9,%rdx
callnz[3] 0x38021130
movabsq $0x402066A40,%rbx
movabsq $0x1003C,%r10
movabsq $0x4020020C0,%r9
movabsq $0x4020020C0,%r8
movq (%r8),%rdi
cmpq $0x1003C,%rdi
movq %rbx,%rdi
movq %r10,%rsi
movq %r9,%rdx
callnz[3] 0x38021130
Again, this probably is expected, as VEX implements guarded helper
calls as a special instruction such as "callnz". But this means that
all the setup for the dirty helper call is done every time, such as
loading the constant parameters. They are even loaded twice: once for
the check and once for preparing the parameters.
This makes guarded dirty helpers slower than delaying the check into
an always-called dirty helper.
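A minimal C sketch of that alternative, i.e. doing the fast check
inside an unconditionally called helper (names and layout are
illustrative only, not actual cachegrind code):

```c
#include <stdint.h>

/* Counts how often the slow path (the actual simulation) runs. */
static uint64_t sim_calls = 0;

static void run_cache_sim(uint64_t instr_addr)
{
    sim_calls += 1;             /* stand-in for the real simulation */
    (void)instr_addr;
}

/* Always called from the generated code.  The guard that callnz
   would have evaluated is the first thing the helper does, so the
   call-site setup happens only once and the constants are not
   loaded a second time for the check. */
static void log_access_checked(uint64_t instr_addr,
                               uint64_t line_tag,
                               uint64_t* line_slot)
{
    if (*line_slot == line_tag)
        return;                 /* fast path: cache state unchanged */
    *line_slot = line_tag;
    run_cache_sim(instr_addr);  /* slow path */
}
```

The trade-off is one unconditional call per event against the doubled
constant loads and duplicated compare that the guarded form produces.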
Josef
From: Julian S. <js...@ac...> - 2011-12-09 14:28:11
Unfortunately I can't say anything much useful, but anyway here's some
background:

> I want to share some results I get from VEX for direct
> instrumentation on amd64, while trying to improve cachegrind.
> Perhaps there is some low-hanging fruit regarding some
> VEX code gen improvements.
> (I suppose this is mainly for Julian. Perhaps you can give
> me some pointers if improvements seem possible?)
>
> Here is an example where I tried to do a counter increment in
> instrumentation (adding 0xB for 11 previous Ir events).
> After the instrumentation pass:
>
> ...
> t35 = 0x402066B80:I64
> t36 = LDle:I64(t35)
> t37 = Add64(t36,0xB:I64)
> STle(t35) = t37
> ...
>
> This is the result for amd64:
>
> ...
> movabsq $0x402066B80,%r12
> movabsq $0x402066B80,%r13
> movq (%r13),%r14
> addq $0xB,%r14
> movq %r14,(%r12)

The code makes more sense if you compare it against the IR that goes
into the instruction selector (after tree-building, that is). You can
get this with trace flag 00001000, I believe. For example, I guess it
would show

  STle(0x402066B80:I64) = Add64(LDle:I64(0x402066B80:I64),0xB:I64)

One problem is that the IR optimiser (ir_opt.c) is applied to the IR
after instrumentation (producing the "tree-built" IR). One action of
IR opt is to aggressively propagate constant values as far as
possible. So your careful CSEing of t35 in the example above is
completely destroyed, resulting in the IR line above with the
duplicated constant, and hence in the two copies of the constant.

This is one of the many difficulties of writing an optimiser -- how
aggressive should constant propagation be? Generally, "as aggressive
as possible" is a win because it creates the maximum opportunity for
constant folding and dead code removal .. but in this case it is not
a win.

One kludgey workaround is to add a special case to iselStmt in
host_amd64_isel.c, to detect the special case

  STle(atom1) = Add64(LDle:I64(atom2), expr)

and, in the case that atom1 == atom2, emit code more like what you
expect:

  .. compute expr into %rExpr
  .. compute atom1 into %rAddr (really, into an AMD64AMode)
  addq %rExpr, (%rAddr)

but that's fragile (you would need a new rule for the 32-bit (addl)
case, for example) and would need to be re-done for every back end
(for optimum performance). Right now it's probably your least bad
option. Once you figure out how to hack on the instruction selector,
it's actually very easy to do. (+ quite fun)

If expr is a constant that fits in 32 bits, then you can special-case
it even more, to generate exactly the 2-insn sequence you want.

------------------------------------------------------------

> To reduce code blowup for constant parameters for dirty helpers, I
> tried to load them in tempregs, knowing that they are used for
> multiple dirty helpers in a row:

[...]

> movabsq $0x402066A28,%rbx
> movabsq $0x1003C,%r10
> movabsq $0x4020020C0,%r9
> movabsq $0x4020020C0,%r8
> movq (%r8),%rdi
> cmpq $0x1003C,%rdi
> movq %rbx,%rdi
> movq %r10,%rsi
> movq %r9,%rdx
> callnz[3] 0x38021130
> movabsq $0x402066A40,%rbx
> movabsq $0x1003C,%r10
> movabsq $0x4020020C0,%r9
> movabsq $0x4020020C0,%r8
> movq (%r8),%rdi
> cmpq $0x1003C,%rdi
> movq %rbx,%rdi
> movq %r10,%rsi
> movq %r9,%rdx
> callnz[3] 0x38021130
>
> Again, this probably is expected, as VEX implements guarded helper
> calls as a special instruction such as "callnz". But this means
> that all the setup for the dirty helper call is done all the time,
> such as loading constant parameters. They are even loaded twice:
> once for the check and once for preparing the parameters.
>
> This makes the guarded dirty helpers slower than delaying the check
> into an always-called dirty helper.

Yeah, this is really bad. Really what we need in IR, that would solve
this problem properly, and make the IR generally much more flexible,
is control flow diamonds -- if-then-else constructs. Then you can put
the dirty call in an else part and the fast-case code in the then
part (obviously), or vice versa.

Unfortunately, doing if-then-else control flow makes the IR
optimisation and register allocation passes much more complex,
because of the need to unify the register/optimisation state from the
two branches at the merge point (for forwards analysis/optimisation
passes) or at the start of the construct (for backwards
analysis/optimisation passes). Optimisation and regalloc in the
presence of loops (full control flow) is even more complex, but I
don't see much use for supporting loops, fortunately.

This is the real reason for the callnz kludge -- to avoid such
problems. If I had to do all this over again, I would support
if-then-else in IR and the back ends properly. If there are any
enthusiastic compiler hackers out there who want to try this, speak
up now.

J
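The special case sketched for iselStmt can be modelled in a few lines
of C. This is a toy model of the IR shapes involved, not VEX's real
IRStmt/IRExpr types from libvex_ir.h; it only shows the shape of the
check the instruction selector would perform:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy expression tree: just enough to express
   STle(atom1) = Add64(LDle:I64(atom2), expr). */
typedef struct Expr {
    enum { E_CONST, E_TMP, E_LOAD, E_ADD64 } tag;
    uint64_t     constval;   /* E_CONST */
    int          tmp;        /* E_TMP */
    struct Expr* arg1;       /* E_LOAD: address; E_ADD64: lhs */
    struct Expr* arg2;       /* E_ADD64: rhs */
} Expr;

typedef struct { Expr* addr; Expr* data; } Store;

/* Atoms (constants or temps) are equal if they denote the same value. */
static bool atom_eq(const Expr* a, const Expr* b)
{
    if (a->tag != b->tag) return false;
    if (a->tag == E_CONST) return a->constval == b->constval;
    if (a->tag == E_TMP)   return a->tmp == b->tmp;
    return false;
}

/* Match  ST(atom1) = Add64(LD(atom2), expr)  with atom1 == atom2,
   i.e. a read-modify-write of a single location, which could then
   be emitted as a memory-destination addq. */
static bool is_rmw_add64(const Store* st)
{
    return st->data->tag == E_ADD64
        && st->data->arg1->tag == E_LOAD
        && atom_eq(st->addr, st->data->arg1->arg1);
}
```

In the real back end the match would be written against IRStmt/IRExpr
tags, and a further check on the rhs being a 32-bit-representable
constant would select the two-instruction form.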
From: Josef W. <Jos...@gm...> - 2011-12-09 15:35:30
Hi Julian,

On 09.12.2011 15:27, Julian Seward wrote:
> Unfortunately I can't say anything much useful, but anyway here's
> some background:

thank you very much for the comments. Introducing new optimization
rules and special cases for the instruction selector does not seem
like low-hanging fruit for me at the moment, but I can imagine putting
some time into it if I see benefits. (I actually have a working patch
for improved code generation for writing byte constants into memory
for the amd64 backend; what's missing are regression tests.) In the
end, I think my mail just puts together some experiences/pitfalls with
VEX, and shows that some cases are better solved not by direct
instrumentation.

>> I want to share some results I get from VEX for direct
>> instrumentation on amd64, while trying to improve cachegrind.
>> Perhaps there is some low-hanging fruit regarding some
>> VEX code gen improvements.
>> (I suppose this is mainly for Julian. Perhaps you can give
>> me some pointers if improvements seem possible?)
>>
>> Here is an example where I tried to do a counter increment in
>> instrumentation (adding 0xB for 11 previous Ir events).
>> After the instrumentation pass:
>>
>> ...
>> t35 = 0x402066B80:I64
>> t36 = LDle:I64(t35)
>> t37 = Add64(t36,0xB:I64)
>> STle(t35) = t37
>> ...
>>
>> This is the result for amd64:
>>
>> ...
>> movabsq $0x402066B80,%r12
>> movabsq $0x402066B80,%r13
>> movq (%r13),%r14
>> addq $0xB,%r14
>> movq %r14,(%r12)

> The code makes more sense if you compare it against the IR that
> goes into the instruction selector (after tree-building, that is).
> You can get this with trace flag 00001000, I believe. For example,
> I guess it would show
>
>   STle(0x402066B80:I64) = Add64(LDle:I64(0x402066B80:I64),0xB:I64)
>
> One problem is that the IR optimiser (ir_opt.c) is applied to the
> IR after instrumentation (producing the "tree-built" IR). One
> action of IR opt is to aggressively propagate constant values as
> far as possible. So your careful CSEing of t35 in the example above
> is completely destroyed, resulting in the IR line above with the
> duplicated constant, and hence in the two copies of the constant.
>
> This is one of the many difficulties of writing an optimiser -- how
> aggressive should constant propagation be? Generally, "as
> aggressive as possible" is a win because it creates the maximum
> opportunity for constant folding and dead code removal .. but in
> this case it is not a win.

Ok, I understand.

> One kludgey workaround is to add a special case to iselStmt in
> host_amd64_isel.c, to detect the special case
>
>   STle(atom1) = Add64(LDle:I64(atom2), expr)
>
> and in the case that atom1 == atom2, emit code more like what you
> expect.
>
>   .. compute expr into %rExpr
>   .. compute atom1 into %rAddr (really, into an AMD64AMode)
>   addq %rExpr, (%rAddr)
>
> but that's fragile (you would need a new rule for the 32-bit (addl)
> case, for example) and would need to be re-done for every back end
> (for optimum performance).

Hmm. This actually does not sound like a kludge to me.

> Right now it's probably your least bad option.

Actually, with the current way VEX works, I do not regard dirty
helper calls with a lot of constant parameters as a good thing. It
really seems better to just put the constants into a struct. Of
course, this adds some indirection, but e.g. for cachegrind, we have
such a struct anyway.

> Once you figure out how to hack on the instruction selector, it's
> actually very easy to do. (+ quite fun)

This really sounds like a good exercise to do if I get some time.

>> movabsq $0x402066A28,%rbx
>> movabsq $0x1003C,%r10
>> movabsq $0x4020020C0,%r9
>> movabsq $0x4020020C0,%r8
>> movq (%r8),%rdi
>> cmpq $0x1003C,%rdi
>> movq %rbx,%rdi
>> movq %r10,%rsi
>> movq %r9,%rdx
>> callnz[3] 0x38021130
>> movabsq $0x402066A40,%rbx
>> movabsq $0x1003C,%r10
>> movabsq $0x4020020C0,%r9
>> movabsq $0x4020020C0,%r8
>> movq (%r8),%rdi
>> cmpq $0x1003C,%rdi
>> movq %rbx,%rdi
>> movq %r10,%rsi
>> movq %r9,%rdx
>> callnz[3] 0x38021130
>>
>> ...

> Yeah, this is really bad. Really what we need in IR, that would
> solve this problem properly, and make the IR generally much more
> flexible, is control flow diamonds -- if-then-else constructs. Then
> you can put the dirty call in an else part and the fast-case code
> in the then part (obviously), or vice versa.
>
> Unfortunately doing if-then-else control flow makes the IR
> optimisation and register allocation passes much more complex,
> because of the need to unify the register/optimisation state from
> the two branches at the merge point (for forwards
> analysis/optimisation passes) or at the start of the construct (for
> backwards analysis/optimisation passes).

I understand. Another option would probably be to allow the code
generation to be used independently, and make the guarded helper call
into generated code.

> Optimisation and regalloc in the presence of loops (full control
> flow) is even more complex, but I don't see much use for supporting
> loops, fortunately.
>
> This is the real reason for the callnz kludge -- to avoid such
> problems. If I had to do all this over again, I would support
> if-then-else in IR and the back ends properly. If there are any
> enthusiastic compiler hackers out there who want to try this, speak
> up now.

I wonder if it makes sense to try to combine Valgrind with other
open-source projects doing code generation (JavaScript engines, JVMs,
LLVM ...?).

Josef