From: Julian S. <js...@ac...> - 2011-12-09 14:28:11
Unfortunately I can't say anything much useful, but anyway here's some
background:

> I want to share some results I get from VEX for direct
> instrumentation on amd64, while trying to improve cachegrind.
> Perhaps there is some low-hanging fruit regarding some
> VEX code gen improvements.
> (I suppose this is mainly for Julian. Perhaps you can give
> me some pointers if improvements seem possible?)
>
> Here is an example where I tried to do a counter increment in
> instrumentation (adding 0xB for 11 previous Ir events).
> After the instrumentation pass:
>
> ...
> t35 = 0x402066B80:I64
> t36 = LDle:I64(t35)
> t37 = Add64(t36,0xB:I64)
> STle(t35) = t37
> ...
>
> This is the result for amd64:
>
> ...
> movabsq $0x402066B80,%r12
> movabsq $0x402066B80,%r13
> movq (%r13),%r14
> addq $0xB,%r14
> movq %r14,(%r12)

The code makes more sense if you compare it against the IR that goes
into the instruction selector (after tree-building, that is). You can
get this with trace flag 00001000, I believe. For example, I guess it
would show

  STle(0x402066B80:I64) = Add64(LDle:I64(0x402066B80:I64),0xB:I64)

One problem is that the IR optimiser (ir_opt.c) is applied to the IR
after instrumentation (producing the "tree-built" IR). One action of
the IR optimiser is to aggressively propagate constant values as far
as possible. So your careful CSEing of t35 in the example above is
completely destroyed, resulting in the IR line above with the
duplicated constant, and hence in the two copies of the constant in
the generated code.

This is one of the many difficulties of writing an optimiser: how
aggressive should constant propagation be? Generally, "as aggressive
as possible" is a win, because it creates the maximum opportunity for
constant folding and dead code removal .. but in this case it is not
a win.

One kludgey workaround is to add a special case to iselStmt in
host_amd64_isel.c, to detect the special case

  STle(atom1) = Add64(LDle:I64(atom2), expr)

and, in the case that atom1 == atom2, emit code more like what you
expect.
  .. compute expr into %rExpr
  .. compute atom1 into %rAddr (really, into an AMD64AMode)
  addq %rExpr, (%rAddr)

but that's fragile (you would need a new rule for the 32 bit (addl)
case, for example) and it would need to be re-done for every back end
(for optimum performance). Right now it's probably your least bad
option. Once you figure out how to hack on the instruction selector,
it's actually very easy to do. (+ quite fun) If expr is a constant
that fits in 32 bits then you can special-case it even more, to
generate exactly the 2-insn sequence you want.

------------------------------------------------------------

> To reduce code blowup for constant parameters for dirty helpers, I
> tried to load them in tempregs, knowing that they are used for
> multiple dirty helpers in a row: [...]
>
> movabsq $0x402066A28,%rbx
> movabsq $0x1003C,%r10
> movabsq $0x4020020C0,%r9
> movabsq $0x4020020C0,%r8
> movq (%r8),%rdi
> cmpq $0x1003C,%rdi
> movq %rbx,%rdi
> movq %r10,%rsi
> movq %r9,%rdx
> callnz[3] 0x38021130
> movabsq $0x402066A40,%rbx
> movabsq $0x1003C,%r10
> movabsq $0x4020020C0,%r9
> movabsq $0x4020020C0,%r8
> movq (%r8),%rdi
> cmpq $0x1003C,%rdi
> movq %rbx,%rdi
> movq %r10,%rsi
> movq %r9,%rdx
> callnz[3] 0x38021130
>
> Again, this probably is expected, as VEX implements guarded helper
> calls as special instructions such as "callnz". But this means that
> all the setup for the dirty helper call is done all the time, such
> as loading constant parameters. They are even loaded twice: once
> for the check and once for preparing the parameters.
>
> This makes the guarded dirty helpers slower than delaying the check
> into an always-called dirty helper.

Yeah, this is really bad. What we really need in IR -- and it would
solve this problem properly, and make the IR generally much more
flexible -- is control flow diamonds: if-then-else constructs. Then
you can put the dirty call in an else part and the fast-case code in
the then part (obviously), or vice versa.
Unfortunately, doing if-then-else control flow makes the IR
optimisation and register allocation passes much more complex,
because of the need to unify the register/optimisation state from the
two branches at the merge point (for forwards analysis/optimisation
passes) or at the start of the construct (for backwards
analysis/optimisation passes). Optimisation and regalloc in the
presence of loops (full control flow) is even more complex, but I
don't see much use for supporting loops, fortunately.

This is the real reason for the callnz kludge -- to avoid such
problems. If I had to do all this over again, I would support
if-then-else in IR and the back ends properly. If there are any
enthusiastic compiler hackers out there who want to try this, speak
up now.

J