From: Julian S. <js...@ac...> - 2005-03-30 22:51:12
Hmm. This is a pig of a problem. With r3484/r3485 I managed to get rid of one insn per fast-iteration, but being PIE still costs us. The "best" solution I can think of is to glom all the dispatcher-visible state (dispatch_ctr, fast[], fastN[]) into a single struct, and pass a pointer to that to run_innerloop. That pointer can be reloaded from the stack each time round the loop, so the extra cost is one insn/bb, which, given that one insn was just got rid of, is free, at least in terms of insn counts. So the main difficulty -- apart from the ugliness of inventing a struct purely for this reason -- is how the assembly code can know the (literal) offsets of the struct components, without having to mess around with preprocessor programs to extract the offsets. That's just too ugly.

For the time being, let's go with the "movq VG_(tt_fast)@GOTPCREL(%rip), %rcx" style fixes that Tom/Jeremy cooked up. Does the x86 side need similar modification, or is it OK as-is?

> Yep, that looks about right. I wasn't too worried about adding a couple
> of extra instructions here because BB-chaining should take most of the
> load off this code (in 2.4); in 3, one presumes Vex's BB-fusing is about
> as effective (or we need to look at doing chaining as well).

Further ahead, it would be nice to reinstate bb-chaining. Vex's chasing results in bb-count reductions on the order of 20%-30% compared to 2.4, which is a start, and gives indirect benefits in that the guest regs remain cached in host regs across the boundary. Nevertheless, this leaves a lot of scope for chaining. For one thing, chaining removes the indirect and presumably unpredictable call that the dispatcher must make.

J
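For what it's worth, a minimal sketch of the single-struct idea, without the preprocessor-extraction step. All names and sizes here are illustrative, not Valgrind's actual identifiers or layout: the point is that the C side can pin down the literal offsets the hand-written assembly assumes via compile-time assertions, so a layout change breaks the build rather than silently corrupting state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative size; the real fast-cache size is whatever
   VG_(tt_fast) uses. */
#define VG_TT_FAST_SIZE 4096

/* Hypothetical: all dispatcher-visible state glommed into one struct.
   run_innerloop would receive a single pointer to this, which the
   assembly inner loop reloads from the stack each time round. */
typedef struct {
   uint32_t dispatch_ctr;            /* expected at offset 0  */
   void*    fast [VG_TT_FAST_SIZE];  /* expected at offset 8 (LP64) */
   uint64_t fastN[VG_TT_FAST_SIZE];  /* expected after fast[] */
} VgDispatchState;

/* The assembly hard-codes these literal offsets; check them at compile
   time (C11 _Static_assert -- a negative-size-array trick would do the
   same job in C89). */
_Static_assert(offsetof(VgDispatchState, dispatch_ctr) == 0,
               "dispatch_ctr offset drifted");
_Static_assert(offsetof(VgDispatchState, fast) == 8,
               "fast[] offset drifted");
_Static_assert(offsetof(VgDispatchState, fastN)
               == 8 + 8 * (size_t)VG_TT_FAST_SIZE,
               "fastN[] offset drifted");
```

This keeps the offsets as plain literals in the .S file while making the C compiler the arbiter of whether they are still true, which sidesteps the "run a program to generate a header of offsets" machinery.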