|
From: Nuno L. <nun...@sa...> - 2008-03-25 11:34:49
|
>> Today I've written a simple peephole optimizer that works after the >> instruction selection and before register allocation. So it takes an >> array >> of instructions and outputs another array of instructions. It's also >> independent of the platform. The patch is available at: >> http://web.ist.utl.pt/nuno.lopes/valgrind_vex_peephole_optimizations.txt >> Currently it only removes redudant MOVs between virtual registers that >> can >> be propagated forward. Imagine this: >> >> 9 movl %vr16,%vr85 ; %vr16 isn't referenced below this line >> 10 subl $0x4,%vr85 >> >> (maps %vr85 to %vr16) >> >> the movl is removed and translated into: >> 9 subl $0x4,%vr16 > > Surely, whether %vr16 is reference again later isn't important; but > whether %vr85 is referenced again later is important? Yes, sure. I then replace all references to %vr85 with %vr16. Let's say that I do the 'movl' at compile time, by propagating it through the following references. The optimization I missed was to remove the movl if the destination register isn't used anymore, but that doesn't seem much likely to happen (i.e. I don't think that isel will produce such code, but I didn't check it further). >> With only this simple transformation, I was able to reduce the number of >> instructions of a simple block (instrumented with memcheck) from 120 to >> 106, >> which is quite good. The number of register spills is also hugely reduced >> (on a x86 host)! (memcheck creates some unnecessary moves because of the >> dirty handler arguments). > > The 120-to-106 reduction is on the final generated code? Thats the reduction in this peep-hole optimization pass. The register allocated code has 108 instructions. Without this peephole optimizer, the register allocated code has 122 instructions! (i.e. no redudant 'movl' is removed) > The register allocator can get rid of many register-to-register moves (as > the PLDI paper explains), but perhaps it misses some. Or many. Perhaps > you could send an example block, showing before and after? The example I gave in the previous email. The movl instruction isn't killed by the register allocation code. You can take a look at the output of the optimizer running on an example block: http://web.ist.utl.pt/nuno.lopes/valgrind_block_after_peephole.txt >> The caveats? Well it segfaults *after* compiling all the blocks, which is >> weird.. I get the following error: >> >> ==28498== Invalid read of size 1 >> ==28498== at 0x4015508: (within /lib/ld-2.6.1.so) >> ==28498== by 0x4013CB5: (within /lib/ld-2.6.1.so) >> ==28498== by 0x400134E: (within /lib/ld-2.6.1.so) >> ==28498== by 0x40009A6: (within /lib/ld-2.6.1.so) >> ==28498== Address 0x10090188 is not stack'd, malloc'd or (recently) >> free'd >> ==28498== >> ==28498== Process terminating with default action of signal 11 (SIGSEGV) >> ==28498== Access not within mapped region at address 0x10090188 >> ==28498== at 0x4015508: (within /lib/ld-2.6.1.so) >> ==28498== by 0x4013CB5: (within /lib/ld-2.6.1.so) >> ==28498== by 0x400134E: (within /lib/ld-2.6.1.so) >> ==28498== by 0x40009A6: (within /lib/ld-2.6.1.so) >> >> Do you know what might be causing the problem? > > I can't think of anything, other than the transformation has a bug in it. It's weird, because if I do the register replacement and do *not* remove the 'movl's, thus getting a few 'movl %vrX, %vrX' (really NOPs), it works. Maybe there's some jump that is screwed by the instruction number decrease, which doesn't make much sense and register spill adds additional instructions. >> Also, after this is working correctly (i.e. fixing this segfault and >> running >> the tests), do you think this could be incorporated in valgrind/VEX? > > Quite possibly, although Julian has the final say. It would depend on > whether it actually improves performance -- it's always hard to predict > whether code optimisations have a genuine effect. So far, from the little tests I've run it seems that the optimizer is very useful when running memcheck (reduces the number of instructions up to 12%). I don't have any benchmarks yet (as I can't run them :P). Thanks, Nuno |