|
From: Nuno L. <nun...@sa...> - 2008-03-24 18:23:19
|
Hi,

Today I've written a simple peephole optimizer that works after instruction selection and before register allocation. So it takes an array of instructions and outputs another array of instructions. It's also independent of the platform. The patch is available at:
http://web.ist.utl.pt/nuno.lopes/valgrind_vex_peephole_optimizations.txt

Currently it only removes redundant MOVs between virtual registers that can be propagated forward. Imagine this:

9   movl %vr16,%vr85   ; %vr16 isn't referenced below this line
10  subl $0x4,%vr85

(maps %vr85 to %vr16)

the movl is removed and translated into:

9   subl $0x4,%vr16

With only this simple transformation, I was able to reduce the number of instructions of a simple block (instrumented with memcheck) from 120 to 106, which is quite good. The number of register spills is also hugely reduced (on an x86 host)! (memcheck creates some unnecessary moves because of the dirty handler arguments.)

The caveats? Well, it segfaults *after* compiling all the blocks, which is weird.. I get the following error:

==28498== Invalid read of size 1
==28498==    at 0x4015508: (within /lib/ld-2.6.1.so)
==28498==    by 0x4013CB5: (within /lib/ld-2.6.1.so)
==28498==    by 0x400134E: (within /lib/ld-2.6.1.so)
==28498==    by 0x40009A6: (within /lib/ld-2.6.1.so)
==28498==  Address 0x10090188 is not stack'd, malloc'd or (recently) free'd
==28498==
==28498== Process terminating with default action of signal 11 (SIGSEGV)
==28498==  Access not within mapped region at address 0x10090188
==28498==    at 0x4015508: (within /lib/ld-2.6.1.so)
==28498==    by 0x4013CB5: (within /lib/ld-2.6.1.so)
==28498==    by 0x400134E: (within /lib/ld-2.6.1.so)
==28498==    by 0x40009A6: (within /lib/ld-2.6.1.so)
==28498==  If you believe this happened as a result of a stack overflow in your
==28498==  program's main thread (unlikely but possible), you can try to increase
==28498==  the size of the main thread stack using the --main-stacksize= flag.
==28498== The main thread stack size used in this run was 8388608.
==28498==
==28498== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==28498== malloc/free: in use at exit: 0 bytes in 0 blocks.
==28498== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==28498== For counts of detected errors, rerun with: -v
==28498== All heap blocks were freed -- no leaks are possible.
./vg-in-place: line 8: 28498 Segmentation fault VALGRIND_LIB=$d/.in_place VALGRIND_LIB_INNER=$d/.in_place $d/coregrind/valgrind "$@"

Do you know what might be causing the problem?

Also, after this is working correctly (i.e. fixing this segfault and running the tests), do you think this could be incorporated in valgrind/VEX?

Regards,
Nuno
|
From: Nicholas N. <nj...@cs...> - 2008-03-25 08:33:40
|
On Mon, 24 Mar 2008, Nuno Lopes wrote:

> Today I've written a simple peephole optimizer that works after the
> instruction selection and before register allocation. So it takes an array
> of instructions and outputs another array of instructions. It's also
> independent of the platform. The patch is available at:
> http://web.ist.utl.pt/nuno.lopes/valgrind_vex_peephole_optimizations.txt
> Currently it only removes redundant MOVs between virtual registers that can
> be propagated forward. Imagine this:
>
> 9   movl %vr16,%vr85   ; %vr16 isn't referenced below this line
> 10  subl $0x4,%vr85
>
> (maps %vr85 to %vr16)
>
> the movl is removed and translated into:
> 9   subl $0x4,%vr16

Surely, whether %vr16 is referenced again later isn't important; but whether
%vr85 is referenced again later is important?

> With only this simple transformation, I was able to reduce the number of
> instructions of a simple block (instrumented with memcheck) from 120 to 106,
> which is quite good. The number of register spills is also hugely reduced
> (on a x86 host)! (memcheck creates some unnecessary moves because of the
> dirty handler arguments).

The 120-to-106 reduction is on the final generated code? Huh. The register
allocator can get rid of many register-to-register moves (as the PLDI paper
explains), but perhaps it misses some. Or many. Perhaps you could send an
example block, showing before and after?

> The caveats? Well it segfaults *after* compiling all the blocks, which is
> weird.. I get the following error:
>
> [valgrind output snipped]
>
> Do you know what might be causing the problem?

I can't think of anything, other than the transformation has a bug in it.

> Also, after this is working correctly (i.e. fixing this segfault and running
> the tests), do you think this could be incorporated in valgrind/VEX?

Quite possibly, although Julian has the final say. It would depend on whether
it actually improves performance -- it's always hard to predict whether code
optimisations have a genuine effect.

Nick
|
From: Nuno L. <nun...@sa...> - 2008-03-25 11:34:49
|
>> Currently it only removes redundant MOVs between virtual registers that
>> can be propagated forward. Imagine this:
>>
>> 9   movl %vr16,%vr85   ; %vr16 isn't referenced below this line
>> 10  subl $0x4,%vr85
>>
>> (maps %vr85 to %vr16)
>>
>> the movl is removed and translated into:
>> 9   subl $0x4,%vr16
>
> Surely, whether %vr16 is referenced again later isn't important; but
> whether %vr85 is referenced again later is important?

Yes, sure. I then replace all references to %vr85 with %vr16. Let's say that I do the 'movl' at compile time, by propagating it through the following references. The optimization I missed was to remove the movl if the destination register isn't used anymore, but that doesn't seem very likely to happen (i.e. I don't think that isel will produce such code, but I didn't check it further).

>> With only this simple transformation, I was able to reduce the number of
>> instructions of a simple block (instrumented with memcheck) from 120 to
>> 106, which is quite good. The number of register spills is also hugely
>> reduced (on a x86 host)! (memcheck creates some unnecessary moves because
>> of the dirty handler arguments).
>
> The 120-to-106 reduction is on the final generated code?

That's the reduction from this peephole optimization pass alone. The register-allocated code has 108 instructions. Without this peephole optimizer, the register-allocated code has 122 instructions! (i.e. no redundant 'movl' is removed)

> The register allocator can get rid of many register-to-register moves (as
> the PLDI paper explains), but perhaps it misses some. Or many. Perhaps
> you could send an example block, showing before and after?

The example I gave in the previous email. The movl instruction isn't killed by the register allocation code. You can take a look at the output of the optimizer running on an example block:
http://web.ist.utl.pt/nuno.lopes/valgrind_block_after_peephole.txt

>> The caveats? Well it segfaults *after* compiling all the blocks, which is
>> weird.. I get the following error:
>>
>> [valgrind output snipped]
>>
>> Do you know what might be causing the problem?
>
> I can't think of anything, other than the transformation has a bug in it.

It's weird, because if I do the register replacement and do *not* remove the 'movl's, thus getting a few 'movl %vrX,%vrX' (really NOPs), it works. Maybe there's some jump that is screwed up by the decrease in the instruction count, but that doesn't make much sense, since register spilling also adds extra instructions.

>> Also, after this is working correctly (i.e. fixing this segfault and
>> running the tests), do you think this could be incorporated in
>> valgrind/VEX?
>
> Quite possibly, although Julian has the final say. It would depend on
> whether it actually improves performance -- it's always hard to predict
> whether code optimisations have a genuine effect.

So far, from the little testing I've run, it seems that the optimizer is very useful when running memcheck (it reduces the number of instructions by up to 12%). I don't have any benchmarks yet (as I can't run them :P).

Thanks,
Nuno
|
From: Julian S. <js...@ac...> - 2008-03-25 11:44:21
|
> Currently it only removes redundant MOVs between virtual registers that can
> be propagated forward. Imagine this:
>
> 9   movl %vr16,%vr85   ; %vr16 isn't referenced below this line
> 10  subl $0x4,%vr85
>
> (maps %vr85 to %vr16)
>
> the movl is removed and translated into:
> 9   subl $0x4,%vr16

Yes. However, the register allocator does the same transformation (virtual-to-virtual register move coalescing), in the case where the source register's live range ends at the move instruction and the destination register's live range starts at the move instruction.

That allows the instruction selectors to generate apparently-stupid code with lots of v-v moves, and reg-alloc then cleans it up later. This makes construction of the instruction selectors much simpler, especially for 2-address targets (x86, amd64).

Or are you doing something else apart from V-V coalescing?

> (memcheck creates some unnecessary moves because
> of the dirty handler arguments).

Yes .. it could probably do better in cases where a V(irtual) register is moved to a R(eal) register, and the V's live range ends at that point. I think I tried to add a "preference" mechanism to regalloc2.c, which says, for each virtual reg, which real reg it would "prefer" to be in, for this reason. But I also think that didn't help, because it constrains regalloc's choice of registers and so just moves the reg-to-reg moves elsewhere. However, I didn't investigate much; you may be able to do better.

> Do you know what might be causing the problem?

Check carefully the validity of the transformation(s) you do. It's easy to break stuff at this level.

J
|
From: Nuno L. <nun...@sa...> - 2008-03-25 15:08:12
|
>> Currently it only removes redundant MOVs between virtual registers that
>> can be propagated forward. [...]
>
> Yes. However the register allocator does the same transformation
> (virtual-to-virtual register move coalescing), in the case where
> the source register's live range ends at the move instruction and
> the destination register's live range starts at the move instruction.
>
> That allows the instruction selectors to generate apparently-stupid
> code with lots of v-v moves, and reg-alloc then cleans it up later.
> This makes construction of the instruction selectors much simpler,
> especially for 2-address targets (x86, amd64).

Uhm, strange.. I must confess that I didn't read the register allocation part of the PLDI paper, as I have no experience with register allocation algorithms. Now looking at it, it seems that part of the moves I'm eliminating should have been covered by the register allocator, but in fact they aren't. If you check my other e-mail, I was able to reduce the number of instructions of a single block (already reg-allocated) from 122 to "just" 108.

> Or are you doing something else apart from V-V coalescing?

I'm also killing R-V dead stores. (These happen when some instruction updates e.g. %eax, then %eax is moved to a virtual reg, but that reg is never used.)

>> (memcheck creates some unnecessary moves because
>> of the dirty handler arguments).
>
> Yes .. it could probably do better in cases where a V(irtual) register
> is moved to a R(eal) register, and the V's live range ends at that
> point. I think I tried to add a "preference" mechanism to regalloc2.c,
> which says, for each virtual reg, which real reg it would "prefer" to
> be in, for this reason. But I also think that didn't help, because
> it constrains regalloc's choice of registers and so just moves the
> reg-to-reg moves elsewhere. However, I didn't investigate much; you
> may be able to do better.
>
>> Do you know what might be causing the problem?
>
> Check carefully the validity of the transformation(s) you do.
> It's easy to break stuff at this level.

Ah, found the problem.. The problem is in the register allocator. I knew the problem couldn't be in my nice code :P Seriously, take a look at this:

after peephole optimization:

9   call[1] 0x38007070
10  movl %eax,%vr30
11  movzwl (%vr2),%vr69
12  cmpl $0x0,%vr26
13  callnz[0] 0x38006920
14  andl $0xFFFF,%vr30
15  movl %vr30,%edx
16  movl %vr1,%eax
17  call[2] 0x38006C30

after register allocation:

12  call[1] 0x38007070
13  movl %eax,%edx
14  movl 0x288(%ebp),%ecx
15  movzwl (%ecx),%eax
16  cmpl $0x0,0x2A0(%ebp)
17  movl %edx,0x2A8(%ebp)    ; %vr30 to memory
18  movl %eax,0x2B8(%ebp)
19  callnz[0] 0x38006920
20  movl 0x2A8(%ebp),%edx    ; load %vr30 from memory
21  andl $0xFFFF,%edx
22  movl 0x2A8(%ebp),%edx    ; <-- %vr30 is in %edx, not in memory
23  movl %edi,%eax
24  call[2] 0x38006C30

So the problem is that the register allocator gets confused somehow and loads %vr30 twice, destroying its value (i.e. line 22 is bogus and could be removed altogether).

Thanks,
Nuno
|
From: Julian S. <js...@ac...> - 2008-03-25 15:40:47
|
> > Or are you doing something else apart from V-V coalescing?
>
> I'm also killing R-V dead stores. (happens when some instruction updates
> e.g. %eax, then %eax is moved to a virtual reg, but it's never used).

Do you have an example of that, to look at?

> 20  movl 0x2A8(%ebp),%edx    ; load %vr30 from memory
> 21  andl $0xFFFF,%edx
> 22  movl 0x2A8(%ebp),%edx    ; <-- %vr30 is in %edx, not in memory
> 23  movl %edi,%eax
> 24  call[2] 0x38006C30
>
> So the problem is that the register allocator gets confused somehow and
> loads %vr30 twice, destroying its value (i.e. line 22 is bogus and
> could be removed altogether).

That's very strange. After the reload on line 20, it should indeed have noted that %vr30 is now in %edx, so then there would be no need to incorrectly reload it again at line 22. I have no idea why.

J
|
From: Nuno L. <nun...@sa...> - 2008-03-25 15:55:44
|
>> > Or are you doing something else apart from V-V coalescing?
>>
>> I'm also killing R-V dead stores. (happens when some instruction updates
>> e.g. %eax, then %eax is moved to a virtual reg, but it's never used).
>
> Do you have an example of that, to look at?

Sure:

-- t11 = Shr32(64HIto32(MullU32(0xCCCCCCCD:I32,Add32(Shl32(LDle:I32(t25),0x2:I8),0x27:I32))),0x4:I8)
movl $0xCCCCCCCD,%vr139
movl (%vr26),%vr142
movl %vr142,%vr141
shll $2,%vr141
movl %vr141,%vr140
addl $0x27,%vr140
movl %vr140,%eax
umull %vr139
movl %edx,%vr138
movl %eax,%vr137     ; <-- dead assignment to %vr137
movl %vr138,%vr136
shrl $4,%vr136
movl %vr136,%vr12

%vr137 is never referenced after that assignment, so the peephole optimizer removes that line.

>> 20  movl 0x2A8(%ebp),%edx    ; load %vr30 from memory
>> 21  andl $0xFFFF,%edx
>> 22  movl 0x2A8(%ebp),%edx    ; <-- %vr30 is in %edx, not in memory
>> 23  movl %edi,%eax
>> 24  call[2] 0x38006C30
>>
>> So the problem is that the register allocator gets confused somehow and
>> loads %vr30 twice, destroying its value (i.e. line 22 is bogus and
>> could be removed altogether).
>
> That's very strange. After the reload on line 20, it should indeed
> have noted that %vr30 is now in %edx, so then there would be no need
> to incorrectly reload it again at line 22. I have no idea why.

So, any idea on how to debug that problem? I have zero knowledge about register allocation algorithms, hence my reluctance to get my hands on it..

Regards,
Nuno
|
From: Nuno L. <nun...@sa...> - 2008-03-26 13:23:35
|
>>> So the problem is that the register allocator gets confused somehow and
>>> loads %vr30 twice, destroying its value (i.e. line 22 is bogus and
>>> could be removed altogether).
>>
>> That's very strange. After the reload on line 20, it should indeed
>> have noted that %vr30 is now in %edx, so then there would be no need
>> to incorrectly reload it again at line 22. I have no idea why.
>
> So, any idea on how to debug that problem? I have zero knowledge about
> register allocation algorithms, hence my reluctance to get my hands
> on it..

So this patch almost fixes the problem:
http://web.ist.utl.pt/nuno.lopes/vex_regalloc_mov_vr.txt

It removes mov instructions between a virtual register and a real register, as long as the virtual register is mapped to the real register. But then it fails the sanity check at line 939 (rreg_state[k].disp == Unavail) for the rreg of the previous mov instruction we had just skipped. So the idea seems OK, but the patch needs some work to fix that assertion. Julian, could you take a look at it, please?

Thanks,
Nuno

-------------------
output of regalloc showing the bug:

====----====---- Insn 15 ----====----====
---- movl %vr30,%edx

Initial state:
  rreg_state[ 0] = %eax    Free
  rreg_state[ 1] = %ebx    BoundTo %vr8
  rreg_state[ 2] = %ecx    Free
  rreg_state[ 3] = %edx    BoundTo %vr30
  rreg_state[ 4] = %esi    BoundTo %vr68
  rreg_state[ 5] = %edi    BoundTo %vr1
  rreg_state[ 6] = %fake0  Free
  rreg_state[ 7] = %fake1  Free
  rreg_state[ 8] = %fake2  Free
  rreg_state[ 9] = %fake3  Free
  rreg_state[10] = %fake4  Free
  rreg_state[11] = %fake5  Free
  rreg_state[12] = %xmm0   Free
  rreg_state[13] = %xmm1   Free
  rreg_state[14] = %xmm2   Free
  rreg_state[15] = %xmm3   Free
  rreg_state[16] = %xmm4   Free
  rreg_state[17] = %xmm5   Free
  rreg_state[18] = %xmm6   Free
  rreg_state[19] = %xmm7   Free
  vreg_state[0 .. 77]: [1] -> 5  [8] -> 1  [30] -> 3  [68] -> 4

need to free up rreg: %edx

After pre-insn actions for fixed regs:
  rreg_state[ 0] = %eax    Free
  rreg_state[ 1] = %ebx    BoundTo %vr8
  rreg_state[ 2] = %ecx    Free
  rreg_state[ 3] = %edx    Unavail
  rreg_state[ 4] = %esi    BoundTo %vr68
  rreg_state[ 5] = %edi    BoundTo %vr1
  rreg_state[ 6] = %fake0  Free
  rreg_state[ 7] = %fake1  Free
  rreg_state[ 8] = %fake2  Free
  rreg_state[ 9] = %fake3  Free
  rreg_state[10] = %fake4  Free
  rreg_state[11] = %fake5  Free
  rreg_state[12] = %xmm0   Free
  rreg_state[13] = %xmm1   Free
  rreg_state[14] = %xmm2   Free
  rreg_state[15] = %xmm3   Free
  rreg_state[16] = %xmm4   Free
  rreg_state[17] = %xmm5   Free
  rreg_state[18] = %xmm6   Free
  rreg_state[19] = %xmm7   Free
  vreg_state[0 .. 77]: [1] -> 5  [8] -> 1  [68] -> 4

** movl 0x2A8(%ebp),%edx

After dealing with current insn:
  rreg_state[ 0] = %eax    Free
  rreg_state[ 1] = %ebx    BoundTo %vr8
  rreg_state[ 2] = %ecx    Free
  rreg_state[ 3] = %edx    Unavail
  rreg_state[ 4] = %esi    BoundTo %vr68
  rreg_state[ 5] = %edi    BoundTo %vr1
  rreg_state[ 6] = %fake0  Free
  rreg_state[ 7] = %fake1  Free
  rreg_state[ 8] = %fake2  Free
  rreg_state[ 9] = %fake3  Free
  rreg_state[10] = %fake4  Free
  rreg_state[11] = %fake5  Free
  rreg_state[12] = %xmm0   Free
  rreg_state[13] = %xmm1   Free
  rreg_state[14] = %xmm2   Free
  rreg_state[15] = %xmm3   Free
  rreg_state[16] = %xmm4   Free
  rreg_state[17] = %xmm5   Free
  rreg_state[18] = %xmm6   Free
  rreg_state[19] = %xmm7   Free
  vreg_state[0 .. 77]: [1] -> 5  [8] -> 1  [68] -> 4

After post-insn actions for fixed regs:
  rreg_state[ 0] = %eax    Free
  rreg_state[ 1] = %ebx    BoundTo %vr8
  rreg_state[ 2] = %ecx    Free
  rreg_state[ 3] = %edx    Unavail
  rreg_state[ 4] = %esi    BoundTo %vr68
  rreg_state[ 5] = %edi    BoundTo %vr1
  rreg_state[ 6] = %fake0  Free
  rreg_state[ 7] = %fake1  Free
  rreg_state[ 8] = %fake2  Free
  rreg_state[ 9] = %fake3  Free
  rreg_state[10] = %fake4  Free
  rreg_state[11] = %fake5  Free
  rreg_state[12] = %xmm0   Free
  rreg_state[13] = %xmm1   Free
  rreg_state[14] = %xmm2   Free
  rreg_state[15] = %xmm3   Free
  rreg_state[16] = %xmm4   Free
  rreg_state[17] = %xmm5   Free
  rreg_state[18] = %xmm6   Free
  rreg_state[19] = %xmm7   Free
  vreg_state[0 .. 77]: [1] -> 5  [8] -> 1  [68] -> 4
|
From: Nuno L. <nun...@sa...> - 2008-03-26 15:38:39
|
>> So, any idea on how to debug that problem? I have zero knowledge about
>> register allocation algorithms, hence my reluctance to get my hands
>> on it..
>
> So this patch almost fixes the problem:
> http://web.ist.utl.pt/nuno.lopes/vex_regalloc_mov_vr.txt
> It removes mov instructions between a virtual register and a real
> register, as long as the virtual register is mapped to the real register.
> But then it fails the sanity check at line 939 (rreg_state[k].disp ==
> Unavail) for the rreg of the previous mov instruction we had just skipped.
> So, the idea seems OK, but the patch needs some work to fix that
> assertion. Julian, could you take a look at it, please?

Never mind. I've already discovered the problem and updated the patch accordingly (link above). Can you please check whether the patch is OK (and commit it if it is)?

Thanks,
Nuno
|
From: Nuno L. <nun...@sa...> - 2008-03-28 17:56:33
|
Hi,

In the past few days I have improved the patches I provided previously, so this e-mail is a summary of the patches I have for valgrind (for now).

http://web.ist.utl.pt/nuno.lopes/vex_regalloc_eqspill_bugfix.txt
Fixes a bug in the register allocator, which assumes that a register modified in the same instruction that triggers the reload isn't dirty. This means that in e.g.:

addl %vr1, %vr2

if %vr2 has to be reloaded and isn't written before the next spill, it won't get written back to memory (because eq_spill_slot == True).

http://web.ist.utl.pt/nuno.lopes/vex_regalloc_mov_vr.txt
An optimization that removes redundant 'movl %vrX, %RR' when %vrX is mapped to the real register %RR.

http://web.ist.utl.pt/nuno.lopes/vex_peephole_optimizations.txt
The peephole optimizer I described earlier. It does some coalescing and removes redundant moves between virtual registers. On code instrumented with memcheck it can reduce the number of reg-allocated instructions by up to 12%.

All the regression tests pass after applying these patches (except those that didn't pass before applying the patches). I've only tested on an x86 PC, but I'll soon try on x86_64 and ppc64.

Regards,
Nuno

P.S.: I have two colleagues who are experimenting with other optimizations as well. They should submit a few patches in the next few days (we hope).
|
From: Filipe C. <fi...@gm...> - 2008-04-03 23:49:29
|
Hi,

I've put my patch at:
http://web.ist.utl.pt/~filipe.cabecinhas/patches/vex-amd64-and-x86-IP-Store-optimization.txt

This patch:
- Adds a Store imm->mem (valgrind's x86 and amd64 Store only stores from a register to memory, even though the ISA permits storing an immediate to a register/memory). This is needed for the optimization;
- Changes every use of X86/AMD64Instr_Store to use X86/AMD64RI instead of HReg as the first parameter;
- Optimizes PUT(EIP) and PUT(RIP) when the previously stored value is known.

Since PUT(EIP/RIP) is used so often, we thought it would be good to stop loading an imm64 into a register and then storing it to memory when one knows the upper 56 bits are the same (for example). In this description, IP == EIP or RIP (depending on the mode).

The optimization is:
- add a slot to the ISelEnv structure to keep track of the last stored IP in the current basic block;
- when a PUT(IP) is encountered, behave as before, but record the stored IP (if it's an immediate) in env;
- when the next PUT(IP) is encountered, if we can store only a byte, a word or a double-word, store only those bits, not the full 64 bits.

I'm assuming there is never a PUT(IP+1) instruction, for example. Although it's not difficult to deal with that, I think it's not needed, because the IP can't change in arbitrary ways like the other registers, and we could use the spare CPU cycles. :-)

In the perf tests it does rather well, beating the regular valgrind in almost every benchmark by a small percentage (but hey, every microsecond counts ;)).

By being able to store an imm8/imm16(/imm32) directly to memory, it's now possible to optimize other places in libVEX where an X86/AMD64Instr_Store is used, by not using so many temporary registers. I opted not to do that right now so the patch wouldn't be too big.

Please tell me what you think of the patch and if it needs improving. Attached are the code expansion rates and perf timings for the normal valgrind and the (patched) new-valgrind.

Regards,
- Filipe Cabecinhas
|
From: Filipe C. <fi...@gm...> - 2008-04-25 15:49:13
|
Hi,

I'm doing the same optimization on ppc(64), but now I have a little problem. What's the best (semantically) way to get the offset of the middle of a register in VEX?

I have the constant that is getting stored in the CIA register, but it may have 2, 4 or 8 bytes (maybe I'll check for 6 bytes, but I don't think it's worth it, considering the time to make those tests). So, powerpc being a big-endian machine, I have to store it at offset(CIA)+size(CIA)-bytes(newCIA).

I could do simple arithmetic with offset(CIA) and offset(CR0_0) (this one is rather fragile). I could also do some arithmetic with sizeof(HWord). But maybe there's a way to find the size of a register in VEX... Is there a way to do that?

Thanks for the help,
- Filipe Cabecinhas
|
From: Filipe C. <fi...@gm...> - 2008-04-28 13:37:00
Attachments:
perf-code-expansion
perf-tests
|
Hi,

On 4 Apr, 2008, at 00:49, Filipe Cabecinhas wrote:

> The optimization is:
> - add a slot in the ISelEnv structure to keep track of the last stored
>   IP in the current basic block;
> - when a PUT(IP) is encountered, behave as before but store the stored
>   IP (if it's an immediate) in env;
> - when the next PUT(IP) is encountered, if we can only store a byte, a
>   word or a double-word, only store those bits, not the full 64 bits.

I've ported this patch to powerpc with the following difference: when only the lower 16/32 bits differ, only store those bits to memory. Not with an immediate, but it still saves some instructions. To implement this, I change the IR instruction and let it continue to a normal Ist_Put.

Maybe this would be better done in the non-hardware-dependent part of VEX, but then there's the problem of not being able to tell whether it's better to change only some bits or not, so I don't know if it's feasible. I may also refactor the x86 and amd64 patch to do the same thing so it's consistent, if you think that's better.

The powerpc patch is at:
http://web.ist.utl.pt/~filipe.cabecinhas/patches/vex-CIA-optimization.patch

If you think anything is wrong with the patch, please comment.

- Filipe Cabecinhas

P.S: Attached are the code expansion rates and the performance tests' results.
|
From: Julian S. <js...@ac...> - 2008-03-29 13:12:09
|
> The peephole optimizer I described earlier. It does some coalescing and
> removes redundant moves between virtual registers. On code instrumented
> with memcheck it can reduce the number of reg-allocated instructions by
> up to 12%.
>
> All the regression tests pass after applying these patches (except those
> that didn't pass before applying the patches).

What about something really large, like konqueror, mozilla, OOo? Do you have some more comprehensive before/after numbers?

* what are the before/after code expansion ratios for apps as a whole,
  as shown by these lines (run with -v)?

  --8732-- transtab: new 3,095 (68,098 -> 1,067,412; ratio 156:10)

* what are the actual measured performance effects? Have you made
  friends yet with 'make perf'?

J
|
From: Nuno L. <nun...@sa...> - 2008-03-30 14:30:55
Attachments:
perf-instrs-memcheck.txt
timings.txt
|
Hi,

I've run 'make perf' and I give the results in the attached file (to avoid weird email line wrapping). The tests were run on a Pentium-M 2.0 GHz + 1 GB RAM laptop with Linux. 'valgrind-orig' is vanilla valgrind trunk and 'valgrind' is valgrind trunk + the 3 patches I sent to the ML.

In summary, only bz2 and ffbench see improvements in the running time of memcheck. Attached you can also find the code expansion ratios for memcheck. In general all tests see a reduction in the number of instructions, and the biggest reduction is with the bz2 test.

So while the peephole optimizer is not ready yet, I think the other patches (bugfixes) should still be considered for inclusion.

Regards,
Nuno
|
From: Filipe C. <fi...@gm...> - 2008-03-29 17:45:52
|
On 29 Mar, 2008, at 13:08, Julian Seward wrote:

> What about something really large, like konqueror, mozilla, OOo?
> Do you have some more comprehensive before/after numbers?
>
> * what are the before/after code expansion ratios for apps as a whole,
>   as shown by these lines (run with -v)?
>
>   --8732-- transtab: new 3,095 (68,098 -> 1,067,412; ratio 156:10)
>
> * what are the actual measured performance effects? Have you made
>   friends yet with 'make perf'?

I just tried the optimizations with the openssl test; here are some numbers, tested with:

time ./vg-in-place --tool=none -v openssl speed

Before:

==10356==
--10356-- translate: fast SP updates identified: 0 ( --%)
--10356-- translate: generic_known SP updates identified: 0 ( --%)
--10356-- translate: generic_unknown SP updates identified: 0 ( --%)
--10356-- tt/tc: 32,633,199 tt lookups requiring 47,799,060 probes
--10356-- tt/tc: 32,633,199 fast-cache updates, 2 flushes
--10356-- transtab: new 8,362 (300,800 -> 2,120,621; ratio 70:10) [0 scs]
--10356-- transtab: dumped 0 (0 -> ??)
--10356-- transtab: discarded 0 (0 -> ??)
--10356-- scheduler: 9,698,533,428 jumps (bb entries).
--10356-- scheduler: 96,985/32,728,700 major/minor sched events.
--10356-- sanity: 96986 cheap, 419 expensive checks.
--10356-- exectx: 769 lists, 0 contexts (avg 0 per list)
--10356-- exectx: 0 searches, 0 full compares (0 per 1000)
--10356-- exectx: 0 cmp2, 0 cmp4, 0 cmpAll
--10356-- errormgr: 0 supplist searches, 0 comparisons during search
--10356-- errormgr: 0 errlist searches, 0 comparisons during search

real    8m52.593s
user    6m35.829s
sys     2m13.512s

After:

==14932==
--14932-- translate: fast SP updates identified: 0 ( --%)
--14932-- translate: generic_known SP updates identified: 0 ( --%)
--14932-- translate: generic_unknown SP updates identified: 0 ( --%)
--14932-- tt/tc: 34,848,394 tt lookups requiring 51,153,884 probes
--14932-- tt/tc: 34,848,394 fast-cache updates, 2 flushes
--14932-- transtab: new 8,359 (300,880 -> 2,119,507; ratio 70:10) [0 scs]
--14932-- transtab: dumped 0 (0 -> ??)
--14932-- transtab: discarded 0 (0 -> ??)
--14932-- scheduler: 10,034,632,377 jumps (bb entries).
--14932-- scheduler: 100,346/34,948,591 major/minor sched events.
--14932-- sanity: 100347 cheap, 427 expensive checks.
--14932-- exectx: 769 lists, 0 contexts (avg 0 per list)
--14932-- exectx: 0 searches, 0 full compares (0 per 1000)
--14932-- exectx: 0 cmp2, 0 cmp4, 0 cmpAll
--14932-- errormgr: 0 supplist searches, 0 comparisons during search
--14932-- errormgr: 0 errlist searches, 0 comparisons during search

real    8m52.360s
user    6m28.008s
sys     2m22.093s

There is only a marginal difference. I may try with memcheck if you want, but it will just give up on error reporting about half-way through due to hitting 1000000 errors. I could also try mozilla, firefox, OOo or other programs like those, but shouldn't the conditions be more or less similar? I couldn't figure out how to do a run with firefox that didn't imply human interaction; I'll look into that now.

FWIW: I'm running this on an Intel mac with VMware Fusion. I wasn't using the CPU outside VMware Fusion (only mild web browsing), so the test shouldn't have been skewed too much.

- Filipe Cabecinhas
|
From: Julian S. <js...@ac...> - 2008-03-29 18:49:05
|
Hi,
> Before:
> --10356-- transtab: new 8,362 (300,800 -> 2,120,621; ratio 70:10)
> After:
> --14932-- transtab: new 8,359 (300,880 -> 2,119,507; ratio 70:10) [0 scs]
--tool=memcheck stresses the register allocator and spiller much more
than --tool=none, so it is a better test of any optimisations. I suggest
you try it.
Also, maybe use the test cases in perf/; at least bz2.c, fbench.c and
ffbench.c ("make perf"). These are conveniently provided for exactly
this kind of reason :-)
> FWIW: I'm running this on an intel mac with vmware fusion. I wasn't
> using the CPU outside vmware fusion (only mild web browsing) so the
> test shouldn't have been skewed too much.
Hmm, getting reliable timing numbers even on a real machine is hard
enough; on a VM it is likely to be harder still, since the VMM
introduces extra sources of measurement "noise".
J
|
|
From: Nicholas N. <nj...@cs...> - 2008-03-30 04:23:33
|
On Sat, 29 Mar 2008, Julian Seward wrote:
> Also, maybe use the test cases in perf/; at least bz2.c, fbench.c and
> ffbench.c ("make perf").
Also 'tinycc'.
Nick
|
|
From: Filipe C. <fi...@gm...> - 2008-03-29 18:57:05
|
On 29 Mar, 2008, at 13:08, Julian Seward wrote:
>
> What about something really large, like konqueror, mozilla, OOo?
>
> Do you have some more comprehensive before/after numbers?
>
> * what are the before/after code expansion ratios for apps as a whole,
>   as shown by these lines (run with -v) ?
>
>   --8732-- transtab: new 3,095 (68,098 -> 1,067,412; ratio 156:10)
>
> * what are the actual measured performance effects?  Have you made
>   friends yet with 'make perf' ?
>
> J

Also, the "make perf" suite:

Old valgrind:

-- Running tests in perf ----------------------------------------------
bigcode1 valgrind    :0.23s  no: 2.4s (10.6x, -----)  me: 4.7s ( 20.3x, -----)
bigcode2 valgrind    :0.28s  no: 6.3s (22.6x, -----)  me:13.7s ( 49.0x, -----)
bz2      valgrind    :0.99s  no: 2.4s ( 2.4x, -----)  me:10.7s ( 10.8x, -----)
fbench   valgrind    :0.56s  no: 0.7s ( 1.3x, -----)  me: 7.1s ( 12.7x, -----)
ffbench  valgrind    :0.40s  no: 0.9s ( 2.3x, -----)  me: 5.2s ( 13.0x, -----)
heap     valgrind    :0.28s  no: 0.8s ( 2.7x, -----)  me:15.7s ( 56.0x, -----)
sarp     valgrind    :0.03s  no: 0.2s ( 6.0x, -----)  me: 3.6s (119.0x, -----)
tinycc   valgrind    :0.31s  no: 2.1s ( 6.7x, -----)  me:19.2s ( 62.0x, -----)
-- Finished tests in perf ----------------------------------------------

real    2m13.018s
user    1m39.190s
sys     0m32.710s

Patched valgrind:

-- Running tests in perf ----------------------------------------------
bigcode1 new-valgrind:0.24s  no: 2.9s (12.2x, -----)  me: 6.1s ( 25.4x, -----)
bigcode2 new-valgrind:0.27s  no: 7.8s (28.7x, -----)  me:16.1s ( 59.6x, -----)
bz2      new-valgrind:0.96s  no: 4.8s ( 5.0x, -----)  me:13.4s ( 14.0x, -----)
fbench   new-valgrind:0.57s  no: 1.2s ( 2.0x, -----)  me: 6.4s ( 11.3x, -----)
ffbench  new-valgrind:0.40s  no: 0.9s ( 2.3x, -----)  me: 5.2s ( 13.0x, -----)
heap     new-valgrind:0.27s  no: 0.7s ( 2.7x, -----)  me:14.5s ( 53.6x, -----)
sarp     new-valgrind:0.04s  no: 0.2s ( 4.2x, -----)  me: 3.6s ( 89.5x, -----)
tinycc   new-valgrind:0.31s  no: 1.5s ( 4.9x, -----)  me:17.9s ( 57.6x, -----)
-- Finished tests in perf ----------------------------------------------

real    2m11.275s
user    1m46.779s
sys     0m23.545s

This is with an x86_64 machine with vmware. Is there a way to compile
the tests in 32-bit mode, or must I do some hacking in the Makefiles?

I'll revert the optimization patch so we can see whether the difference
comes from the regalloc bugfixes or from the optimization.

- Filipe Cabecinhas
|
|
From: Julian S. <js...@ac...> - 2008-03-29 19:06:41
|
> This is with a x86_64 machine with vmware. Is there a way to compile
> the tests in 32-bit mode? Or must I do some hacking in the Makefiles?

I think (although I'm not sure) that you can get a complete 32-bit-only
build if you build from clean, giving --enable-only32bit to ./configure.

btw, if you give "--vg=" to perf/vg_perf multiple times, then it will
summarise all the results in one table, which makes them much easier to
compare. (try "perl perf/vg_perf --help")

The code expansion numbers (eg 300,800 -> 2,120,621; ratio 70:10) are
important, though. At this level of tuning any speedup often gets lost
in the measurement noise, and so these numbers are a much more reliable
way to see if you managed to generate less code. What they mean is

  new 8,362 (300,800 -> 2,120,621; ratio 70:10)

  8362 blocks translated
  300800 bytes of original code in those blocks converted to
  2120621 bytes by vex

J
|
|
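The arithmetic behind those transtab lines is easy to check by hand. A
minimal sketch (the regular expression and the function name are mine,
not part of valgrind; it just re-derives the expansion ratio from the
printed byte counts):

```python
import re

def expansion_ratio(transtab_line):
    """Parse a '--pid-- transtab: new N (G -> H; ...)' line and return
    (blocks, guest_bytes, host_bytes, expansion)."""
    m = re.search(r"new ([\d,]+) \(([\d,]+) -> ([\d,]+)", transtab_line)
    blocks, guest, host = (int(s.replace(",", "")) for s in m.groups())
    return blocks, guest, host, host / guest

line = "--10356-- transtab: new 8,362 (300,800 -> 2,120,621; ratio 70:10) [0 scs]"
blocks, guest, host, ratio = expansion_ratio(line)
print(blocks, guest, host, round(ratio, 2))  # 8362 300800 2120621 7.05
```

2,120,621 / 300,800 is about 7.05, which is the "ratio 70:10" valgrind
prints (tenths, so 70:10 means roughly 7x code expansion).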
From: Filipe C. <fi...@gm...> - 2008-03-29 20:28:50
|
On 29 Mar, 2008, at 19:02, Julian Seward wrote:
>
> I think (although not sure) that you can get a complete 32-bit only
> build if you build from clean, giving --enable-only32bit to
> ./configure.
>
> btw, if you give "--vg=" to perf/vg_perf multiple times, then it will
> summarise all the results in one table, which makes them much easier
> to compare. (try "perl perf/vg_perf --help")

Thanks. Here are the stats from an amd64 server with one core that was
idling:

  valgrind           - original
  new-valgrind       - with Nuno's regalloc patches
  optimized-valgrind - with those patches + optimization

filcab@escher ~/cvsw/valgrind $ cat ../perf-timings
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 new-valgrind      :0.29s  no: 7.5s (25.8x, -----)  me:10.9s (37.6x, -----)
bigcode1 optimized-valgrind:0.29s  no: 7.6s (26.1x, -1.1%)  me:11.1s (38.3x, -2.0%)
bigcode1 valgrind          :0.29s  no: 7.5s (25.8x,  0.3%)  me:10.9s (37.5x,  0.2%)
-- bigcode2 --
bigcode2 new-valgrind      :0.29s  no:12.3s (42.3x, -----)  me:21.1s (72.9x, -----)
bigcode2 optimized-valgrind:0.29s  no:12.4s (42.9x, -1.5%)  me:21.9s (75.4x, -3.5%)
bigcode2 valgrind          :0.29s  no:12.2s (42.1x,  0.4%)  me:21.2s (73.0x, -0.2%)
-- bz2 --
bz2      new-valgrind      :1.02s  no: 8.7s ( 8.6x, -----)  me:18.7s (18.4x, -----)
bz2      optimized-valgrind:1.02s  no: 8.7s ( 8.5x,  0.7%)  me:18.6s (18.2x,  1.0%)
bz2      valgrind          :1.02s  no: 8.7s ( 8.5x,  0.3%)  me:19.1s (18.7x, -2.0%)
-- fbench --
fbench   new-valgrind      :0.57s  no: 3.4s ( 6.0x, -----)  me: 8.9s (15.7x, -----)
fbench   optimized-valgrind:0.57s  no: 3.4s ( 6.0x,  0.0%)  me: 9.0s (15.7x, -0.1%)
fbench   valgrind          :0.57s  no: 3.5s ( 6.1x, -0.9%)  me: 9.0s (15.7x, -0.2%)
-- ffbench --
ffbench  new-valgrind      :2.25s  no: 3.5s ( 1.6x, -----)  me: 8.2s ( 3.6x, -----)
ffbench  optimized-valgrind:2.25s  no: 3.6s ( 1.6x, -0.8%)  me: 8.2s ( 3.6x, -0.6%)
ffbench  valgrind          :2.25s  no: 3.5s ( 1.6x,  0.0%)  me: 8.1s ( 3.6x,  0.1%)
-- heap --
heap     new-valgrind      :0.24s  no: 2.8s (11.5x, -----)  me:16.4s (68.5x, -----)
heap     optimized-valgrind:0.24s  no: 2.8s (11.7x, -1.1%)  me:16.8s (70.0x, -2.3%)
heap     valgrind          :0.24s  no: 2.8s (11.7x, -1.1%)  me:16.7s (69.8x, -1.9%)
-- sarp --
sarp     new-valgrind      :0.11s  no: 0.6s ( 5.0x, -----)  me: 4.3s (39.2x, -----)
sarp     optimized-valgrind:0.11s  no: 0.6s ( 5.2x, -3.6%)  me: 4.3s (39.5x, -0.9%)
sarp     valgrind          :0.11s  no: 0.6s ( 5.1x, -1.8%)  me: 4.3s (39.1x,  0.2%)
-- tinycc --
tinycc   new-valgrind      :0.38s  no: 5.3s (13.9x, -----)  me:23.0s (60.6x, -----)
tinycc   optimized-valgrind:0.38s  no: 5.3s (14.0x, -0.2%)  me:23.2s (61.1x, -0.7%)
tinycc   valgrind          :0.38s  no: 5.3s (14.0x, -0.4%)  me:23.1s (60.8x, -0.3%)
-- Finished tests in perf ----------------------------------------------
== 8 programs, 48 timings =================
475.29user 5.76system 8:01.96elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+731739minor)pagefaults 0swaps

I can't test it decently with firefox or something, because right now I
don't have physical access (well, maybe I could just kill firefox after
some minutes). Also, tracing firefox is quite strange because there are
some shell scripts in between, and I have to search for the firefox-bin
process afterwards so I can get the stats. Is there a way to
--trace-children=yes and then get the global stats? I think running
"valgrind -v valgrind firefox" may be a bit of overkill.

- Filipe Cabecinhas
|
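For readers decoding the tables above: the "NN.Nx" columns are, to a
good approximation, just the tool run time divided by the native run
time (the printed times are themselves rounded, so a recomputed factor
can differ in the last digit). A quick sketch, with a function name of
my own choosing:

```python
def slowdown(native_secs, tool_secs):
    """Slowdown factor in the style of vg_perf's 'NN.Nx' columns:
    how many times slower the tool run is than the native run."""
    return round(tool_secs / native_secs, 1)

# fbench above: native 0.57s, --tool=none ("no:") 3.4s -> 6.0x
print(slowdown(0.57, 3.4))   # 6.0
# ffbench: native 2.25s, memcheck ("me:") 8.2s -> 3.6x
print(slowdown(2.25, 8.2))   # 3.6
```

This matches the pattern in the data: memcheck ("me:") runs are
consistently several times slower than --tool=none ("no:") runs, which
is why Julian suggested memcheck as the better stress test for
register-allocation changes.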