From: Nathan F. <fr...@cs...> - 2006-07-30 04:11:38
|
On a random walk through some of the bignum code, tightening a few loose screws, I disassembled LOGAND-SHORTER-NEGATIVE on a whim (x86): ; 288: 8F45F8 POP DWORD PTR [EBP-8] ; 28B: 8D65E0 LEA ESP, [EBP-32] ; 28E: 8B5DF0 MOV EBX, [EBP-16] ; 291: 8B4DEC MOV ECX, [EBP-20] ; 294: 8955E8 MOV [EBP-24], EDX ; 297: C745EC00000000 MOV DWORD PTR [EBP-20], 0 ; 29E: EB36 JMP L1 ; 2A0: L0: 8B45E8 MOV EAX, [EBP-24] ; 2A3: 8B55EC MOV EDX, [EBP-20] ; 2A6: 8B4410FD MOV EAX, [EAX+EDX-3] ; 2AA: 8945F4 MOV [EBP-12], EAX ; 2AD: 8B45EC MOV EAX, [EBP-20] ; 2B0: 8B4406FD MOV EAX, [ESI+EAX-3] ; 2B4: 8945F0 MOV [EBP-16], EAX ; 2B7: 8B45F4 MOV EAX, [EBP-12] ; 2BA: 2345F0 AND EAX, [EBP-16] ; 2BD: 8945F4 MOV [EBP-12], EAX ; 2C0: 8B55EC MOV EDX, [EBP-20] ; 2C3: 8B45F4 MOV EAX, [EBP-12] ; 2C6: 894411FD MOV [ECX+EDX-3], EAX ; 2CA: 8945F4 MOV [EBP-12], EAX ; 2CD: 8B45EC MOV EAX, [EBP-20] ; 2D0: 83C004 ADD EAX, 4 ; 2D3: 8945EC MOV [EBP-20], EAX ; 2D6: L1: 397DEC CMP [EBP-20], EDI ; 2D9: 7CC5 JL L0 ; 2DB: 8BC7 MOV EAX, EDI ; 2DD: EB0B JMP L3 ; 2DF: L2: 8B5406FD MOV EDX, [ESI+EAX-3] ; 2E3: 895401FD MOV [ECX+EAX-3], EDX ; 2E7: 83C004 ADD EAX, 4 ; 2EA: L3: 39D8 CMP EAX, EBX ; 2EC: 75F1 JNE L2 Pretty ugly, but then again, the x86 doesn't have many registers and this code requires just enough to stretch the register allocator. Three registers for the bignums, one for the loop index, one for the loop bound, and two temporaries to hold the bignum digits from the bignums being AND'd together. "No problem," I think, "we can actually use a count-down-to-zero loop to save a register." This means that we're down to six required registers, which should be OK. Resulting code as seen from SLIME: ; 1242E069: 8B45EC MOV EAX, [EBP-20] ; no-arg-parsing entry point ; 6C: 83E804 SUB EAX, 4 ; 6F: EB1D JMP L1 ; 71: L0: 8BC8 MOV ECX, EAX ; 73: 8BD0 MOV EDX, EAX ; 75: 8B5DE8 MOV EBX, [EBP-24] ; 78: 8B5413FD MOV EDX, [EBX+EDX-3] ; 7C: 8BD8 MOV EBX, EAX ; 7E: 8B7DF4 MOV EDI, [EBP-12] ; 81: 8B5C1FFD MOV EBX, [EDI+EBX-3] ; 85: 21DA AND EDX, EBX ; 87: 89540EFD MOV [ESI+ECX-3], EDX ; 8B: 83E804 SUB EAX, 4 ; 8E: L1: 83F800 CMP EAX, 0 ; 91: 7DDE JNL L0 ; 93: 8B45EC MOV EAX, [EBP-20] ; 96: EB0E JMP L3 ; 98: L2: 8B4DF4 MOV ECX, [EBP-12] ; 9B: 8B4C01FD MOV ECX, [ECX+EAX-3] ; 9F: 894C06FD MOV [ESI+EAX-3], ECX ; A3: 83C004 ADD EAX, 4 ; A6: L3: 3B45F0 CMP EAX, [EBP-16] ; A9: 7CED JL L2 Not quite as tight as you might find in GMP or similar, but pretty good. Some of those MOVs look a little suspicious, but it's a lot better than the previous monstrosity. After a rebuild, however, the cross-compiler turns the function into: ; E00: 8F45F8 POP DWORD PTR [EBP-8] ; E03: 8D65E0 LEA ESP, [EBP-32] ; E06: 8BC7 MOV EAX, EDI ; E08: 8B5DF0 MOV EBX, [EBP-16] ; E0B: 8B4DEC MOV ECX, [EBP-20] ; E0E: 8955E4 MOV [EBP-28], EDX ; E11: 8BF8 MOV EDI, EAX ; E13: 83E804 SUB EAX, 4 ; E16: 8945E8 MOV [EBP-24], EAX ; E19: EB48 JMP L1 ; E1B: L0: 8B45E8 MOV EAX, [EBP-24] ; E1E: 8945F4 MOV [EBP-12], EAX ; E21: 8B45E8 MOV EAX, [EBP-24] ; E24: 8945F0 MOV [EBP-16], EAX ; E27: 8B45E4 MOV EAX, [EBP-28] ; E2A: 8B55F0 MOV EDX, [EBP-16] ; E2D: 8B4410FD MOV EAX, [EAX+EDX-3] ; E31: 8945F0 MOV [EBP-16], EAX ; E34: 8B45E8 MOV EAX, [EBP-24] ; E37: 8945EC MOV [EBP-20], EAX ; E3A: 8B45EC MOV EAX, [EBP-20] ; E3D: 8B4406FD MOV EAX, [ESI+EAX-3] ; E41: 8945EC MOV [EBP-20], EAX ; E44: 8B45F0 MOV EAX, [EBP-16] ; E47: 2345EC AND EAX, [EBP-20] ; E4A: 8945F0 MOV [EBP-16], EAX ; E4D: 8B55F4 MOV EDX, [EBP-12] ; E50: 8B45F0 MOV EAX, [EBP-16] ; E53: 894411FD MOV [ECX+EDX-3], EAX ; E57: 8945F0 MOV [EBP-16], EAX ; E5A: 8B45E8 MOV EAX, [EBP-24] ; E5D: 83E804 SUB EAX, 4 ; E60: 8945E8 MOV [EBP-24], EAX ; E63: L1: 837DE800 CMP DWORD PTR [EBP-24], 0 ; E67: 7DB2 JNL L0 ; E69: 8BC7 MOV EAX, EDI ; E6B: EB0B JMP L3 ; E6D: L2: 8B5406FD MOV EDX, [ESI+EAX-3] ; E71: 895401FD MOV [ECX+EAX-3], EDX ; E75: 83C004 ADD EAX, 4 ; E78: L3: 39D8 CMP EAX, EBX ; E7A: 7CF1 JL L2 An optimist will notice that the cleanup loop (L2 to L3) has actually improved, since we're not reloading the bignum from the stack in each iteration. However, the core loop (L0 to L1) is awful. Lots of unnecessary movement between the stack and registers. Is this a cross-compiler problem (e.g. its inability to do let conversion or similar--a problem I don't really understand), a register allocator problem, or both? Staring at the trace files suggests that it might be a problem in the cross compiler. A trace file from a COMPILE-FILE of the revised LOGAND-SHORTER-NEGATIVE contains: IR1 block 20 start c143 start stack: 143>144: %BIGNUM-SET {GLOBAL-FUNCTION} 145>146: RES 147>148: I 149>150: %BIGNUM-REF {GLOBAL-FUNCTION} 151>152: A 153>154: I 155>156: known combination v150 v152 v154 157>158: %BIGNUM-REF {GLOBAL-FUNCTION} 159>160: B 161>162: I 163>164: known combination v158 v160 v162 end stack: successors c165 which is what you would expect. However, the corresponding part in a trace file from an SBCL compilation contains: IR1 block 4 start c51 start stack: 51> 52: %BIGNUM-SET {GLOBAL-FUNCTION} 53> 54: RES 55> 56: I 57> 58: %BIGNUM-REF {GLOBAL-FUNCTION} 59> 60: I 61> 62: known combination v58 v15 v60 63> 64: %BIGNUM-REF {GLOBAL-FUNCTION} 65> 66: B 67> 68: I 69> 70: known combination v64 v66 v68 end stack: successors c71 Notice the lack of a explicit reference to A. Clearly, however, it's still linked in somewhere (the '61> 62' indicates that it's 'v15'). Scanning backwards, we find: 14> 15: cast v5 -[#<NAMED-TYPE *> -> #<UNION-TYPE BIGNUM>] 16> 17: cast v7 -[#<NAMED-TYPE *> -> #<NUMERIC-TYPE (UNSIGNED-BYTE 24)>] 18> 19: cast v9 -[#<NAMED-TYPE *> -> #<UNION-TYPE BIGNUM>] 20> 21: cast v11 -[#<NAMED-TYPE *> -> #<NUMERIC-TYPE (UNSIGNED-BYTE 24)>] 22> 23: cast v13 -[#<NAMED-TYPE *> -> #<UNION-TYPE BIGNUM>] 24> local combination v3 <none> v17 v19 v21 v23 25> bind SB!C::CLAMBDA LOGAND-SHORTER-NEGATIVE :KIND :LET At 24>, we would expect 'local combination v3 v15 v17 v19 v21 v23', except that v15 is missing here... Looking at other binds of :KIND :LET in the trace file reveals similar <none> markers where one would expect to find references to continuations. I don't know what this means, or how to fix it, but it suggests something is going wrong inside IR1. IR1 experts, comments? -- Nathan | From Man's effeminate slackness it begins. --Paradise Lost The last good thing written in C was Franz Schubert's Symphony Number 9. --Erwin Dieterich |