From: Nathan F. <fr...@gm...> - 2007-10-13 05:01:09
|
On 10/12/07, James Y Knight <fo...@fu...> wrote: > Taking a very simple example: > > (declaim (optimize (safety 0) (speed 3) (debug 0))) > (defun foo ()) > (defun test () (dotimes (i 10000) (foo))) > > And here's the code generated on x86-64, with sbcl 1.0.10.45. I > interspersed some comments; it seems to me that almost every > instruction generated is suboptimal in some way or another. Of > course, fixing much of it doesn't seem like it'll be particularly > easy... Comments? (I hope my lack of full understanding of how SBCL > works doesn't show through too much here. :) As a high-level note: some of your concerns could be addressed (specifically, the JMP to start the loop) by making DOTIMES a bit smarter about macro-expansion. Doing so wouldn't help cases like (LOOP FOR I FROM 16 BELOW 64 ...) -- unless you made those smarter as well -- but it'd be a start, and quite a bit easier than significant changes to the IR1/IR2 optimizers. I don't claim to address all your concerns perfectly--it's late here, for one thing, so I may not be thinking straight. :) > ; 6B3: L0: 48895DE8 MOV [RBP-24], RBX > save 'i' to stack, this shouldn't be necessary, 'i' should be stored > in a callee saved register (are there any in the calling convention?) I believe all non-argument passing registers are caller-saved. > ; 6B7: 488BD4 MOV RDX, RSP > ; 6BA: 4883EC18 SUB RSP, 24 > why is RSP decremented here instead of on entry to this function (or > not at all)? Because of the manner in which function calls are expanded from IR1 to IR2 and the lack of loop invariant code motion in IR2. > ; 6C5: 31C9 XOR ECX, ECX > okay: mark number of args = 0 What, no complaints about not doing loop-invariant code motion here? :) > ; 6C7: 48896AF8 MOV [RDX-8], RBP > ; 6CB: 488BEA MOV RBP, RDX > could possibly eliminate base pointer, and use externally maintained > PC-to-stack-depth mapping, like Dwarf3 does for other languages? This is certainly possible, but it would be quite a bit of work. I don't think we'd want full-blown Dwarf3 information; we could probably get by with something that's a bit simpler. I think it also might make throwing somewhat more expensive. > ; 6CE: FF5009 CALL QWORD PTR [RAX+9] > Would be nice if it wasn't an indirect call. It seems possible that > code components could directly call target code. Hypothesizing an > implementation: (setf (symbol-function 'foo)) would mutate the first > instruction (which starts out as a nop) in the target to be a jump to > the new code (and GC would notice and fixup that forwarding sometime > later). (symbol-function 'foo) would need to point to the instruction > after the jump-or-nop so as to not be affected by above mutation. You'd need several nops at the beginning of the function for overwriting--on x86, you'd need five--and you'd need to teach the garbage collector about these new absolute addresses you're introducing into the code. Direct calls on x86-64, in the general case, would be pretty expensive: a 64-bit MOV followed by a JMP-to-register. It's not obvious that this would be any more efficient. It's possible I am misinterpreting what your suggested implementation is, though. > ; 6D9: 488D4B08 LEA RCX, [RBX+8] > ; 6DD: 488BD9 MOV RBX, RCX > this looks like a stupid sequence of instructions. Why isn't this > just an add? I'm guessing this is from poor register allocation and/or bad interaction with the argument/result save/restore code generated by the fixnum + VOP. (If you look at the IR2, this is probably a single VOP where the result has a load-tn.) > ; 6E0: L1: 4881FB80380100 CMP RBX, 80000 > ; 6E7: 7CCA JL L0 > compare and jump. well, really the loop should have been inverted to > count down from 9999 to 0, so that the flags from the add can be used > directly in a JS instruction. The high-level comment at the beginning of my email applies here, but doing the loop reversal is...doable, I suppose, but it'd involve a bit of hacking on the IR1 plus some analysis. Not sure how pleasant it would be to do that sort of surgery on IR1. Being smart about flags is certainly possible with what we have now, but I think we'd have to make the CMP above not a jump target to make things work right generally. (Basically, we have flags in the EFFECTS/AFFECTED slot of VOPs for each of the flag bits. Comparison VOPs then check the previous VOP to see what flags were set, if any, and decide whether to emit the CMP accordingly. Modeling flag usage for things like the + VOPs doesn't work right because we can emit LEA or ADD depending on the operands, but I think we'd only care about - VOPs for starters anyway.) -Nathan |