Re: [Sbcl-devel] Ugly assembly

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

On 10/12/07, James Y Knight <fo...@fu...> wrote:
> Taking a very simple example:
>
> (declaim (optimize (safety 0) (speed 3) (debug 0)))
> (defun foo ())
> (defun test () (dotimes (i 10000) (foo)))
>
> And here's the code generated on x86-64, with sbcl 1.0.10.45. I
> interspersed some comments; it seems to me that almost every
> instruction generated is suboptimal in some way or another. Of
> course, fixing much of it doesn't seem like it'll be particularly
> easy... Comments? (I hope my lack of full understanding of how SBCL
> works doesn't show through too much here. :)

As a high-level note: some of your concerns could be addressed
(specifically, the JMP to start the loop) by making DOTIMES a bit
smarter about macro-expansion.  Doing so wouldn't help cases like
(LOOP FOR I FROM 16 BELOW 64 ...) -- unless you made those smarter as
well -- but it'd be a start, and quite a bit easier than significant
changes to the IR1/IR2 optimizers.

I don't claim to address all your concerns perfectly--it's late here,
for one thing, so I may not be thinking straight. :)

> ;      6B3: L0:   48895DE8         MOV [RBP-24], RBX
> save 'i' to stack, this shouldn't be necessary, 'i' should be stored
> in a callee saved register (are there any in the calling convention?)

I believe all non-argument passing registers are caller-saved.

> ;      6B7:       488BD4           MOV RDX, RSP
> ;      6BA:       4883EC18         SUB RSP, 24
> why is RSP decremented here instead of on entry to this function (or
> not at all)?

Because of the manner in which function calls are expanded from IR1 to
IR2 and the lack of loop invariant code motion in IR2.

> ;      6C5:       31C9             XOR ECX, ECX
> okay: mark number of args = 0

What, no complaints about not doing loop-invariant code motion here? :)

> ;      6C7:       48896AF8         MOV [RDX-8], RBP
> ;      6CB:       488BEA           MOV RBP, RDX
> could possibly eliminate base pointer, and use externally maintained
> PC-to-stack-depth mapping, like Dwarf3 does for other languages?

This is certainly possible, but it would be quite a bit of work.  I
don't think we'd want full-blown Dwarf3 information; we could probably
get by with something that's a bit simpler.  I think it also might
make throwing somewhat more expensive.

> ;      6CE:       FF5009           CALL QWORD PTR [RAX+9]
> Would be nice if it wasn't an indirect call. It seems possible that
> code components could directly call target code. Hypothesizing an
> implementation: (setf (symbol-function 'foo)) would mutate the first
> instruction (which starts out as a nop) in the target to be a jump to
> the new code (and GC would notice and fixup that forwarding sometime
> later). (symbol-function 'foo) would need to point to the instruction
> after the jump-or-nop so as to not be affected by above mutation.

You'd need several nops at the beginning of the function for
overwriting--on x86, you'd need five--and you'd need to teach the
garbage collector about these new absolute addresses you're
introducing into the code.  Direct calls on x86-64, in the general
case, would be pretty expensive: a 64-bit MOV followed by a
JMP-to-register.  It's not obvious that this would be any more
efficient.  It's possible I am misinterpreting what your suggested
implementation is, though.

> ;      6D9:       488D4B08         LEA RCX, [RBX+8]
> ;      6DD:       488BD9           MOV RBX, RCX
> this looks like a stupid sequence of instructions. Why isn't this
> just an add?

I'm guessing this is from poor register allocation and/or bad
interaction with the argument/result save/restore code generated by
the fixnum + VOP.  (If you look at the IR2, this is probably a single
VOP where the result has a load-tn.)

> ;      6E0: L1:   4881FB80380100   CMP RBX, 80000
> ;      6E7:       7CCA             JL L0
> compare and jump. well, really the loop should have been inverted to
> count down from 9999 to 0, so that the flags from the add can be used
> directly in a JS instruction.

The high-level comment at the beginning of my email applies here, but
doing the loop reversal is...doable, I suppose, but it'd involve a bit
of hacking on the IR1 plus some analysis.  Not sure how pleasant it
would be to do that sort of surgery on IR1.  Being smart about flags
is certainly possible with what we have now, but I think we'd have to
make the CMP above not a jump target to make things work right
generally.  (Basically, we have flags in the EFFECTS/AFFECTED slot of
VOPs for each of the flag bits.  Comparison VOPs then check the
previous VOP to see what flags were set, if any, and decide whether to
emit the CMP accordingly.  Modeling flag usage for things like the +
VOPs doesn't work right because we can emit LEA or ADD depending on
the operands, but I think we'd only care about - VOPs for starters
anyway.)

-Nathan

Re: [Sbcl-devel] Ugly assembly

Common Lisp compiler and runtime

Re: [Sbcl-devel] Ugly assembly