On Sun, May 01, 2005 at 05:46:31PM -0400, James Y Knight wrote:
> I'm confused by this solution. Why is it that the above code causes a
> full call (or now, extra shifts), whereas the minor modification below
> produces the expected code on new and old SBCLs. Why should having more
> information available cause the optimizer to produce *worse* code than
> when less information is available? I think the real bug isn't the lack
> of those VOPs you added, but rather that the optimizer has for some
> reason increased the data range from fixnum to (unsigned-byte 64) when
> it sees the constant second argument to the logand.
The code in question (deleted for brevity's sake) looks something like:
(defun foo (x y)
(declare (optimize (speed 3) (debug 0) (space 0)))
(declare (type fixnum x y))
(logand (logxor x y) #.(1- (ash 1 18))))
This code does not compile to the expected XOR/AND sequence with current
SBCL (although Alexey committed changes to make it not be horrible on
x86). If you compile the above code, you will get a warning (on ppc):
; note: forced to do full call
; unable to do inline fixnum arithmetic (cost 2) because:
; The result is a (VALUES (UNSIGNED-BYTE 32)
; &OPTIONAL), not a (VALUES FIXNUM &REST T).
; unable to do inline (unsigned-byte 32) arithmetic (cost 3) because:
; The first argument is a FIXNUM, not a (UNSIGNED-BYTE 32).
; The second argument is a FIXNUM, not a (UNSIGNED-BYTE 32).
; compilation unit finished
; printed 1 note
As James notes, the compiler seems to be promoting the result type
of the LOGXOR to (UNSIGNED-BYTE 32) which makes it unable to generate
good code for this case. For posterity's sake, what follows is an
explanation of why this is the case.
Modular arithmetic is triggered by the LOGAND optimizer in compiler/srctran.
We're only going to talk about unsigned modular arithmetic here, but the
principles are the same for signed modular arithmetic. The LOGAND
optimizer examines the derived result type of the call to LOGAND. If
the result type is unsigned and bounded, the maximum width (INTEGER-LENGTH)
of the result type is calculated and then each argument to the LOGAND
is "cut" to that width via SB-C::CUT-TO-WIDTH, also in compiler/srctran.
(There's a FIXME in the LOGAND optimizer; I don't understand why that
FIXME is there, though.)
SB-C::CUT-TO-WIDTH takes a node, a modular class (:UNSIGNED or :SIGNED),
and a width, which is the width calculated above. What CUT-TO-WIDTH
attempts to do is find a function call inside the passed node where the
function is a modular function (defined via DEFINE-MODULAR-FUN
and friends in compiler/generic/vm-macs). Roughly speaking, a modular
function is one that is implemented in hardware via a single machine
instruction. Machines generally operate mod their word size, hence the
name "modular function." CUT-TO-WIDTH is also looking for a modular
function that has a "width" less than or equal to the width passed in--
that is, that the function call can be implemented via a single machine
instruction. A call such as (LOGAND ... (1- (ash 1 75))) will not be
subject to CUT-TO-WIDTH's transformations, as SBCL doesn't run on machines
with a word size of >= 75 bits.
Once such a function call is found (the LOGXOR in the example above),
CUT-TO-WIDTH substitutes the name of the modular function in the function
call. This effectively transforms the body of our example into:
(logand (sb-vm::logxor-mod32 x y) #.(1- (ash 1 18)))
SB-VM::LOGXOR-MOD32 works exactly like LOGXOR, except that LOGXOR-MOD32
effectively applies an (ldb (byte 32 0) ...) to its result (hence the
"MOD32" part). This would be slow in normal code, but if we can compile
this call to a single machine instruction, the machine will do all of the
hard work for us. CUT-TO-WIDTH then flags the function call (node) as
needing reoptimization--re-deriving types, etc. etc.
This is where the problems start. The type derivation function for
a modular function such as SB-VM::LOGXOR-MOD32 is generated by
MAKE-MODULAR-FUN-TYPE-DERIVER in compiler/srctran. Type derivation
for modular functions can be done in two stages:
1) Deriving the type of the non-modular function (LOGXOR in this case);
2) Deriving the type of LOGAND (the modulus step) with that type and
the type of the mask (#xffffffff in this case).
These are the steps taken by the function generated by
MAKE-MODULAR-FUN-TYPE-DERIVER. The problem is in step 2. We don't
really want to take (LOGXOR X Y) modulo 2^32-1 before we LOGAND it
with 2^18-1. We'd rather that the type derivation of SB-VM::LOGXOR-MOD32
(in this case, at least) be exactly the same as that for LOGXOR.
The modularizing will take place via the call to LOGAND. So the
modular arithmetic framework is being less precise about deriving
the types of variables (at least in this case).
The upshot is exactly what James saw. The call to SB-VM::LOGXOR-MOD32
(which occurs "under the hood") takes two FIXNUMs and returns an
(UNSIGNED-BYTE 32) result. There's no VOP to implement this, and the
compiler says so. Ergo, a full call has to be generated for what looks
like simplistic code. And I think, although I don't have a lisp to test
with, that an "old-school" CMUCL or SBCL without modular arithmetic
would nail this code.
I do not know an easy way to fix this problem. The first thing that
came to mind (and that I tried this morning) was to preserve the fact
that we're LOGAND'ing the result of the LOGXOR call with 2^18-1 in
CUT-TO-WIDTH and use that type in the type derivation for
SB-VM::LOGXOR-MOD32, instead of a full word-width mask. That solution
died in warm init and, upon further reflection, seemed dodgy.
Another solution might be to convert:
(LOGAND (LOGXOR X Y) (1- (ash 1 18)))
to something like:
(LOGXOR-MOD X Y (1- (ash 1 18)))
where you can have better type-derivation, because you're not throwing
information away and you're concentrating all the necessary information
within the scope of a single function call. (Python has trouble, as has
been previously noted on this list, seeing patterns spanning multiple
function calls, whereas it's pretty good optimizing a single function
call.). As a (small) bonus, various types of field extraction (and
insertion?) would be exposed on the PPC and could be more easily
compiled to a single rlwinm instruction. On the downside, this is some
signficant monkeying with the guts of the compiler. It also carries
potential for serious code bloat, e.g
(LOGAND (OP ...) (OP ...) (OP ...) MASK)
would probably get turned into:
(LOGAND (OP-MOD ... MASK) (OP-MOD ... MASK) (OP-MOD ... MASK))
which might result in a lot of redundant AND'ing with MASK at the
machine code level.
So there's a treatise on how modular arithmetic works. I hope somebody
finds it useful at some point.
Nathan | From Man's effeminate slackness it begins. --Paradise Lost
The last good thing written in C was Franz Schubert's Symphony Number 9.