cmov is notoriously slow on P4s. Fortunately, there are workarounds: sbb,
which is also slow on most P4s, and sar, which usually isn't (1-6 cycles
of latency in 32-bit mode). If we were generating code for Athlons
only, this wouldn't be a problem, since all three opcodes have decent
latencies on K7/K8.
http://swox.com/doc/x86-timing.pdf has some timing information on this.
Now, if I understand correctly, the cmov is there to ensure that the result
is zero when the shift count is lower than -31.
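To illustrate why some guard is needed at all, here is a rough C model (not SBCL code; both function names are made up for this sketch). It assumes the value is treated as an unsigned 32-bit word. x86 shr only looks at the low 5 bits of cl, so a bare shr by 32 or more does not produce zero by itself.

```c
#include <stdint.h>

/* Model of x86 `shr r32, cl`: the hardware uses only the low 5 bits
   of CL, so a count of 32 behaves like a count of 0. */
static uint32_t x86_shr(uint32_t value, uint32_t cl)
{
    return value >> (cl & 31);
}

/* What the shift should produce: zero once every bit has been
   shifted out (count >= 32). */
static uint32_t logical_shr(uint32_t value, uint32_t count)
{
    return count < 32 ? value >> count : 0u;
}
```

For counts below 32 the two agree. At a count of 32, x86_shr returns the value unchanged, whereas logical_shr returns 0, which is the gap the cmov (or a replacement) has to cover.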
I would suggest trying this generator, which should be better on most
Pentiums and not much worse on Athlons, and which also saves a register. It
also has the advantage of working on CPUs without cmov (admittedly, those
usually don't have such indecently long pipelines, but it still removes
some pressure on the BTB).
[compiler/x86/arith.lisp Line 896]
(:generator 4 ; no idea what the cost should be
  (move result number)
  (move ecx amount)
  (inst or ecx ecx)
  (inst jmp :ns positive)
  (inst neg ecx)
  (inst shr result :cl)
  (inst sub ecx 32) ; ecx negative iff count < 32 *
  (inst sar ecx 31) ; ecx = -1 iff count < 32, 0 otherwise **
  (inst and result ecx)
  (inst jmp done)
  ;; * Could use an additional register to allow the sub and the shr to
  ;;   execute in parallel, but sub is only 1 cycle or so. sar and shr
  ;;   unfortunately use the same execution units on every platform, iirc.
  ;; ** Sanity check, anyone? I wouldn't want to be off by one here, but
  ;;   since signed numbers basically only have 31 bits of significand
  ;;   (+ sign bit), this sounds right to me.
  POSITIVE
  ;; Older comment says the result-type ensures it won't overflow
  (inst shl result :cl)
  DONE)
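As a sanity check on the off-by-one question, the negative-count path can be modelled in C. This is only a sketch, not SBCL code: shr_masked is a made-up name, the value is assumed to be an unsigned 32-bit word, and count stands for ecx after the neg (so between 1 and 2^31).

```c
#include <stdint.h>

/* Model of the negative-count path: shr, sub 32, sar 31, and.
   `count` is the already negated shift amount (assumed <= 2^31). */
static uint32_t shr_masked(uint32_t value, uint32_t count)
{
    /* shr result, cl: only the low 5 bits of the count are used. */
    uint32_t shifted = value >> (count & 31);

    /* sub ecx, 32: wraps around, so the sign bit is set iff count < 32. */
    uint32_t diff = count - 32u;

    /* sar ecx, 31: replicate the sign bit across the whole word,
       giving all ones iff count < 32 and zero otherwise. */
    uint32_t mask = 0u - (diff >> 31);

    /* and result, ecx */
    return shifted & mask;
}
```

Under these assumptions the mask is all ones for counts 0 through 31 and zero from 32 up, including the 2^31 that neg produces for the most negative amount, so the boundary sits where it should.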
I believe that with some more cleverness, it is possible to eliminate the
remaining branch, but that would only be useful on higher-clocked CPUs
when the count's sign is unpredictable.
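One possible shape for a fully branch-free version, sketched in C rather than as a VOP. This is a hypothetical illustration, not a tested generator: ash_branchless is a made-up name, the value is assumed to be an unsigned 32-bit word, and positive amounts are assumed to be below 32 and not to overflow, as the older comment promises. It computes both shifts and selects one with a mask built from the sign of the amount.

```c
#include <stdint.h>

/* Hypothetical branch-free shift: left for amount >= 0, right otherwise. */
static uint32_t ash_branchless(uint32_t value, int32_t amount)
{
    uint32_t a = (uint32_t)amount;

    /* All ones iff amount is negative (what sar by 31 would give). */
    uint32_t neg = 0u - (a >> 31);

    /* Absolute value without a branch: (a xor neg) - neg. */
    uint32_t count = (a ^ neg) - neg;

    /* Left shift, used when amount >= 0 (assumed < 32). */
    uint32_t left = value << (count & 31);

    /* Right shift, zeroed when count >= 32 via the sub/sar style mask. */
    uint32_t right = (value >> (count & 31)) & (0u - ((count - 32u) >> 31));

    /* Select by sign of the amount. */
    return (right & neg) | (left & ~neg);
}
```

This does strictly more work than the branching version on every call, so whether it is a win would have to be measured; it only stands a chance when the sign of the count is hard to predict.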