From: William C. <wil...@gm...> - 2013-11-09 07:25:40
Attachments:
mcmc.c
mcmc.sbcl.lisp
|
I'm trying to coerce SBCL into competing with C on a simple form of MCMC. Using the library-provided implementations of random, SBCL is only 50-ish% slower than GCC (~2.1s versus ~1.5s), which is incredible, because GLIBC's random should just be the typical multiply-accumulate, whereas SBCL seems to be using the Mersenne twister, according to the docs.

The reason seems to be that GCC can't inline code in LIBC; for MCMC code, doing a procedure call to generate a random number (particularly if the algorithm in question is just a multiply+accumulate) is way overkill. Inlining the random number generator manually produces a significant speed boost. I played with multiple variations until settling on Knuth's suggested LC-PRNG for 64-bit machines. (But merely cribbing java.util.Random, which is geared for 32-bit architectures, also produces a similar speed boost.) Inlining Knuth's generator drops the gcc time down to 0.83s, making SBCL slower by a factor of 2.5 or so.

Of course, that is hardly a fair comparison of SBCL versus GCC, because the algorithms for random number generation still aren't the same. But it is an important step in making things realistic; for something like MCMC, of course we want to inline random number generation.

So then I tried to backport the C implementation of Knuth's PRNG to SBCL. In so doing, I slowed down the SBCL implementation by about 11%. That is despite using a more hardware-efficient algorithm, and making it extremely evident that I want it inlined into the MCMC loop (though I haven't yet gone to the level of declaring the random procedures as macros...).

That is a surprising outcome. I've tried now for 10, maybe even 20 hours to convince SBCL to compile this to efficient assembly, but I just can't seem to convince it that 64-bit arithmetic can be done primitively. I'm hoping that either an expert can show me how to do this in the current SBCL, or, failing that, that you can let me know that this is a known limitation of SBCL.
--

Looking at the disassembly suggests I might be able to get improvements by redefining all the inline functions as macros; there are sequences like:

; E25A: 488BC1           MOV RAX, RCX
; E25D: 48F7E2           MUL RAX, RDX
; E260: 488BC8           MOV RCX, RAX
...<no reads of RAX>
; E290: 498B4621         MOV RAX, [R14+33]

which looks suspiciously like an artifact of inlining after register allocation. But I suspect that the cost of such redundant register moves is very, very small on my hardware; certainly not enough to explain the difference from what GCC produces.

The much more likely culprit is just that the state of the generator is represented as a lisp object boxing a u64, rather than as an unboxed value. In particular, the MCMC loop contains what appear to be either/both (a) useless overflow checks, or (b) boxing operations:

; E263: 48030D26030000       ADD RCX, [RIP+806]  ; #x14057B7EF767814F
; E26A: 48BA00000000000000C0 MOV RDX, -4611686018427387904
; E274: 4885CA               TEST RCX, RDX
; E277: 488D1409             LEA RDX, [RCX+RCX]
; E27B: 740C                 JEQ L36
; E27D: 488BD1               MOV RDX, RCX
; E280: 41BB3A090020         MOV R11D, 536873274 ; ALLOC-UNSIGNED-BIGNUM-IN-RDX
; E286: 41FFD3               CALL R11
; E289: L36: 4C8B3548F8FFFF  MOV R14, [RIP-1976] ; '*R*
; E290: 498B4621             MOV RAX, [R14+33]

Even though I've specified (or tried to, at any rate) that *R* can only hold 64 bits, and that all intermediate calculations are to be cut down to that many (or fewer) bits... Indeed, the statistics reported by SBCL indicate that memory allocation is actually taking place, so that is a likely culprit.

Another reasonable theory is that the cost of converting to/from doubles is crucial; for example, changing (random 1.0) to (random 1.0d0) when using SBCL's random slows down the microbenchmark from 2.1 seconds to 3.2 seconds.

-Will
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.2-2ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1)

$ sbcl --version
SBCL 1.1.13

$ uname -a
Linux ubuntu 3.5.0-43-generic #66-Ubuntu SMP Wed Oct 23 12:01:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
|
From: Douglas K. <do...@go...> - 2013-11-09 15:25:16
|
--- mcmc.lisp.orig	2013-11-09 10:09:21.000000000 -0500
+++ mcmc.sbcl.lisp	2013-11-09 10:20:13.000000000 -0500
@@ -50,8 +50,11 @@
 ;;;; Convenient interface ;;;;;;;;;;;;;;;;;;;;;;;;;;;;

-(declaim (type generator *r*))
-(defvar *r* (make-generator))
+;(declaim (type generator *r*))
+;(sb-ext:defglobal *r* (make-generator))
+(define-symbol-macro *r* my-generator)
+;(defmacro with-generator (var &body body)
+;  `(let ((,var (make-generator))) ,@body))

 (declaim (inline ->double_biased ->double/52 ->double/53))
@@ -98,24 +101,24 @@
 ;; calling double-float a real is a misnomer, likewise for nat(ural), but popular...
 (declaim (inline uniform_real uniform_nat uniform_bit))

-(declaim (ftype (function () double-float) uniform_real))
-(defun uniform_real ()
+;(declaim (ftype (function () double-float) uniform_real))
+(defmacro uniform_real ()
   ;;(->double/biased *r*)
-  (->double/52 (setf *r* (random_next *r*)))
+  `(->double/52 (setf *r* (random_next *r*)))
   )

 ;; uniform nat takes from the low bits, which is bad.
 ;; taking from the high bits is more effective, but "difficult" to do (efficiently) in non-power-of-two cases
-(declaim (ftype (function (uint64_t) uint64_t) uniform_nat))
-(defun uniform_nat (m)
-  (mod (setf *r* (random_next *r*)) m)
+;(declaim (ftype (function (uint64_t) uint64_t) uniform_nat))
+(defmacro uniform_nat (m)
+  `(mod (setf *r* (random_next *r*)) m)
   )

 ;; better distributed than (uniform_nat 1), because it takes the high bit
 ;; logand is redundant, but possibly helpful for SBCL type inference
-(declaim (ftype (function () bit) uniform_bit))
-(defun uniform_bit ()
-  (logand (select (setf *r* (random_next *r*)) 1) 1)
+;(declaim (ftype (function () bit) uniform_bit))
+(defmacro uniform_bit ()
+  `(logand (select (setf *r* (random_next *r*)) 1) 1)
   )
@@ -143,7 +146,9 @@
 ;; Run MCMC on the simple Burglary example (from Norvig+Russell?), using Knuth's PRNG
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 (declaim (ftype (function (uint64_t) (values (simple-array uint64_t (2)))) compiled-mcmc))
-(defun compiled-mcmc/custom (N)
+(defun compiled-mcmc/custom (N &aux (my-generator (make-generator)))
+  (declare (optimize (debug 0) (safety 0))
+           ((unsigned-byte 64) my-generator))
  (let ((Ncounts (make-array 2 :element-type 'uint64_t
|
From: Jan M. <jmo...@te...> - 2013-11-09 16:06:37
|
On Sat, 2013-11-09 at 10:25 -0500, Douglas Katzman wrote:
> Nice code, but at a glance I can see that your problem is that the global
> symbol *R* is used as your generator state, therefore forcing
> materialization of all potential bignums.
> I hacked it up a little bit and halved the runtime of (bench 50000000).
>
> In practice, you should make a macro
>   (WITH-GENERATOR (R) ...)
> which expands to provide UNIFORM_REAL, UNIFORM_NAT, UNIFORM_BIT as lexical
> macros within its body.
> ("Do as I say, not as I do")
>
> Attached diffs will get you started.

When noticing the defvar, I had a similar idea. I went with local functions which, on my system, also achieved halving the runtime (and consing):

(defun compiled-mcmc/local-functions (N)
  (flet ((make-generator ()
           (logand 8682522807148012 +mask+)))
    (let ((r (make-generator)))
      (labels ((random_next (r)
                 (logand (+ (* r +multiplier+) +addend+) +mask+))
               (uniform_bit ()
                 (logand (select (setf r (random_next r)) 1) 1))
               (->double/52 (r)
                 (/ (ash r (- 52 +width+)) +expt-2-52d0+))
               (uniform_real ()
                 (->double/52 (setf r (random_next r)))))
        UNMODIFIED-REST-OF-THE-BODY))))

Kind regards,
Jan
|
From: Paul K. <pv...@pv...> - 2013-11-09 21:31:15
Attachments:
mcmc.lisp
|
William Cushing wrote:
> I'm trying to coerce SBCL into competing with C on a simple form of
> MCMC. Using the library provided implementations of random, SBCL is
> only 50-ish% slower than GCC (~2.1s versus ~1.5s), which is incredible,
> because GLIBC random should just be the typical multiply-accumulate,
> whereas SBCL seems to be using Mersenne twister, according to the docs.
>
> The reason seems to be that GCC can't inline code in LIBC;

That's not quite true. You can disassemble the code for a short fixed-size memcpy, for example. The reason is more that no gcc maintainer has found a use for such inlining.

[...]

> Inlining Knuth's generator drops the gcc time down to 0.83, making SBCL
> slower by a factor of 2.5 or so.

[...]

> So then I tried to backport the C implementation of Knuth's PRNG to
> SBCL. In so doing, I slowed down the SBCL implementation by about 11%.
> That is despite using a more hardware efficient algorithm, and making
> extremely evident that I want it to be inlined into the MCMC loop
> (though I haven't yet gone to the level of making the random procedures
> be declared as macros...).

As previously pointed out, there are two immediate issues here, both fixed by moving the state to a local variable: access to special variables is slow, and values in special variables are boxed. So, while you do get modular arithmetic, it drowns in boxing, unboxing, and dynamic variable lookups.

On a more general point, the original code was full of declarations. I've attached the version I would try to work with. There are no global optimisation declaims: I find it best to decide which inner loops are important, and leave the rest at default settings, particularly during development. There are also no (safety 0) declarations, and very few [3] type declarations.

The generic MCMC loop becomes

;;; Generic MCMC loops, parameterised on the routine to get a single
;;; random bit and random bits w/ P[bit = 0] = p0, the single argument.
;;;
;;; Runs for N iterations before returning the count vector.
(declaim (maybe-inline generic-mcmc))
(defun generic-mcmc (N uniform-bit biased-bit)
  (declare (type uint64-t N))
  [...])

and switching the PRNG is now trivial:

;; SBCL's MT
(defun mt-mcmc (N)
  (declare (optimize speed (safety 0))
           (inline generic-mcmc))
  (let ((state *random-state*))
    (generic-mcmc N
                  (lambda () (random 2 state))
                  (lambda (p0) (if (< (random 1d0 state) p0) 1 0)))))

;;; Knuth LCG
(defun lcg-mcmc (N)
  (declare (optimize speed)
           (inline generic-mcmc))
  (let ((state (make-generator)))
    (generic-mcmc N
                  (lambda () (select (setf state (random-next state)) 1))
                  (lambda (p0)
                    (let ((r (->double/52 (setf state (random-next state)))))
                      (if (> r p0) 1 0))))))

Running the C program on my machine (E5-4617) for 50000000 iterations takes 0.98 seconds. On the same machine, the MT version (SBCL 1.1.9.xx) takes 3.73 sec, and the LCG 1.25 sec.

I also added code to hook into dSFMT, a fast SIMD-oriented variant of MT that is geared toward generating large chunks of doubles. The dynamic library was built by executing

clang -fPIC -O3 -msse2 -DHAVE_SSE2 -finline-functions -fomit-frame-pointer -fno-strict-aliasing dSFMT.c -DDSFMT_MEXP=19937 --shared -o dSFMT-19937.so

in the dSFMT-2.2.2 source tree. With that stronger PRNG, the same benchmark takes only 1.49 seconds.

For the same very simple LCG, memory-safe code in SBCL is within 25% of clang on my machine… and makes it trivial to hook in a more robust PRNG at little overhead.

Paul Khuong
|
From: Paul K. <pv...@pv...> - 2013-11-09 21:36:32
|
Paul Khuong wrote:
[...]
> On a more general point, the original code was full of declarations.
> I've attached the version I would try to work with. There are no global
> optimisation declaim: I find it best to decide which inner loops are
> important, and leave the rest at default settings, particularly during
> development. There are also no (safety 0) declaration, and very few [3]
> type declarations.
[...]
> and switching the PRNG is now trivial:
> ;; SBCL's MT
> (defun mt-mcmc (N)
>   (declare (optimize speed (safety 0))

Cough (: That paste is from an older buffer in which I made sure that the declaration had basically no effect.

Paul Khuong
|
From: Jan M. <jmo...@te...> - 2013-11-09 22:05:25
Attachments:
mcmc-pkhuong.png
mcmc.benchmark.lisp
|
On Sat, 2013-11-09 at 16:30 -0500, Paul Khuong wrote:
> Running the C program on my machine (E5-4617) for 50000000 iterations
> takes 0.98 seconds. On the same machine, the MT version (SBCL 1.1.9.xx)
> takes 3.73 sec, and the LCG 1.25 sec.

I put the different variants into an sb-benchmarks [1] benchmark and produced the attached result (note the differences in consing (green) and code size (blue)). This is on an older x86 machine.

Kind regards,
Jan

[1] https://github.com/scymtym/sb-benchmarks
|
From: Paul K. <pv...@pv...> - 2013-11-09 22:10:22
|
Jan Moringen wrote:
> On Sat, 2013-11-09 at 16:30 -0500, Paul Khuong wrote:
>> Running the C program on my machine (E5-4617) for 50000000 iterations
>> takes 0.98 seconds. On the same machine, the MT version (SBCL 1.1.9.xx)
>> takes 3.73 sec, and the LCG 1.25 sec.
>
> I put the different variants into an sb-benchmarks [1] benchmark and
> produced the attached result (Note the differences in consing (green)
> and code size (blue)).
>
> This is on an older x86 machine.

The output format is nice, but I'm not sure the results mean anything: portions of the code assume 64 bit words.

Paul Khuong
|
From: Jan M. <jmo...@te...> - 2013-11-09 22:19:20
|
On Sat, 2013-11-09 at 17:10 -0500, Paul Khuong wrote:
> Jan Moringen wrote:
>> I put the different variants into an sb-benchmarks [1] benchmark and
>> produced the attached result (Note the differences in consing (green)
>> and code size (blue)).
>>
>> This is on an older x86 machine.
>
> The output format is nice, but I'm not sure the results mean anything:
> portions of the code assume 64 bit words.

Yeah … I have to admit this was mainly intended as a cheap plug :)

But out of curiosity: at which point does the computation become wrong instead of just very slow?

Or does it just assume 64 bit words in the sense that performance will be terrible otherwise?

Kind regards,
Jan
|
From: Paul K. <pv...@pv...> - 2013-11-09 22:21:02
|
Jan Moringen wrote:
> On Sat, 2013-11-09 at 17:10 -0500, Paul Khuong wrote:
>> The output format is nice, but I'm not sure the results mean anything:
>> portions of the code assume 64 bit words.
>
> Yeah … I have to admit this was mainly intended as a cheap plug :)
>
> But out of curiosity: at which point does the computation get wrong
> instead of just very slow?
>
> Or does it just assume 64 bit words in the sense that performance will
> be terrible otherwise?

I'm pretty sure it's just the latter… especially now that the code is safe (:

Paul
|
From: Paul K. <pv...@pv...> - 2013-11-12 07:58:45
|
William Cushing wrote:
> I would not have suspected a global var of being an automatic
> performance hit in an otherwise evidently machine-aware implementation.
> I guess that's a difference due to supporting GC.

More or less. Mostly, this is about having a generic representation for polymorphic operations. Polymorphic but statically typed implementations like MLton show how to work around that with whole-program compilation.

> Certainly for parallelization purposes it is better anyways to pass the
> random state through parameters...

Specials are not global and also double as thread-local bindings.

> I was also considering flattening the CPTs; I was hoping that would be
> unnecessary as far as access costs go. Common LISP does permit nested
> arrays to be implemented as contiguous blocks of storage in, for
> example, row-major order, right?
>
> It is just that SBCL is making life easier for itself by treating arrays
> of arrays with a level of run-time indirection?

CL pretty much mandates contiguous storage in row-major order. It also forces support for some sort of indirection, because of displaced arrays. SBCL obeys these two masters by representing non-vector arrays with a header that points into a vector in which data is stored in row-major order. This indirection happens even for simple arrays (but not simple vectors). It's also a single indirection, always to the same location, so only a very low-level (lower than caching, for instance) issue. It would probably not be *too* hard of a change to instead represent such non-vector simple-arrays as

  [header][rank][shape …][data …],

but, like I said, it's a very low-level issue.
> Concerning the array inside the following type:
>
> (define-alien-type dsfmt-t
>   (struct dsfmt-t
>     (status (array double-float #.(* 2 (1+ +dsfmt-n+))))
>     (idx int)))
>
> It looks like such alien types can be instantiated from the lisp side in
> the obvious way, but one has to use:
>   sb-sys:vector-sap and sb-sys:with-pinned-objects
> whenever passing to the C side?

No. That's for passing a specialised lisp array to C. This is unrelated to the inline sequence of doubles in the dsfmt_t struct.

> Is there any vaguely more portable/standard common lisp idiom for
> passing off stuff to C land, i.e., for inhibiting GC and grabbing raw
> pointers?

There's static-vector. I believe LiamH also has some stuff in Antik (?) to implement pinning on a couple of implementations, either with pinning constructs, for free on non-moving implementations, or by disabling GC.

> (Is there some macro package that abstracts over, say, Allegro CL's
> interface and SBCL's interface? Yes, I could build it myself, but that
> wouldn't exactly fit the bill of "at least vaguely portable/standard".))

CFFI implements a common foreign function interface on top of pretty much all contemporary implementations.

> I wanted to give SBCL maximum opportunity to compete with C, and since I
> of course had my C implementation on hand, I just edited the syntax of
> all the declarations/compiler-flags to fit SBCL's syntax for the same
> knowledge.

[...]

> However, the mission here is not functionality, but raw performance.
> The point is to test the feasibility of using different systems as
> either/both target/implementation languages for reimplementing a
> high-level probabilistic modeling language called BLOG. BLOG, short for
> Bayesian Logic, is a cool notion...
> http://sites.google.com/site/bloginference/

[...]

> One obvious idea is to add compiler functionality. I'd like to see the
> implementation language of the compiler be a lisp.
> I'd also like, if at
> all possible, for the target language to be a lisp as well (because
> generating lisp from lisp is by far the easiest form of code generation).

OK. I really don't know ACL's performance characteristics, but, if you want SBCL to give you good code, switch to lexical variables and functions. CL offers a lot of late binding to programmers, and that interferes with compiling for performance. The SBCL model is basically to try and be clever within stuff that can't be rebound (e.g. a single function), but not across (e.g. two distinct top-level functions). Type declarations can help a bit, but not overly much: full calls and returns always obey the same representation-generic calling convention, and other externally visible values are always boxed in the generic representation. This is partly caused by the specification and not just lack of manpower, but that's another discussion. (I'm skipping compilation units because SBCL ignores them anyway.)

This can all be avoided by moving everything to local bindings (LET/LET* and FLET/LABELS), a style that is somewhat annoying to work with by hand -- although it helps to break definitions up only to inline them in the final body, as I did in my code snippet -- but easily generated. The effect is that a lot of flow analyses will come for free, but, more importantly, that all representations will be specialised, much like in a C program. Word-sized integers will be represented as either fixnums or unboxed signed/unsigned words, and doubles and singles as raw machine floats; in all cases, they will be stored and operated on in registers, and spilled to the stack if necessary, as native values. As long as nothing is passed to an outside (Lisp) function, variables can use native representations if it makes sense (e.g. symbols will still be pointers to symbols).
Pointers to allocated objects will always remain pointers, though: much like C compilers from the 80's or early 90's, SBCL doesn't convert aggregates (structures or small arrays) into a fixed number of local variables, even when passed as arguments. However, arrays and structures can be declared to hold unboxed values; that's not contagious.

The only other objects with generic representations will be the arguments for the ball-of-definitions function and its return value(s). The former don't matter too much: they should always be unboxed in the function prologue if necessary. The latter might… except if arrays are returned: arrays are always pointers (boxed), but their contents are unboxed depending on their element type.

One big advantage is that, for code that doesn't really manipulate first-class functions (some can be compiled away via constant propagation), each local function gets a specialised call and return convention. The convention will use however many registers are necessary for both arguments and (multiple) return values, and surrounding lexical variables don't even have to be shuffled about. I don't think state-of-the-art C compilers do that, even for file-local functions; it comes for free given the way functions are represented internally, at the expense of some bad asymptotics on compile times.

Contrary to the aforementioned older compilers, SBCL tends to fare better on code that declares fresh lexical bindings and returns multiple values aggressively, rather than code that assigns a lot to a few variables that surround everything. So, the shape of generated functions should probably be something like

(let ([parameters and variables that are live across MCMC loops])
  (declare [type declarations for parameters, assignment targets, and opaque computations])
  (labels ([functions])
    (declare [ftype declarations])
    (loop ... ; loop repeat counts down, which saves a register.
      [nested let/let*/labels]
      [assign loop-carried values])))

With such a structure, it usually suffices to declare types for argument types and hard-to-deduce return types (including local functions), and for variables that are assigned to (or with hard-to-deduce return types). SBCL and CMUCL seem to be in the minority of CL implementations in that they really can exploit range information: types aren't only about representation, and it is useful to, e.g., declare that a double-float value is actually strictly positive with (double-float (0d0)). This would let LOG and SQRT be computed directly, without worrying about complex return values. Similar things happen for floats of small magnitude.

[...]

> Getting the inner loop (MCMC) as fast as possible is reasonably
> important. As far as target language goes, 10% or 20% slower than C is
> perhaps good enough, but 100% slower is not.

[...]

ISTM most of that will quickly drown in the generation of random variates, and that doesn't need much code generation… This test case is already strongly affected by a switch to dSFMT, so this 25% difference ought to go down very quickly. Still, there are also tools that pretty-print s-expressions into C; this might be a more easily portable (and maintained) alternative. If that's not enough, you might want to look into running multiple loops in lockstep with SIMD processing (e.g., with SBCL's support for SSE). That'll definitely move the bottleneck to the generation side ;)

> Thanks for the prompt response and hard work! Very instructive perusing
> through all the code...

My pleasure; I had an hour to kill.

BTW, there's an initialisation bug in the dSFMT version: either the buffer should be prefilled with random values, or the initial index set to a large value.

Paul Khuong
|
From: William C. <wil...@gm...> - 2013-11-12 13:11:31
|
Thanks again! This is a gold mine :)

-Will

On Mon, Nov 11, 2013 at 11:58 PM, Paul Khuong <pv...@pv...> wrote:
> [full quote of the previous message elided]
|
From: Liam H. <ln...@he...> - 2013-11-12 17:50:35
|
On Tue, Nov 12, 2013 at 2:58 AM, Paul Khuong <pv...@pv...> wrote:
> William Cushing wrote:
>
>> Is there any vaguely more portable/standard common lisp idiom for
>> passing off stuff to C land, i.e., for inhibiting GC and grabbing raw
>> pointers?
>
> There's static-vector. I believe LiamH also has some stuff in Antik (?)
> to implement pinning on a couple implementations, either with pinning
> constructs, for free on non-moving implementations, or by disabling GC.

Antik is now built on top of static-vector (predecessor systems used pinning, but that was SBCL-specific, and I decided static-vector was a more portable choice). It is optional; the fallback in its absence is to copy over the array. However, by essentially "living in the foreign world" (i.e., not using the arrays on the CL side) you can avoid doing that. I try to make it easy to do so by providing higher-level functions that access the arrays on the foreign side. Other than that, I use trivial-garbage to handle foreign GC.

Liam
|