From: William C. <wil...@gm...> - 2013-11-09 07:25:40
Attachments:
mcmc.c
mcmc.sbcl.lisp
|
I'm trying to coerce SBCL into competing with C on a simple form of MCMC. Using the library-provided implementations of random, SBCL is only 50-ish% slower than GCC (~2.1s versus ~1.5s), which is incredible, because GLIBC's random should just be the typical multiply-accumulate, whereas SBCL seems to be using the Mersenne twister, according to the docs.

The reason seems to be that GCC can't inline code in LIBC; for MCMC code, doing a procedure call to generate a random number (particularly if the algorithm in question is just a multiply+accumulate) is way overkill. Inlining the random number generator manually produces a significant speed boost. I played with multiple variations until settling on Knuth's suggested LC-PRNG for 64-bit machines. (But merely cribbing java.util.Random, which is geared for 32-bit architectures, also produces a similar speed boost.) Inlining Knuth's generator drops the gcc time down to 0.83s, making SBCL slower by a factor of 2.5 or so.

Of course, that is hardly a fair comparison of SBCL versus GCC, because the algorithms for random number generation still aren't the same. But it is an important step in making things realistic; for something like MCMC, of course we want to inline random number generation.

So then I tried to backport the C implementation of Knuth's PRNG to SBCL. In so doing, I slowed down the SBCL implementation by about 11%. That is despite using a more hardware-efficient algorithm, and making it extremely evident that I want it inlined into the MCMC loop (though I haven't yet gone to the level of declaring the random procedures as macros...).

That is a surprising outcome. I've tried now for 10, maybe even 20 hours to convince SBCL to compile this to efficient assembly, but I just can't seem to convince it that 64-bit arithmetic can be done primitively. I'm hoping that either an expert can show me how to do this in the current SBCL, or, failing that, that you can let me know that this is a known limitation of SBCL.
--

Looking at the disassembly suggests I might be able to get improvements by redefining all the inline functions as macros; there are sequences like:

; E25A: 488BC1           MOV RAX, RCX
; E25D: 48F7E2           MUL RAX, RDX
; E260: 488BC8           MOV RCX, RAX
...<no reads of RAX>
; E290: 498B4621         MOV RAX, [R14+33]

which looks suspiciously like an artifact of inlining after register allocation. But I suspect that the cost of such redundant register moves is very, very small on my hardware; certainly not enough to explain the difference from what GCC produces.

The much more likely culprit is just that the state of the generator is represented as a lisp object boxing a u64, rather than as an unboxed value. In particular, the MCMC loop contains what appear to be either/both (a) useless overflow checks, or (b) boxing operations:

; E263: 48030D26030000       ADD RCX, [RIP+806]  ; #x14057B7EF767814F
; E26A: 48BA00000000000000C0 MOV RDX, -4611686018427387904
; E274: 4885CA               TEST RCX, RDX
; E277: 488D1409             LEA RDX, [RCX+RCX]
; E27B: 740C                 JEQ L36
; E27D: 488BD1               MOV RDX, RCX
; E280: 41BB3A090020         MOV R11D, 536873274 ; ALLOC-UNSIGNED-BIGNUM-IN-RDX
; E286: 41FFD3               CALL R11
; E289: L36: 4C8B3548F8FFFF  MOV R14, [RIP-1976] ; '*R*
; E290: 498B4621             MOV RAX, [R14+33]

Even though I've specified (or tried to, at any rate) that *R* can only hold 64 bits, and that all intermediate calculations are to be cut down to that many (or fewer) bits... Indeed, the statistics reported by SBCL indicate that memory allocation is actually taking place, so that is a likely culprit.

Another reasonable theory is that the cost of converting to/from doubles is crucial; for example, changing (random 1.0) to (random 1.0d0) when using SBCL's random slows down the microbenchmark from 2.1 seconds to 3.2 seconds.

-Will
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.2-2ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1)

$ sbcl --version
SBCL 1.1.13

$ uname -a
Linux ubuntu 3.5.0-43-generic #66-Ubuntu SMP Wed Oct 23 12:01:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
|
From: Douglas K. <do...@go...> - 2013-11-09 15:25:16
|
--- mcmc.lisp.orig	2013-11-09 10:09:21.000000000 -0500
+++ mcmc.sbcl.lisp	2013-11-09 10:20:13.000000000 -0500
@@ -50,8 +50,11 @@
 ;;;; Convenient interface ;;;;;;;;;;;;;;;;;;;;;;;;;;;;

-(declaim (type generator *r*))
-(defvar *r* (make-generator))
+;(declaim (type generator *r*))
+;(sb-ext:defglobal *r* (make-generator))
+(define-symbol-macro *r* my-generator)
+;(defmacro with-generator (var &body body)
+;  `(let ((,var (make-generator))) ,@body))

 (declaim (inline ->double_biased ->double/52 ->double/53))
@@ -98,24 +101,24 @@
 ;; calling double-float a real is a misnomer, likewise for nat(ural), but popular...
 (declaim (inline uniform_real uniform_nat uniform_bit))

-(declaim (ftype (function () double-float) uniform_real))
-(defun uniform_real ()
+;(declaim (ftype (function () double-float) uniform_real))
+(defmacro uniform_real ()
   ;;(->double/biased *r*)
-  (->double/52 (setf *r* (random_next *r*)))
+  `(->double/52 (setf *r* (random_next *r*)))
   )

 ;; uniform nat takes from the low bits, which is bad.
 ;; taking from the high bits is more effective, but "difficult" to do (efficiently) in non-power-of-two cases
-(declaim (ftype (function (uint64_t) uint64_t) uniform_nat))
-(defun uniform_nat (m)
-  (mod (setf *r* (random_next *r*)) m)
+;(declaim (ftype (function (uint64_t) uint64_t) uniform_nat))
+(defmacro uniform_nat (m)
+  `(mod (setf *r* (random_next *r*)) m)
   )

 ;; better distributed than (uniform_nat 1), because it takes the high bit
 ;; logand is redundant, but possibly helpful for SBCL type inference
-(declaim (ftype (function () bit) uniform_bit))
-(defun uniform_bit ()
-  (logand (select (setf *r* (random_next *r*)) 1) 1)
+;(declaim (ftype (function () bit) uniform_bit))
+(defmacro uniform_bit ()
+  `(logand (select (setf *r* (random_next *r*)) 1) 1)
   )
@@ -143,7 +146,9 @@
 ;; Run MCMC on the simple Burglary example (from Norvig+Russell?), using Knuth's PRNG
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 (declaim (ftype (function (uint64_t) (values (simple-array uint64_t (2)))) compiled-mcmc))
-(defun compiled-mcmc/custom (N)
+(defun compiled-mcmc/custom (N &aux (my-generator (make-generator)))
+  (declare (optimize (debug 0) (safety 0))
+           ((unsigned-byte 64) my-generator))
  (let ((Ncounts (make-array 2 :element-type 'uint64_t
|
From: Jan M. <jmo...@te...> - 2013-11-09 16:06:37
|
On Sat, 2013-11-09 at 10:25 -0500, Douglas Katzman wrote:
> Nice code, but at a glance I can see that your problem is that the global
> symbol *R* is used as your generator state, therefore forcing
> materialization of all potential bignums.
> I hacked it up a little bit and halved the runtime of (bench 50000000).
>
> In practice, you should make a macro
>   (WITH-GENERATOR (R) ...)
> which expands to provide UNIFORM_REAL, UNIFORM_NAT, UNIFORM_BIT as lexical
> macros within its body.
> ("Do as I say, not as I do")
>
> Attached diffs will get you started.

When noticing the defvar, I had a similar idea. I went with local functions which, on my system, also achieved halving the runtime (and consing):

(defun compiled-mcmc/local-functions (N)
  (flet ((make-generator ()
           (logand 8682522807148012 +mask+)))
    (let ((r (make-generator)))
      (labels ((random_next (r)
                 (logand (+ (* r +multiplier+) +addend+) +mask+))
               (uniform_bit ()
                 (logand (select (setf r (random_next r)) 1) 1))
               (->double/52 (r)
                 (/ (ash r (- 52 +width+)) +expt-2-52d0+))
               (uniform_real ()
                 (->double/52 (setf r (random_next r)))))
        UNMODIFIED-REST-OF-THE-BODY))))

Kind regards,
Jan
|
From: Paul K. <pv...@pv...> - 2013-11-09 21:31:15
Attachments:
mcmc.lisp
|
William Cushing wrote:
> I'm trying to coerce SBCL into competing with C on a simple form of
> MCMC. Using the library provided implementations of random, SBCL is
> only 50-ish% slower than GCC (~2.1s versus ~1.5s), which is incredible,
> because GLIBC random should just be the typical multiply-accumulate,
> whereas SBCL seems to be using Mersenne twister, according to the docs.
>
> The reason seems to be that GCC can't inline code in LIBC;

That's not quite true. You can disassemble the code for a short fixed-size memcpy, for example. The reason is more that no gcc maintainer has found a use for such inlining.

[...]

> Inlining Knuth's generator drops the gcc time down to 0.83, making SBCL
> slower by a factor of 2.5 or so.

[...]

> So then I tried to backport the C implementation of Knuth's PRNG to
> SBCL. In so doing, I slowed down the SBCL implementation by about 11%.
> That is despite using a more hardware efficient algorithm, and making
> extremely evident that I want it to be inlined into the MCMC loop
> (though I haven't yet gone to the level of making the random procedures
> be declared as macros...).

As previously pointed out, there are two immediate issues here, both fixed by moving the state to a local variable: access to special variables is slow, and values in special variables are boxed. So, while you do get modular arithmetic, it drowns in boxing, unboxing, and dynamic variable lookups.

On a more general point, the original code was full of declarations. I've attached the version I would try to work with. There are no global optimisation declaims: I find it best to decide which inner loops are important, and leave the rest at default settings, particularly during development. There are also no (safety 0) declarations, and very few [3] type declarations.

The generic MCMC loop becomes

;;; Generic MCMC loops, parameterised on the routine to get a single
;;; random bit and random bits w/ P[bit = 0] = p0, the single argument.
;;;
;;; Runs for N iterations before returning the count vector.
(declaim (maybe-inline generic-mcmc))
(defun generic-mcmc (N uniform-bit biased-bit)
  (declare (type uint64-t N))
  [...])

and switching the PRNG is now trivial:

;; SBCL's MT
(defun mt-mcmc (N)
  (declare (optimize speed (safety 0))
           (inline generic-mcmc))
  (let ((state *random-state*))
    (generic-mcmc N
                  (lambda () (random 2 state))
                  (lambda (p0) (if (< (random 1d0 state) p0) 1 0)))))

;;; Knuth LCG
(defun lcg-mcmc (N)
  (declare (optimize speed)
           (inline generic-mcmc))
  (let ((state (make-generator)))
    (generic-mcmc N
                  (lambda () (select (setf state (random-next state)) 1))
                  (lambda (p0)
                    (let ((r (->double/52 (setf state (random-next state)))))
                      (if (> r p0) 1 0))))))

Running the C program on my machine (E5-4617) for 50000000 iterations takes 0.98 seconds. On the same machine, the MT version (SBCL 1.1.9.xx) takes 3.73 sec, and the LCG 1.25 sec.

I also added code to hook into dSFMT, a fast SIMD-oriented variant of MT that is geared toward generating large chunks of doubles. The dynamic library was built by executing

clang -fPIC -O3 -msse2 -DHAVE_SSE2 -finline-functions -fomit-frame-pointer -fno-strict-aliasing dSFMT.c -DDSFMT_MEXP=19937 --shared -o dSFMT-19937.so

in the dSFMT-2.2.2 source tree. With that stronger PRNG, the same benchmark takes only 1.49 seconds.

For the same very simple LCG, memory-safe code in SBCL is within 25% of clang on my machine… and makes it trivial to hook in a more robust PRNG at little overhead.

Paul Khuong
|
From: Paul K. <pv...@pv...> - 2013-11-09 21:36:32
|
Paul Khuong wrote:
[...]
> On a more general point, the original code was full of declarations.
> I've attached the version I would try to work with. There are no global
> optimisation declaim: I find it best to decide which inner loops are
> important, and leave the rest at default settings, particularly during
> development. There are also no (safety 0) declaration, and very few [3]
> type declarations.
[...]
> and switching the PRNG is now trivial:
> ;; SBCL's MT
> (defun mt-mcmc (N)
>   (declare (optimize speed (safety 0))

Cough (: That paste is from an older buffer in which I made sure that the declaration had basically no effect.

Paul Khuong
|
From: Jan M. <jmo...@te...> - 2013-11-09 22:05:25
Attachments:
mcmc-pkhuong.png
mcmc.benchmark.lisp
|
On Sat, 2013-11-09 at 16:30 -0500, Paul Khuong wrote:
> Running the C program on my machine (E5-4617) for 50000000 iterations
> takes 0.98 seconds. On the same machine, the MT version (SBCL 1.1.9.xx)
> takes 3.73 sec, and the LCG 1.25 sec.

I put the different variants into an sb-benchmarks [1] benchmark and produced the attached result (note the differences in consing (green) and code size (blue)). This is on an older x86 machine.

Kind regards,
Jan

[1] https://github.com/scymtym/sb-benchmarks
|
From: Paul K. <pv...@pv...> - 2013-11-09 22:10:22
|
Jan Moringen wrote:
> On Sat, 2013-11-09 at 16:30 -0500, Paul Khuong wrote:
>> Running the C program on my machine (E5-4617) for 50000000 iterations
>> takes 0.98 seconds. On the same machine, the MT version (SBCL 1.1.9.xx)
>> takes 3.73 sec, and the LCG 1.25 sec.
>
> I put the different variants into an sb-benchmarks [1] benchmark and
> produced the attached result (Note the differences in consing (green)
> and code size (blue)).
>
> This is on an older x86 machine.

The output format is nice, but I'm not sure the results mean anything: portions of the code assume 64 bit words.

Paul Khuong
|
From: Jan M. <jmo...@te...> - 2013-11-09 22:19:20
|
On Sat, 2013-11-09 at 17:10 -0500, Paul Khuong wrote:
> Jan Moringen wrote:
>> I put the different variants into an sb-benchmarks [1] benchmark and
>> produced the attached result (Note the differences in consing (green)
>> and code size (blue)).
>>
>> This is on an older x86 machine.
>
> The output format is nice, but I'm not sure the results mean anything:
> portions of the code assume 64 bit words.

Yeah … I have to admit this was mainly intended as a cheap plug :)

But out of curiosity: at which point does the computation become wrong instead of just very slow?

Or does it just assume 64 bit words in the sense that performance will be terrible otherwise?

Kind regards,
Jan
|
From: Paul K. <pv...@pv...> - 2013-11-09 22:21:02
|
Jan Moringen wrote:
> On Sat, 2013-11-09 at 17:10 -0500, Paul Khuong wrote:
>> The output format is nice, but I'm not sure the results mean anything:
>> portions of the code assume 64 bit words.
>
> Yeah … I have to admit this was mainly intended as a cheap plug :)
>
> But out of curiosity: at which point does the computation get wrong
> instead of just very slow?
>
> Or does it just assume 64 bit words in the sense that performance will
> be terrible otherwise?

I'm pretty sure it's just the latter… especially now that the code is safe (:

Paul
|
From: Paul K. <pv...@pv...> - 2013-11-12 07:58:45
|
William Cushing wrote:
> I would not have suspected a global var of being an automatic
> performance hit in an otherwise evidently machine-aware implementation.
> I guess that's a difference due to supporting GC.

More or less. Mostly, this is about having a generic representation for polymorphic operations. Polymorphic but statically typed implementations like MLton show how to work around that with whole-program compilation.

> Certainly for parallelization purposes it is better anyways to pass the
> random state through parameters...

Specials are not global and also double as thread-local bindings.

> I was also considering flattening the CPTs; I was hoping that would be
> unnecessary as far as access costs go. Common LISP does permit nested
> arrays to be implemented as contiguous blocks of storage in, for
> example, row-major order, right?
>
> It is just that SBCL is making life easier for itself by treating arrays
> of arrays with a level of run-time indirection?

CL pretty much mandates contiguous storage in row-major order. It also forces support for some sort of indirection, because of displaced arrays. SBCL obeys these two masters by representing non-vector arrays with a header that points into a vector in which data is stored in row-major order. This indirection happens even for simple arrays (but not simple vectors). It's also a single indirection, always to the same location, so only a very low-level (lower than caching, for instance) issue. It would probably not be *too* hard of a change to instead represent such non-vector simple-arrays as

  [header][rank][shape …][data …],

but, like I said, it's a very low-level issue.
> Concerning the array inside the following type:
>
> (define-alien-type dsfmt-t
>   (struct dsfmt-t
>     (status (array double-float #.(* 2 (1+ +dsfmt-n+))))
>     (idx int)))
>
> It looks like such alien types can be instantiated from the lisp side in
> the obvious way, but one has to use:
>   sb-sys:vector-sap and sb-sys:with-pinned-objects
> whenever passing to the C side?

No. That's for passing a specialised lisp array to C. This is unrelated to the inline sequence of doubles in the dsfmt_t struct.

> Is there any vaguely more portable/standard common lisp idiom for
> passing off stuff to C land, i.e., for inhibiting GC and grabbing raw
> pointers?

There's static-vector. I believe LiamH also has some stuff in Antik (?) to implement pinning on a couple of implementations, either with pinning constructs, for free on non-moving implementations, or by disabling GC.

> (Is there some macro package that abstracts over, say, Allegro CL's
> interface and SBCL's interface? Yes, I could build it myself, but that
> wouldn't exactly fit the bill of "at least vaguely portable/standard".))

CFFI implements a common foreign function interface on top of pretty much all contemporary implementations.

> I wanted to give SBCL maximum opportunity to compete with C, and since I
> of course had my C implementation on hand, I just edited the syntax of
> all the declarations/compiler-flags to fit SBCL's syntax for the same
> knowledge.

[...]

> However, the mission here is not functionality, but raw performance.
> The point is to test the feasibility of using different systems as
> either/both target/implementation languages for reimplementing a
> high-level probabilistic modeling language called BLOG. BLOG, short for
> Bayesian Logic, is a cool notion...
> http://sites.google.com/site/bloginference/

[...]

> One obvious idea is to add compiler functionality. I'd like to see the
> implementation language of the compiler be a lisp.
> I'd also like, if at
> all possible, for the target language to be a lisp as well (because
> generating lisp from lisp is by far the easiest form of code generation).

OK. I really don't know ACL's performance characteristics, but, if you want SBCL to give you good code, switch to lexical variables and functions. CL offers a lot of late binding to programmers, and that interferes with compiling for performance. The SBCL model is basically to try and be clever within stuff that can't be rebound (e.g. a single function), but not across (e.g. two distinct top-level functions). Type declarations can help a bit, but not overly much: full calls and returns always obey the same representation-generic calling convention, and other externally visible values are always boxed in the generic representation. This is partly caused by the specification and not just lack of manpower, but that's another discussion. (I'm skipping compilation units because SBCL ignores them anyway.)

This can all be avoided by moving everything to local bindings (LET/LET* and FLET/LABELS), a style that is somewhat annoying to work with by hand -- although it helps to break definitions up only to inline them in the final body, as I did in my code snippet -- but easily generated. The effect is that a lot of flow analyses will come for free, but, more importantly, that all representations will be specialised, much like in a C program. Word-sized integers will be represented as either fixnums or unboxed signed/unsigned words, and doubles and singles as raw machine floats; in all cases, they will be stored and operated on in registers, and spilled to the stack if necessary, as native values. As long as nothing is passed to an outside (Lisp) function, variables can use native representations if it makes sense (e.g. symbols will still be pointers to symbols).
Pointers to allocated objects will always remain pointers, though: much like C compilers from the 80's or early 90's, SBCL doesn't convert aggregates (structures or small arrays) into a fixed number of local variables, even when passed as arguments. However, arrays and structures can be declared to hold unboxed values; that's not contagious.

The only other objects with generic representations will be the arguments for the ball-of-definitions function and its return value(s). The former don't matter too much: they should always be unboxed in the function prologue if necessary. The latter might… except if arrays are returned: arrays are always pointers (boxed), but their contents are unboxed depending on their element type.

One big advantage is that, for code that doesn't really manipulate first-class functions (some can be compiled away via constant propagation), each local function gets a specialised call and return convention. The convention will use however many registers are necessary for both arguments and (multiple) return values, and surrounding lexical variables don't even have to be shuffled about. I don't think state-of-the-art C compilers do that, even for file-local functions; it comes for free given the way functions are represented internally, at the expense of some bad asymptotics on compile times.

Contrary to the aforementioned older compilers, SBCL tends to fare better on code that declares fresh lexical bindings and returns multiple values aggressively, rather than code that assigns a lot to a few variables that surround everything. So, the shape of generated functions should probably be something like

(let ([parameters and variables that are live across MCMC loops])
  (declare [type declarations for parameters, assignment targets, and opaque computations])
  (labels ([functions])
    (declare [ftype declarations])
    (loop ... ; loop repeat counts down, which saves a register.
      [nested let/let*/labels]
      [assign loop-carried values])))

With such a structure, it usually suffices to declare types for argument types and hard-to-deduce return types (including local functions), and for variables that are assigned to (or with hard-to-deduce return types). SBCL and CMUCL seem to be in the minority of CL implementations in that they really can exploit range information: types aren't only about representation, and it is useful to, e.g., declare that a double-float value is actually strictly positive with (double-float (0d0)). This would let LOG and SQRT be computed directly, without worrying about complex return values. Similar things happen for floats of small magnitude.

[...]

> Getting the inner loop (MCMC) as fast as possible is reasonably
> important. As far as target language goes, 10% or 20% slower than C is
> perhaps good enough, but 100% slower is not.

[...]

ISTM most of that will quickly drown in the generation of random variates, and that doesn't need much code generation… This test case is already strongly affected by a switch to dSFMT, so this 25% difference ought to go down very quickly. Still, there are also tools that pretty-print s-expressions into C; this might be a more easily portable (and maintained) alternative. If that's not enough, you might want to look into running multiple loops in lockstep with SIMD processing (e.g., with SBCL's support for SSE). That'll definitely move the bottleneck to the generation side ;)

> Thanks for the prompt response and hard work! Very instructive perusing
> through all the code...

My pleasure; I had an hour to kill.

BTW, there's an initialisation bug in the dSFMT version: either the buffer should be prefilled with random values, or the initial index set to a large value.

Paul Khuong
|
From: William C. <wil...@gm...> - 2013-11-12 13:11:31
|
Thanks again! This is a gold mine :)

-Will

On Mon, Nov 11, 2013 at 11:58 PM, Paul Khuong <pv...@pv...> wrote:
> [full quote of the previous message elided]
|
From: Liam H. <ln...@he...> - 2013-11-12 17:50:35
|
On Tue, Nov 12, 2013 at 2:58 AM, Paul Khuong <pv...@pv...> wrote:
> William Cushing wrote:
>
>> Is there any vaguely more portable/standard common lisp idiom for
>> passing off stuff to C land, i.e., for inhibiting GC and grabbing raw
>> pointers?
>
> There's static-vector. I believe LiamH also has some stuff in Antik (?)
> to implement pinning on a couple implementations, either with pinning
> constructs, for free on non-moving implementations, or by disabling GC.

Antik is now built on top of static-vector (predecessor systems used pinning, but that was SBCL-specific, and I decided static-vector was a more portable choice). It is optional; the fallback in its absence is to copy over the array. However, by essentially "living in the foreign world" (i.e., not using the arrays on the CL side) you can avoid doing that. I try to make it easy to do so by providing higher-level functions that access the arrays on the foreign side. Other than that, I use trivial-garbage to handle foreign GC.

Liam
|