Thanks again!  This is a gold mine :)

-Will


On Mon, Nov 11, 2013 at 11:58 PM, Paul Khuong <pvk@pvk.ca> wrote:
William Cushing wrote:
I would not have suspected a global var of being an automatic
performance hit in an otherwise evidently machine-aware implementation.
I guess that's a difference due to supporting GC.

More or less. Mostly, this is about having a generic representation for polymorphic operations. Polymorphic but statically typed implementations like MLton show how to work around that with whole-program compilation.


Certainly for parallelization purposes it is better anyways to pass the
random state through parameters...

Specials are not just globals: their dynamic bindings also double as thread-local bindings.
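For example, each worker thread can shadow the state with its own binding (a minimal sketch; MAKE-CHAIN-STATE and RUN-CHAIN are hypothetical stand-ins for your RNG state and sampler):

(defvar *rng-state* (make-chain-state))   ; global (dynamic) binding

(defun worker ()
  ;; LET establishes a fresh dynamic binding of the special; that binding
  ;; is visible only to the thread that created it.
  (let ((*rng-state* (make-chain-state)))
    (run-chain *rng-state*)))

(sb-thread:make-thread #'worker)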


I was also considering flattening the CPTs; I was hoping that would be
unnecessary as far as access costs go.  Common LISP does permit nested
arrays to be implemented as contiguous blocks of storage in, for
example, row-major order, right?

It is just that SBCL is making life easier for itself by treating arrays
of arrays with a level of run-time indirection?

CL pretty much mandates contiguous storage in row-major order. It also forces support for some sort of indirection, because of displaced arrays. SBCL obeys these two masters by representing non-vector arrays with a header that points into a vector in which the data is stored in row-major order. This indirection happens even for simple arrays (but not simple vectors). It's also a single indirection, always to the same location, so it's only a very low-level issue (lower level than caching, for instance). It would probably not be *too* hard a change to instead represent such non-vector simple-arrays as
  [header][rank][shape …][data…],
but, like I said, it's a very low level issue.
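Concretely, in portable CL (nothing SBCL-specific here):

(let ((cpt (make-array '(3 4) :element-type 'double-float
                              :initial-element 0d0)))
  ;; Row-major order is mandated: element (i j) lives at flat index (+ (* i 4) j).
  (setf (row-major-aref cpt (array-row-major-index cpt 1 2)) 1d0)
  ;; A displaced array is a flat view onto the same storage; supporting
  ;; these is what forces some indirection in the representation.
  (let ((flat (make-array 12 :element-type 'double-float :displaced-to cpt)))
    (aref flat 6)))   ; => 1.0d0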


Concerning the array inside the following type:

(define-alien-type dsfmt-t
     (struct dsfmt-t
             (status (array double-float #.(* 2 (1+ +dsfmt-n+))))
             (idx int)))


It looks like such alien types can be instantiated from the lisp side in
the obvious way, but one has to use:
   sb-sys:vector-sap and sb-sys:with-pinned-objects
whenever passing to the C side?

No. That's for passing a specialised lisp array to C. This is unrelated to the inline sequence of doubles in the dSFMT_t struct.
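When you do need to hand a specialised Lisp vector to C, the idiom is roughly this (sketch; FILL-FROM-C stands in for some alien routine):

(let ((buf (make-array 1024 :element-type 'double-float)))
  ;; Pin BUF so the GC can't move it while C holds a raw pointer into it.
  (sb-sys:with-pinned-objects (buf)
    (fill-from-c (sb-sys:vector-sap buf) (length buf))))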


Is there any vaguely more portable/standard common lisp idiom for
passing off stuff to C land, i.e., for inhibiting GC and grabbing raw
pointers?

There's static-vectors. I believe LiamH also has some stuff in Antik (?) to implement pinning on a couple of implementations, either with pinning constructs, for free on non-moving implementations, or by disabling GC.
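With static-vectors, the same call would look something like this (sketch, reusing the hypothetical FILL-FROM-C):

(static-vectors:with-static-vector (buf 1024 :element-type 'double-float)
  ;; BUF lives outside the moving heap, so its address is stable and no
  ;; pinning is needed around the foreign call.
  (fill-from-c (static-vectors:static-vector-pointer buf) (length buf)))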


(Is there some macro package that abstracts over, say, Allegro CL's
interface and SBCL's interface?  Yes, I could build it myself, but that
wouldn't exactly fit the bill of "at least vaguely portable/standard".)

CFFI implements a common foreign function interface on top of pretty much all contemporary implementations.
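A rough CFFI sketch of the earlier struct plus one of dSFMT's exported fill functions (layout/alignment details glossed over; double-check them before relying on this):

(cffi:defcstruct dsfmt-t
  (status :double :count #.(* 2 (1+ +dsfmt-n+)))
  (idx    :int))

(cffi:defcfun ("dsfmt_fill_array_close_open" dsfmt-fill-array) :void
  (dsfmt :pointer)
  (array :pointer)
  (size  :int))

(cffi:with-foreign-objects ((state '(:struct dsfmt-t))
                            (buf :double 1024))
  ;; ... seed STATE first, then:
  (dsfmt-fill-array state buf 1024))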


I wanted to give SBCL maximum opportunity to compete with C, and since I
of course had my C implementation on hand, I just edited the syntax of
all the declarations/compiler-flags to fit SBCL's syntax for the same
knowledge.

[...]


However, the mission here is not functionality, but raw performance.
The point is to test the feasibility of using different systems as
either/both target/implementation languages for reimplementing a
high-level probabilistic modeling language called BLOG.  BLOG, short for
Bayesian Logic, is a cool notion...
http://sites.google.com/site/bloginference/

[...]


One obvious idea is to add compiler functionality.  I'd like to see the
implementation language of the compiler be a lisp.  I'd also like, if at
all possible, for the target language to be a lisp as well (because
generating lisp from lisp is by far the easiest form of code generation).

OK. I really don't know ACL's performance characteristics, but, if you want SBCL to give you good code, switch to lexical variables and functions. CL offers a lot of late binding to programmers, and that interferes with compiling for performance. The SBCL model is basically to try and be clever within stuff that can't be rebound (e.g. a single function), but not across (e.g. two distinct top level functions). Type declarations can help a bit, but not overly much: full calls and returns always obey the same representation-generic calling convention, and other externally visible values are always boxed in the generic representation. This is partly caused by the specification and not just lack of manpower, but that's another discussion. (I'm skipping compilation units because SBCL ignores them anyway.)

This can all be avoided by moving everything to local bindings (LET/LET* and FLET/LABELS), a style that is somewhat annoying to work with by hand -- although it helps to break definitions up only to inline them in the final body, as I did in my code snippet -- but easy to generate. The effect is that a lot of flow analyses will come for free, but, more importantly, that all representations will be specialised, much like in a C program. Word-sized integers will be represented as either fixnums or unboxed signed/unsigned words, and doubles and singles as raw machine floats; in all cases, they will be stored and operated on in registers, and spilled to the stack if necessary, as native values. As long as nothing is passed to an outside (Lisp) function, variables can use native representations whenever that makes sense (e.g. symbols will still be pointers to symbols).

Pointers to allocated objects will always remain pointers though: much like C compilers from the 80's or early 90's, SBCL doesn't convert aggregates (structures or small arrays) into a fixed number of local variables, even when passed as arguments. However, arrays and structures can be declared to hold unboxed values; that's not contagious.
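For example, both of these are stored unboxed (NODE is just an illustrative structure):

;; A specialised array stores raw doubles, not pointers to boxed floats.
(make-array 16 :element-type 'double-float :initial-element 0d0)

;; Typed slots are stored unboxed inside the structure itself.
(defstruct node
  (weight 0d0 :type double-float)
  (count  0   :type fixnum))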

The only other objects with generic representations will be the arguments for the ball-of-definitions function and its return value(s). The former don't matter too much: they should always be unboxed in the function prologue if necessary. The latter might matter… except if arrays are returned: arrays are always pointers (boxed), but their contents are unboxed depending on their element type.

One big advantage is that, for code that doesn't really manipulate first-class functions (some uses can be compiled away via constant propagation), each local function gets a specialised call and return convention. The convention will use however many registers are necessary for both arguments and (multiple) return values, and surrounding lexical variables don't even have to be shuffled about. I don't think state-of-the-art C compilers do that, even for file-local functions; it comes for free given the way functions are represented internally, at the expense of some bad asymptotics on compile times.
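A small self-contained illustration (SUMMARY and MEAN-AND-VAR are hypothetical names): the local function takes raw doubles plus an integer and hands back two raw doubles through that specialised local convention.

(defun summary (xs)
  (declare (type (simple-array double-float (*)) xs))
  (labels ((mean-and-var (sum sum-sq n)
             (declare (type double-float sum sum-sq)
                      (type (integer 1) n))
             (let ((mean (/ sum n)))
               ;; Two double-float results, no boxing or consing on return.
               (values mean (- (/ sum-sq n) (* mean mean))))))
    (let ((sum 0d0) (sum-sq 0d0))
      (declare (type double-float sum sum-sq))
      (loop for x across xs
            do (incf sum x) (incf sum-sq (* x x)))
      (mean-and-var sum sum-sq (max 1 (length xs))))))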

Contrary to the aforementioned older compilers, SBCL tends to fare better on code that declares fresh lexical bindings and returns multiple values aggressively, rather than on code that assigns a lot to a few variables that surround everything. So, the shape of generated functions should probably be something like

(let ([parameters and variables that are live across MCMC loops])
  (declare [type declarations for parameters, assignment targets, and opaque computations])
  (labels ([functions])
    (declare [ftype declarations])
    (loop ... ; loop repeat counts down, which saves a register.
      [nested let/let*/labels]
      [assign loop-carried values])))

With such a structure, it usually suffices to declare argument types, hard-to-deduce return types (including those of local functions), and the types of variables that are assigned to or bound to hard-to-deduce values. SBCL and CMUCL seem to be in the minority of CL implementations in that they really can exploit range information: types aren't only about representation, and it is useful to, e.g., declare that a double-float value is actually strictly positive with (double-float (0d0)). This would let LOG and SQRT be computed directly, without worrying about complex return values. Similar things happen for floats of small magnitude.
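For instance (LOG-DENSITY is just an illustrative name):

(defun log-density (x)
  ;; Strictly positive: LOG and SQRT are then known to return reals,
  ;; so SBCL emits direct float code instead of handling complex results.
  (declare (type (double-float (0d0)) x))
  (- (log x) (sqrt x)))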

[...]


Getting the inner loop (MCMC) as fast as possible is reasonably
important.  As far as target language goes, 10% or 20% slower than C is
perhaps good enough, but 100% slower is not.

[...]

ISTM most of that will quickly drown in the generation of random variates, and that doesn't need much code generation… This test case is already strongly affected by a switch to dSFMT. So this 25% difference ought to go down very quickly. Still, there are also tools that pretty print s-expressions into C; this might be a more easily portable (and maintained) alternative.  If that's not enough, you might want to look into running multiple loops in lockstep with SIMD processing (e.g., with SBCL's support for SSE). That'll definitely move the bottleneck to the generation side ;)


Thanks for the prompt response and hard work!  Very instructive perusing
through all the code...

My pleasure; I had an hour to kill.

BTW, there's an initialisation bug in the dSFMT version: either the buffer should be prefilled with random values, or the initial index set to a large value.
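One way to apply that fix (sketch, reusing the dsfmt-t alien type above; *DSFMT-STATE* is a stand-in for wherever the state lives) is to let dSFMT's own seeding entry point, dsfmt_init_gen_rand, set up both the buffer and the index:

(define-alien-routine ("dsfmt_init_gen_rand" dsfmt-init-gen-rand) void
  (dsfmt (* dsfmt-t))
  (seed unsigned-int))

(defvar *dsfmt-state* (make-alien dsfmt-t))
(dsfmt-init-gen-rand *dsfmt-state* 1234)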

Paul Khuong