Thanks again! This is a gold mine :)
On Mon, Nov 11, 2013 at 11:58 PM, Paul Khuong <pvk@...> wrote:
> William Cushing wrote:
>> I would not have suspected a global var of being an automatic
>> performance hit in an otherwise evidently machine-aware implementation.
>> I guess that's a difference due to supporting GC.
> More or less. Mostly, this is about having a generic representation for
> polymorphic operations. Polymorphic but statically typed implementations
> like MLton show how to work around that with whole-program compilation.
>> Certainly for parallelization purposes it is better anyway to pass the
>> random state through parameters...
> Specials are not global and also double as thread-local bindings.
>> I was also considering flattening the CPTs; I was hoping that would be
>> unnecessary as far as access costs go. Common Lisp does permit nested
>> arrays to be implemented as contiguous blocks of storage in, for
>> example, row-major order, right?
>> It is just that SBCL is making life easier for itself by treating arrays
>> of arrays with a level of run-time indirection?
> CL pretty much mandates contiguous storage in row-major order. It also
> forces support for some sort of indirection, because of displaced arrays.
> SBCL obeys these two masters by representing non-vector arrays with a
> header that points into a vector in which data is stored in row-major
> order. This indirection happens even for simple arrays (but not simple
> vectors). It's also a single indirection, always to the same location, so
> it's only a very low-level issue (below caching effects, for instance). It would
> probably not be *too* hard of a change to instead represent such non-vector
> simple-arrays as
> [header][rank][shape …][data…],
> but, like I said, it's a very low level issue.
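> For instance (a minimal sketch; ROW-MAJOR-AREF and ARRAY-ROW-MAJOR-INDEX
> are standard CL and expose that contiguous row-major storage directly):

```lisp
;; The cell at (row 1, column 2) of a 2x3 array lives at flat
;; row-major index 1*3 + 2 = 5 in the underlying storage vector.
(defun row-major-demo ()
  (let ((a (make-array '(2 3) :element-type 'double-float
                       :initial-contents '((1d0 2d0 3d0)
                                           (4d0 5d0 6d0)))))
    (values (array-row-major-index a 1 2) ; flat index of (aref a 1 2)
            (row-major-aref a 5))))       ; same cell, flat access
```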
> Concerning the array inside the following type:
>> (define-alien-type dsfmt-t
>> (struct dsfmt-t
>> (status (array double-float #.(* 2 (1+ +dsfmt-n+))))
>> (idx int)))
>> It looks like such alien types can be instantiated from the lisp side in
>> the obvious way, but one has to use:
>> sb-sys:vector-sap and sb-sys:with-pinned-objects
>> whenever passing to the C side?
> No. That's for passing a specialised lisp array to C. This is unrelated to
> the inline sequence of doubles in the dSFMT_t struct.
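> To make the contrast concrete, here's a sketch (SBCL-specific;
> "sum_doubles" is a hypothetical C function with signature
> double sum_doubles(double *data, int n)) of the case that *does*
> need pinning -- a specialised lisp vector. The alien struct needs
> neither WITH-PINNED-OBJECTS nor VECTOR-SAP, because it already lives
> outside the moving heap.

```lisp
;; A lisp vector must be pinned while C holds a pointer to its data;
;; the GC may not move V inside the WITH-PINNED-OBJECTS extent.
(defun call-c-with-lisp-vector (v)
  (declare (type (simple-array double-float (*)) v))
  (sb-sys:with-pinned-objects (v)
    (sb-alien:alien-funcall
     (sb-alien:extern-alien "sum_doubles"
                            (function sb-alien:double
                                      sb-sys:system-area-pointer
                                      sb-alien:int))
     (sb-sys:vector-sap v)
     (length v))))
```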
>> Is there any vaguely more portable/standard common lisp idiom for
>> passing off stuff to C land, i.e., for inhibiting GC and grabbing raw
>> pointers?
> There's static-vectors. I believe LiamH also has some stuff in Antik (?) to
> implement pinning on a couple implementations, either with pinning
> constructs, for free on non-moving implementations, or by disabling GC.
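> For example, with static-vectors (a sketch; MAKE-STATIC-VECTOR,
> STATIC-VECTOR-POINTER and FREE-STATIC-VECTOR are that library's
> public API, and the wrapper function here is invented):

```lisp
;; The buffer is allocated outside the moving heap, so its address is
;; stable for its whole lifetime -- no pinning needed at all.
(defun call-with-static-buffer (n fn)
  (let ((v (static-vectors:make-static-vector
            n :element-type 'double-float :initial-element 0d0)))
    (unwind-protect
         ;; FN receives the lisp view and a foreign pointer to the
         ;; same storage; the pointer can be handed to C freely.
         (funcall fn v (static-vectors:static-vector-pointer v))
      ;; Static vectors are manually managed.
      (static-vectors:free-static-vector v))))
```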
>> (Is there some macro package that abstracts over, say, Allegro CL's
>> interface and SBCL's interface? Yes, I could build it myself, but that
>> wouldn't exactly fit the bill of "at least vaguely portable/standard".))
> CFFI implements a common foreign function interface on top of pretty much
> all contemporary implementations.
>> I wanted to give SBCL maximum opportunity to compete with C, and since I
>> of course had my C implementation on hand, I just edited the syntax of
>> all the declarations/compiler-flags to fit SBCL's syntax for the same
>> However, the mission here is not functionality, but raw performance.
>> The point is to test the feasibility of using different systems as
>> either/both target/implementation languages for reimplementing a
>> high-level probabilistic modeling language called BLOG. BLOG, short for
>> Bayesian Logic, is a cool notion...
>> One obvious idea is to add compiler functionality. I'd like to see the
>> implementation language of the compiler be a lisp. I'd also like, if at
>> all possible, for the target language to be a lisp as well (because
>> generating lisp from lisp is by far the easiest form of code generation).
> OK. I really don't know ACL's performance characteristics, but, if you
> want SBCL to give you good code, switch to lexical variables and functions.
> CL offers a lot of late binding to programmers, and that interferes with
> compiling for performance. The SBCL model is basically to try and be clever
> within stuff that can't be rebound (e.g. a single function), but not across
> (e.g. two distinct top level functions). Type declarations can help a bit,
> but not overly much: full calls and returns always obey the same
> representation-generic calling convention, and other externally visible
> values are always boxed in the generic representation. This is partly
> caused by the specification and not just lack of manpower, but that's
> another discussion. (I'm skipping compilation units because SBCL ignores
> them anyway.)
> This can all be avoided by moving everything to local bindings (LET/LET*
> and FLET/LABELS), a style that is somewhat annoying to work with by hand --
> although it helps to break definitions up only to inline them in the final
> body, as I did in my code snippet -- but easily generated. The effect is
> that a lot of flow analyses will come for free, but, more importantly, that
> all representations will be specialised, much like in a C program.
> Word-sized integers will be represented as either fixnums or unboxed
> signed/unsigned words and doubles and singles as raw machine floats; in all
> cases, they will be stored and operated on in registers, and spilled to the
> stack if necessary, as native values. As long as nothing is passed to an
> outside (Lisp) function, variables can use native representations if it
> makes sense (e.g. symbols will still be pointers to symbols).
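> A minimal sketch of that shape (the function and its body are
> invented for illustration):

```lisp
;; Everything is lexical: ACC never escapes, so SBCL keeps it as a raw
;; machine double in a float register, and SQUARE-PLUS-ONE gets a
;; specialised, unboxed calling convention.
(defun iterate-map (n)
  (declare (type (unsigned-byte 32) n)
           (optimize speed))
  (let ((acc 0d0))
    (declare (type double-float acc))
    (labels ((square-plus-one (x)
               (declare (type double-float x))
               (+ (* x x) 1d0)))
      (loop repeat n
            do (setf acc (square-plus-one acc)))
      acc)))
```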
> Pointers to allocated objects will always remain pointers though: much
> like C compilers from the 80's or early 90's, SBCL doesn't convert
> aggregates (structures or small arrays) into a fixed number of local
> variables, even when passed as arguments. However, arrays and structures
> can be declared to hold unboxed values; that's not contagious.
> The only other objects with generic representations will be the arguments
> for the ball-of-definitions function and its return value(s). The former
> don't matter too much: they should always be unboxed in the function
> prologue if necessary. The latter might… except if arrays are returned:
> arrays are always pointers (boxed) but their contents are unboxed depending
> on their element type.
> One big advantage is that, for code that doesn't really (some can be
> compiled away via constant propagation) manipulate first-class functions,
> each local function gets a specialised call and return convention. The
> convention will use however many registers are necessary for both arguments
> and (multiple) return values, and surrounding lexical variables don't even
> have to be shuffled about. I don't think state-of-the-art C compilers do
> that, even for file-local functions; it comes for free given the way
> functions are represented internally, at the expense of some bad
> asymptotics on compile times.
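> For example (a sketch; the statistics are incidental, the point is
> the local function returning two values without consing):

```lisp
;; SUM-AND-SSQ gets its own return convention: both doubles come back
;; in float registers, unboxed.
(defun sample-variance (data)
  (declare (type (simple-array double-float (*)) data)
           (optimize speed))
  (labels ((sum-and-ssq (v)
             (let ((sum 0d0) (ssq 0d0))
               (declare (type double-float sum ssq))
               (loop for x of-type double-float across v
                     do (incf sum x)
                        (incf ssq (* x x)))
               (values sum ssq))))
    (multiple-value-bind (sum ssq) (sum-and-ssq data)
      (let ((n (float (length data) 1d0)))
        ;; population variance: E[X^2] - E[X]^2
        (- (/ ssq n) (expt (/ sum n) 2))))))
```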
> Contrary to the aforementioned older compilers, SBCL tends to fare better
> on code that declare fresh lexical bindings and returns multiple values
> aggressively, rather than code that assigns a lot to a few variables that
> surround everything. So, the shape of generated functions should probably
> be something like
> (let ([parameters and variables that are live across MCMC loops])
>   (declare [type declarations for parameters, assignment targets, and
>             opaque computations])
>   (labels ([functions])
>     (declare [ftype declarations])
>     (loop ... ; loop repeat counts down, which saves a register.
>           [nested let/let*/labels]
>           [assign loop-carried values])))
> With such a structure, it usually suffices to declare types for arguments,
> for hard-to-deduce return types (including those of local functions), and
> for variables that are assigned to. SBCL
> and CMUCL seem to be in the minority of CL implementations in that they
> really can exploit range information: types aren't only about
> representation, and it is useful to, e.g., declare that a double-float
> value is actually strictly positive with (double-float (0d0)). This would
> let LOG and SQRT be computed directly, without worrying about complex
> return values. Similar things happen for floats of small magnitude.
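> E.g. (a sketch):

```lisp
;; With X declared strictly positive, (LOG X) cannot produce a
;; complex result, so SBCL compiles it straight to the machine
;; routine with no runtime dispatch on the result type.
(defun neg-log (x)
  (declare (type (double-float (0d0)) x) ; exclusive lower bound 0d0
           (optimize speed))
  (- (log x)))
```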
>> Getting the inner loop (MCMC) as fast as possible is reasonably
>> important. As far as target language goes, 10% or 20% slower than C is
>> perhaps good enough, but 100% slower is not.
> ISTM most of that will quickly drown in the generation of random variates,
> and that doesn't need much code generation… This test case is already
> strongly affected by a switch to dSFMT. So this 25% difference ought to go
> down very quickly. Still, there are also tools that pretty print
> s-expressions into C; this might be a more easily portable (and maintained)
> alternative. If that's not enough, you might want to look into running
> multiple loops in lockstep with SIMD processing (e.g., with SBCL's support
> for SSE). That'll definitely move the bottleneck to the generation side ;)
>> Thanks for the prompt response and hard work! Very instructive perusing
>> through all the code...
> My pleasure; I had an hour to kill.
> BTW, there's an initialisation bug in the dSFMT version: either the buffer
> should be prefilled with random values, or the initial index set to a large
> value so that the first draw forces a refill.
> Paul Khuong