From: Jonathan A. <jon...@gm...> - 2013-02-23 16:45:02
|
Hi, Has any progress been made on merging Alexander Gavrilov's cl-simd (https://github.com/angavrilov/cl-simd) into SBCL? I'm interested in using SSE intrinsics in a 3D math library I'm writing and it'd be nice to support SBCL out of the box. If not, what's left to do? Jon |
From: Paul K. <pv...@pv...> - 2013-02-25 06:40:47
|
Jonathan Armond wrote:
> Has any progress been made on merging Alexander Gavrilov's cl-simd (https://github.com/angavrilov/cl-simd) into SBCL?

AFAIK, none.

> If not, what's left to do?

It's been a while, so take this with a grain of salt. We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at least during compilation. Now, we also like constant folding -- more generally, I dislike stuff that only exists as static information, without reflecting any runtime reality -- so if we want this (necessary for high-performance code[2]) optimisation to work, we now have to track FP/Int-ness at runtime.

[1] if only for the reg-reg moves we insert to shuffle values around.
[2] and otherwise, why would anyone bother with intrinsics?

Tracking FP/Int-ness at runtime isn't too much of an issue, but it does make it really clear that approximating FP/Int-ness with some sort of kludgey propagation is a Bad Idea. So, we'll have to involve the regular type system/type propagation logic. And this is where it gets hairy.

What happens with type union/intersection/negation of typed SIMD packets? I'm thinking of mostly replicating the specialised array logic, so, hopefully, I can leverage Christophe's brain when things go wrong.

What kind of interface can we expose to users, so that naïve code works, and isn't too horrible, but also for efficient code to remain convenient? Half-assing it with static types that correspond to nothing at runtime made this step easier to fake, but that doesn't work anymore. I believe I settled on an interface such that intrinsics accept any-typed SIMD packets, but return typed (specialised) ones.
This way, naïve users can declare their variables as default (any-typed) packets, while benefitting from type propagation, and without having to insert explicit FP<->Int casts: conversion from specialised to any-typed packets is always OK type-wise, and can be compiled into nothingness. Sophisticated users can still declare variables with explicit types: it'll help codegen, and lead to compile-time type errors instead of invisible pipeline-ping-ponging code. If necessary, they can insert *-to-fp and *-to-int casts (that compile into nothing as well). The casts could be inserted automatically and compiled away just the same, but I'm far from convinced this is a good idea: optional static type checking is something I really like about Python.

Now, I'm not a BDUF fan, but the FP/Int dichotomy is very much an artefact of contemporary SSE implementations. Other platforms (ARM I believe, and I wouldn't be surprised if PPC were similar but saner) or microarchitectures (e.g. I'd expect single and double precision operations not to mix some time soon) may well be different. So, I'd like to find a simple way to extend the approach I sketched above to a more generic set of SIMD types -- and, already, some operations distinguish between single/double floats, while, for others, we should always pretend the values are single floats, according to my optimisation guides. I'm pretty sure we can just add more specialised types, as for array types (but that means we can't have packets of integers… instead, we'd have a union type, like CL:STRING, and that's proven to be somewhat hairy). An incidental upside of the finer SIMD type system is that printing could exploit this information; I wouldn't wish float-as-hexdump reading skills on anyone.

There's my roadmap/braindump for the (hard) work remaining.
The rest mostly involves forward-porting angavrilov's instruction definition fixes, and putting a nice lispy interface out front (and this is where many more people can get easily involved). The problem is that other developers prefer to work on more interesting/useful stuff, and I have more pressing responsibilities, mostly related to my wishing to graduate ;) Paul Khuong |
From: Jonathan A. <jon...@gm...> - 2013-02-25 11:11:43
|
Paul Khuong <pv...@pv...> writes:
> It's been a while, so take this with a grain of salt.

You know a lot more about Common Lisp and SBCL than I do, so take anything I say with a shovel of salt.

> We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at least during compilation. Now, we also like constant folding -- more generally, I dislike stuff that only exists as static information, without reflecting any runtime reality -- so if we want this (necessary for high-performance code[2]) optimisation to work, we now have to track FP/Int-ness at runtime.
>
> [1] if only for the reg-reg moves we insert to shuffle values around.
> [2] and otherwise, why would anyone bother with intrinsics?
>
> Tracking FP/Int-ness at runtime isn't too much of an issue, but it does make it really clear that approximating FP/Int-ness with some sort of kludgey propagation is a Bad Idea. So, we'll have to involve the regular type system/type propagation logic. And this is where it gets hairy.

I wasn't previously aware of the bypass delays between the FP and Int domains - that is, e.g. mixing movapd with addps incurs a latency penalty - so I was assuming that it could be left to the programmer to call a typed intrinsic on a generic SSE type. But I wasn't accounting for your point [1], where the compiler has to know whether to generate e.g. movaps vs movdqa, or movups vs movdqu.

I'm not clear, though, on why we need to track FP/Int at runtime. Can we not just use static type declarations to generate appropriate move instructions for the domain, but default to movaps if no more information is available and accept the couple of clock cycles' cost? The move instructions will still work, and since the intrinsics are typed (e.g. sse:add-ps vs sse:add-pi32 from cl-simd), this is up to the user to get right.

> What happens with type union/intersection/negation of typed SIMD packets?
> I'm thinking of mostly replicating the specialised array logic, so, hopefully, I can leverage Christophe's brain when things go wrong.
>
> What kind of interface can we expose to users, so that naïve code works, and isn't too horrible, but also for efficient code to remain convenient? Half-assing it with static types that correspond to nothing at runtime made this step easier to fake, but that doesn't work anymore. I believe I settled on an interface such that intrinsics accept any-typed SIMD packets, but return typed (specialised) ones. This way, naïve users can declare their variables as default (any-typed) packets, while benefitting from type propagation, and without having to insert explicit FP<->Int casts: conversion from specialised to any-typed packets is always OK type-wise, and can be compiled into nothingness. Sophisticated users can still declare variables with explicit types: it'll help codegen, and lead to compile-time type errors instead of invisible pipeline-ping-ponging code. If necessary, they can insert *-to-fp and *-to-int casts (that compile into nothing as well). The casts could be inserted automatically and compiled away just the same, but I'm far from convinced this is a good idea: optional static type checking is something I really like about Python.
>
> Now, I'm not a BDUF fan, but the FP/Int dichotomy is very much an artefact of contemporary SSE implementations. Other platforms (ARM I believe, and I wouldn't be surprised if PPC were similar but saner) or microarchitectures (e.g. I'd expect single and double precision operations not to mix some time soon) may well be different. So, I'd like to find a simple way to extend the approach I sketched above to a more generic set of SIMD types -- and, already, some operations distinguish between single/double floats, while, for others, we should always pretend the values are single floats, according to my optimisation guides.
> I'm pretty sure we can just add more specialised types, as for array types (but that means we can't have packets of integers… instead, we'd have a union type, like CL:STRING, and that's proven to be somewhat hairy). An incidental upside of the finer SIMD type system is that printing could exploit this information; I wouldn't wish float-as-hexdump reading skills on anyone.

I'm not qualified to judge how to extend SBCL's type system, but having a generic SIMD type hierarchy would surely be useful, as it could provide the infrastructure for e.g. auto-vectorizing loops and reductions.

> There's my roadmap/braindump for the (hard) work remaining. The rest mostly involves forward-porting angavrilov's instruction definition fixes, and putting a nice lispy interface out front (and this is where many more people can get easily involved). The problem is that other developers prefer to work on more interesting/useful stuff, and I have more pressing responsibilities, mostly related to my wishing to graduate ;)

What is interesting/useful depends upon your viewpoint ;) I'd be happy to throw in some work on this, but my experience with SBCL is limited so I'd need some pointers.

Jon Armond |
From: Paul K. <pv...@pv...> - 2013-02-25 18:49:10
|
Jonathan Armond wrote:
> Paul Khuong <pv...@pv...> writes:
>> We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at least during compilation. Now, we also like constant folding -- more generally, I dislike stuff that only exists as static information, without reflecting any runtime reality -- so if we want this (necessary for high-performance code[2]) optimisation to work, we now have to track FP/Int-ness at runtime.
[...]
> I'm not clear, though, on why we need to track FP/Int at runtime. Can we not just use static type declarations to generate appropriate move instructions for the domain, but default to movaps if no more information is available and accept the couple of clock cycles' cost?

Constant folding is good. But constants are runtime values, and we don't want to lose static information when replacing forms with a value. Otherwise, constant folding can become a direct pessimisation (more than it already can), and that's suboptimal. This would also come up when re-creating constants: should that all-0 vector be loaded with xorps or pxor?

> I'm not qualified to judge how to extend SBCL's type system, but having a generic SIMD type hierarchy would surely be useful, as it could provide the infrastructure for e.g. auto-vectorizing loops and reductions.

I'm not sure it'd help with autovectorisation. Anyway, my first instinct was to exploit the immutability of SIMD packs: there's no variance/covariance issue. Instead of forcing invariant pack types, we could actually have a nice (subtypep T U) <=> (subtypep '(packet T) '(packet U)). Sadly, we still have to map these to a bounded set of primitive (representation-oriented) types, much like specialised array types, and I'd expect generic operations to work in terms of primitive types at runtime.
More fundamentally, the subtyping relation makes little sense: (unsigned-byte 8) is a subtype of (unsigned-byte 16), but they're not necessarily compatible in terms of representation when packed in a SIMD register. So, instead, I'm considering a very fine set of SIMD types: float/unsigned-byte/signed-byte/boolean [not so sure about that last one, but might as well exploit the remaining .4 bit] of length {1, 2, 4, 8, 16, 32, 64, 128, 256, 512} bit. That's easily encoded as a bitset in 12 bits, so there's more than enough space in the header to represent the total size of the packet (64/128/256/512 bits) and we can perform a lot of type operations via bit-wise logic. I believe that's generic enough to directly expose to the users; SBCL's array type logic tracks declared as well as upgraded types, but I don't think that's necessary with so many SIMD types.

There's also the issue of which SIMD *class* to expose. I'm thinking a single, generic, pack class should suffice. It's not like I expect many people to directly combine generic function dispatch and SIMD processing.

> What is interesting/useful depends upon your viewpoint ;)
> I'd be happy to throw in some work on this, but my experience with SBCL is limited so I'd need some pointers.

I'll have to get back in there and fail for a while before providing useful pointers. I'll try to find some time soon, but I've been meaning to give it a go for a couple years now…

Paul Khuong |
From: Nikodemus S. <nik...@ra...> - 2013-02-26 06:26:07
|
On 25 February 2013 20:48, Paul Khuong <pv...@pv...> wrote:
>>> We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at

(Playing somewhat the devil's advocate there.)

Everyone else seems to do pretty well just exposing Intel's intrinsics as-is (which explicitly specify the type and leave it to the user to avoid ping-ponging). What do we actually win by trying to generalize?

Cheers,

-- Nikodemus |
From: Paul K. <pv...@pv...> - 2013-02-26 14:00:35
|
Nikodemus Siivola wrote:
> On 25 February 2013 20:48, Paul Khuong <pv...@pv...> wrote:
>>>> We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at
>
> (Playing somewhat the devil's advocate there.)
>
> Everyone else seems to do pretty well just exposing Intel's intrinsics as-is (which explicitly specify the type and leave it to the user to avoid ping-ponging). What do we actually win by trying to generalize?

Here's the missing paragraph-footnote:

>>>> [1] if only for the reg-reg moves we insert to shuffle values around.

This isn't about generalising anything. So, why don't we just expose integer-pack and float-pack types/classes and be done with it? Because that'd be baking a microarchitectural characteristic into our interface, and things will become ugly if we later need to distinguish between, e.g., single and double float values when moving them around. Also, because I don't think that'll let us share any code with other platforms, if they grow SIMD support.

Even the type-level interface exposed by C is architecture-generic (never mind microarchitecture). At the very least, I think we should be able to over-declare (e.g. pack of single-float instead of pack of floats) the float types and still get good code. I'd also like it to be possible to underdeclare (e.g. pack of *) during development. For even the first step to happen, we have to either be very clever, or expose the pack primitive types to the user. I'm not sure that exposing *micro*architecture-specific upgraded pack types is a good idea, hence my preference for an overly-fine set of upgraded pack types.

Paul Khuong |
From: Jonathan A. <jon...@gm...> - 2013-02-26 10:56:02
|
Nikodemus Siivola <nik...@ra...> writes:
> On 25 February 2013 20:48, Paul Khuong <pv...@pv...> wrote:
>>>> We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at
>
> (Playing somewhat the devil's advocate there.)
>
> Everyone else seems to do pretty well just exposing Intel's intrinsics as-is (which explicitly specify the type and leave it to the user to avoid ping-ponging). What do we actually win by trying to generalize?

A uniform interface for SIMD operations on different architectures? However, I'm suspicious about the benefit here because no doubt one has to have intimate knowledge of the instruction set to get optimal performance (the point of using SIMD anyway), which a generic interface would hide. |
From: Jonathan A. <jon...@gm...> - 2013-03-02 11:33:29
|
Paul Khuong <pv...@pv...> writes:
> Nikodemus Siivola wrote:
>> On 25 February 2013 20:48, Paul Khuong <pv...@pv...> wrote:
>>
>>>>> We really want[1] to avoid ping-ponging between the FP and Integer SIMD pipes if we can, so we must track whether a given value is FP or Int, at
>>
>> (Playing somewhat the devil's advocate there.)
>>
>> Everyone else seems to do pretty well just exposing Intel's intrinsics as-is (which explicitly specify the type and leave it to the user to avoid ping-ponging). What do we actually win by trying to generalize?
>
> Here's the missing paragraph-footnote:
>
>>>> [1] if only for the reg-reg moves we insert to shuffle values around.
>
> Even the type-level interface exposed by C is architecture-generic (never mind microarchitecture). At the very least, I think we should be able to over-declare (e.g. pack of single-float instead of pack of floats) the float types and still get good code. I'd also like it to be possible to underdeclare (e.g. pack of *) during development.

In C you have separate types __m128, __m128d, __m128i specific to Intel SSE, plus __m256 etc for AVX. How is that generic? Granted, it's all just an XMM register, but the compiler knows whether you are talking about packed floats/ints.

> For even the first step to happen, we have to either be very clever, or expose the pack primitive types to the user. I'm not sure that exposing *micro*architecture-specific upgraded pack types is a good idea, hence my preference for an overly-fine set of upgraded pack types.

I've been using an SBCL with :sb-sse-intrinsics based on yours and angavrilov's patches, and his cl-simd, to write an SSE version of a 3D math library (http://github.com/jarmond/cl-math3d). It seems to work reasonably well (and generates the right MOVs). I presume this is because values are always loaded with a typed intrinsic (or declared as such) and the TNs are tagged with the primitive type.
I am also in the process of adding the opcodes and intrinsics for SSE3 and up.

I'm not sure constant-folding is that useful in SIMD code (if I understand what you mean by it), since you will always be tuning it by hand anyway.

On something of a tangent: is there some infrastructure within SBCL for specifying capabilities of the target CPU when compiling (user code, not SBCL itself, that is)? I'm thinking that some SIMD instructions could be useful in optimizing e.g. string searches or loops, on machines where they are supported.

Jon |
From: Christophe R. <cs...@ca...> - 2013-03-02 16:33:20
|
Jonathan Armond <jon...@gm...> writes: > On something of a tangent: is there some infrastructure within SBCL for > specifying capabilities of the target CPU when compiling (user code, not > SBCL itself, that is)? I'm thinking that some SIMD instructions could be > useful in optimizing e.g. string searches or loops, on machines where > they are supported. It's a little klunky and not widely used, but the *backend-subfeatures* variable is checked in guard clauses for vops, so that in principle one could write alternative implementations for particular operations. Cheers, Christophe |
From: Paul K. <pv...@pv...> - 2013-03-02 18:11:51
|
Jonathan Armond wrote:
> Paul Khuong <pv...@pv...> writes:
>> Even the type-level interface exposed by C is architecture-generic (never mind microarchitecture). At the very least, I think we should be able to over-declare (e.g. pack of single-float instead of pack of floats) the float types and still get good code. I'd also like it to be possible to underdeclare (e.g. pack of *) during development.
>
> In C you have separate types __m128, __m128d, __m128i specific to Intel SSE, plus __m256 etc for AVX. How is that generic?

I was thinking of the GCC interface. Even Intel's intrinsic types are microarchitecture-generic: we currently don't have to track whether an XMM register holds single or double float values.

It seems I'm the only one who likes being able to just use untyped XMM values when prototyping, so I think we can make that work quickly. It only needs three named and primitive types: packs of single, double, or integer. The Lisp type parser can then only work with recognizable subtypes of single-float, double-float or integer.

> I'm not sure constant-folding is that useful in SIMD code (if I understand what you mean by it), since you will always be tuning it by hand anyway.

I'm not sure that it's useful either, but it certainly happens. I'd rather minimise the odds that such unintended rewrites be pessimisations.

> On something of a tangent: is there some infrastructure within SBCL for specifying capabilities of the target CPU when compiling (user code, not SBCL itself, that is)? I'm thinking that some SIMD instructions could be useful in optimizing e.g. string searches or loops, on machines where they are supported.

We can use backend-subfeatures, but that's purely compile-time. For runtime dispatch, we can do something like lazy dynamic linkage resolution with (setf fdefinition). An entry in *save-hooks* can reset the definition to a dispatch function before a new core is saved.

Paul Khuong |
From: Paul K. <pv...@pv...> - 2013-03-04 05:55:44
|
Paul Khuong wrote:
> It seems I'm the only one who likes being able to just use untyped XMM values when prototyping, so I think we can make that work quickly. It only needs three named and primitive types: packs of single, double, or integer. The Lisp type parser can then only work with recognizable subtypes of single-float, double-float or integer.

I think there's a decent implementation up at https://github.com/pkhuong/sbcl/commits/simd-pack-march-2013 . Reviews, comments, reports from attempts at rebasing some subset of an intrinsics library, etc. would be good.

I'm also not sure about the packaging stuff. There's a patch to export some stuff from SB-EXT, but only one platform implements the SIMD stuff. I don't want libraries to use packages like SB-KERNEL directly either, though, so would a fresh package be best?

It works with a small architecture-specific set of SIMD element types. For SSE, that's SINGLE-FLOAT, DOUBLE-FLOAT, or INTEGER. SIMD-PACK types are either (SIMD-PACK [type]), where [type] is a subtype of {SINGLE-FLOAT, DOUBLE-FLOAT, INTEGER}, or * (the default), in which case it's the union of all these possibilities. Like array upgrading, different (micro)architectures will lead to different SIMD-PACK "upgraded" types, and thus potentially different code. That's expected, but we can still work with plain SIMD-PACK if we're primarily worried about correctness/development speed. The set of all (specialised) SIMD-PACK types is the SIMD-PACK class, if someone wants to work with CLOS as well.

A heap-allocated SIMD-PACK has a TAG field, which lets us tell what the element-type is. This is useful when constant-folding happens, but also to print packs nicely (reading FP values in hex was suboptimal). Unknown element-types are downgraded at compile-time to the first applicable from integer, single-float, double-float.
I.e., integer SIMD packs are the default type, unless the element-type is known not to be integer, in which case it's single-float, and, if not, double-float.

There are still separate primitive types, but now separate storage classes as well, for each PACK type (single, double, integer), so we can choose whether to use movaps or movdqa decently, without looking at TNs' primtypes while emitting code.

There's one peculiar thing: a SIMD-PACK type check only verifies SIMD-PACKness, without considering the element type. Intrinsics can simply accept (SIMD-PACK *), but return correctly-typed ones. Argument types could be declared more strictly, but the compiler might then get confused into emitting bad code… even if that doesn't happen, the user experience of getting correct code, but only when underdeclaring types, is probably not ideal.

If we ever grow AVX support, the objdef will be slightly more interesting (variable-length, most likely), but I don't expect any major issue. Better, Altivec seems doable without rewriting much.

I only committed the bare minimum that should be in the compiler itself. The rest (except for instruction definitions) can go in a contrib or a library... I also haven't really tested the infrastructure yet with intrinsics, so some testing would be good, e.g. to make sure that sane moves are emitted consistently. Importing new/fixed instruction definitions from angavrilov's branch and from Jonathan Armond's latest post would also have to happen in the compiler. More cogent documentation in the manual would be awesome as well (:

It's unlikely that I'll find comparable SBCL-time for the next few months, but if no one complains, I'll commit this sometime this week.

Paul Khuong |
From: Jonathan A. <jon...@gm...> - 2013-03-04 08:21:34
|
Paul Khuong <pv...@pv...> writes:
> Paul Khuong wrote:
> I think there's a decent implementation up at https://github.com/pkhuong/sbcl/commits/simd-pack-march-2013 . Reviews, comments, reports from attempts at rebasing some subset of an intrinsics library, etc. would be good.

Great, I'll let you know if I run into anything strange.

> I only committed the bare minimum that should be in the compiler itself. The rest (except for instruction definitions) can go in a contrib or a library... I also haven't really tested the infrastructure yet with intrinsics, so some testing would be good, e.g. to make sure that sane moves are emitted consistently.

I'm going to take a stab at converting cl-simd (proposed name sb-simd) into something suitable for an SBCL contrib, based on the new simd-pack type. Unless there are any objections?

> Importing new/fixed instruction definitions from angavrilov's branch and from Jonathan Armond's latest post would also have to happen in the compiler. More cogent documentation in the manual would be awesome as well (:

I'm almost done with SSE4.x instructions too, for completeness.

- Jon |