From: Christophe R. <cs...@ca...> - 2001-10-12 09:21:48
Dear all,

I can report an initial success at implementing vectorization in SBCL
at the VOP level. The attached patch tells SBCL about the existence of
SSE (Streaming SIMD Extensions) registers, which are 128 bits wide, and
a couple of (special-cased) instructions; the lisp files can then be
compiled and loaded to exploit them.

Note that I don't really suggest that this patch go in at this stage;
certainly, I'd like it to be looked at by other people on this list and
preferably also some people on cmucl-imp, as they know much more about
the compiler than I do, and might be able to tell me that I've done
something in a wrong or inelegant way, as well as (possibly) contribute
other instructions (I hate architecture manuals).

The experimentation focuses around the addition of two fixnum vectors
into a result vector. Testing the sse2.lisp file shows that the
ordinary x86 assembly version (the VECTOR+/SIMPLE-ARRAY-SIGNED-BYTE-30
VOP) is four to five times faster than the compiled lisp code (from
the BAR function); pushing :sse2 onto sb-c::*backend-subfeatures* and
recompiling sse2.lisp yields a further 20% speedup.

However, I have reason to believe that a further dramatic speedup can
be obtained with SSE in this area, as I am currently moving data into
the SSE registers with the movdqu instruction, which caters for
unaligned data. If vector data could be aligned at 16-byte addresses,
then the faster movdqa instruction could be used, which should give a
further increase in speed...

This all presupposes some way of convincing the compiler to emit these
nice vectorized VOPs for normal code. I've had several thoughts on the
issue, none of which are totally satisfactory, but I'm sure that we
can come up with something plausible to put in the sb-ext package...

I don't know if this is of any use to anyone, but I'm having fun.

Cheers,

Christophe
--
Jesus College, Cambridge, CB5 8BL                    +44 1223 510 299
http://www-jcsu.jesus.cam.ac.uk/~csr21/
(defun pling-dollar (str schar arg) (first (last +)))
(make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)
From: William H. N. <wil...@ai...> - 2001-10-12 16:00:18
On Fri, Oct 12, 2001 at 10:21:25AM +0100, Christophe Rhodes wrote:
> This all presupposes some way of convincing the compiler to emit these
> nice vectorized VOPs for normal code. I've had several thoughts on the
> issue, none of which are totally satisfactory, but I'm sure that we
> can come up with something plausible to put in the sb-ext package...
>
> I don't know if this is of any use to anyone, but I'm having fun.

If you can make the compiler smart enough to do what high-performance
compilers for other languages (especially Fortran) do, automatically
compiling ordinary portable standard code into vector constructs when
appropriate, that could be an impressive addition to the system. But
for practical use it'd still have to compete with FFI bindings to the
existing Fortran and C libraries for doing this kind of thing, so it's
possible that I'd still consider its complexity-to-benefit ratio high
enough that I'd reject it. Probably it would depend on whether it
seemed to make the architecture of the compiler cleaner, more
flexible, and/or more expressive, or more kludgy, more twisted, and
more baroque.

If vectorization only happens when people use nonportable code or
nonportable hints (e.g. the SB-EXT stuff you mentioned), then it seems
more like an interesting experiment than something that many other
programmers are likely to use, especially since it competes with FFI
bindings to existing libraries. In this case, I'm likely not to put it
into the main system, which'd probably be a cruel blow to software
structured as compiler patches. (Though of course it doesn't matter if
this is intended as a temporary experiment instead of a piece of
software that many people will use for a long time.)

--
William Harold Newman <wil...@ai...>
"Those who study history are doomed to watch others repeat it."
  - Susan E. Cohen
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C
From: Christophe R. <cs...@ca...> - 2001-10-12 16:42:11
On Fri, Oct 12, 2001 at 10:38:13AM -0500, William Harold Newman wrote:
> On Fri, Oct 12, 2001 at 10:21:25AM +0100, Christophe Rhodes wrote:
> > This all presupposes some way of convincing the compiler to emit these
> > nice vectorized VOPs for normal code. I've had several thoughts on the
> > issue, none of which are totally satisfactory, but I'm sure that we
> > can come up with something plausible to put in the sb-ext package...
> >
> > I don't know if this is of any use to anyone, but I'm having fun.
>
> If you can make the compiler smart enough to do what high-performance
> compilers for other languages (especially Fortran) do, automatically
> compiling ordinary portable standard code into vector constructs when
> appropriate, that could be an impressive addition to the system. But
> for practical use it'd still have to compete with FFI bindings to the
> existing Fortran and C libraries for doing this kind of thing, so it's
> possible that I'd still consider its complexity-to-benefit ratio high
> enough that I'd reject it.

That's fair enough :)

> Probably it would depend on whether it
> seemed to make the architecture of the compiler cleaner, more
> flexible, and/or more expressive, or more kludgy, more twisted, and
> more baroque.

Heh.

> If vectorization only happens when people use nonportable code or
> nonportable hints (e.g. the SB-EXT stuff you mentioned), then it seems
> more like an interesting experiment than something that many other
> programmers are likely to use, especially since it competes with FFI
> bindings to existing libraries. In this case, I'm likely not to put it
> into the main system, which'd probably be a cruel blow to software
> structured as compiler patches.

Well, there are actually two different things here. One is the patch
to the compiler (which I'm sure is not 100% correct at the moment) to
allow it to emit SSE instructions on x86. This I hope is relatively
uncontroversial as an aim -- even if nothing in the compiler itself
uses these more advanced instructions. This involves making a new
storage class, and so on and so forth, and defining the instructions.

The other half is the vectorizing stuff (the two files, rather than
the patch) which I agree is slightly more experimental. After thinking
about it some more, I have hopes that I will be able to vectorize at
least some simple portable code (map-into and variants) with simple
deftransforms, though how much is as yet unclear. However, I recognize
that it needs to be much more polished and carefully designed than it
is at present :)

Cheers,

Christophe
From: Raymond T. <to...@rt...> - 2001-10-12 16:57:55
>>>>> "Christophe" == Christophe Rhodes <cs...@ca...> writes:

    Christophe> The experimentation focuses around the addition of two fixnum vectors
    Christophe> into a result vector. Testing the sse2.lisp file shows that the
    Christophe> ordinary x86 assembly version (the VECTOR+/SIMPLE-ARRAY-SIGNED-BYTE-30
    Christophe> VOP) is four to five times faster than the compiled lisp code (from
    Christophe> the BAR function); pushing :sse2 onto sb-c::*backend-subfeatures* and
    Christophe> recompiling sse2.lisp yields a further 20% speedup.

Did you check the generated assembly code of the bar function? I know
that CMUCL has no clue on how to optimize these simple types of loops.
It keeps loading the index and the array pointer over and over, and
adding the index to the array pointer again and again.

If we made the compiler smarter about loops like this so that it would
just keep a pointer in a register and bump the pointer as it went, the
results would be much better. I don't know how close it would come to
SSE, though. This would be a decent test: write a vop that sums two
fixnum arrays and compare this with your SSE version.

Note also that I think because you used loop across, the code checks
for the end of x and y on every iteration. You might get a better
comparison by using something like

    (dotimes (i 1000)
      (setf (aref result i) (+ (aref x i) (aref y i))))

Ray
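The comparison Ray proposes can be sketched in portable Common Lisp as below. This is only a rough illustration: the BAR function from sse2.lisp is not shown in the thread, so ADD-ACROSS is a guess at its general shape, and the names FIXNUM-VECTOR, ADD-ACROSS and ADD-DOTIMES are hypothetical. The second function takes the iteration count from RESULT instead of hard-coding 1000.

```lisp
;; Hypothetical names throughout; none of this comes from sse2.lisp.
(deftype fixnum-vector () '(simple-array fixnum (*)))

(defun add-across (x y result)
  "Element-wise sum written with LOOP ... ACROSS.
Each ACROSS clause carries its own end test, so both X and Y
are checked for exhaustion on every iteration."
  (declare (type fixnum-vector x y result))
  (loop for a across x
        for b across y
        for i from 0
        do (setf (aref result i) (+ a b)))
  result)

(defun add-dotimes (x y result)
  "Element-wise sum written with DOTIMES.
There is a single counter and a single end test per iteration."
  (declare (type fixnum-vector x y result))
  (dotimes (i (length result) result)
    (setf (aref result i) (+ (aref x i) (aref y i)))))
```

Both functions assume the three vectors have the same length and that each sum still fits in a fixnum. Wrapping calls to each in TIME on large vectors would give the kind of comparison discussed here.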
From: Christophe R. <cs...@ca...> - 2001-10-23 00:08:39
On Fri, Oct 12, 2001 at 12:57:43PM -0400, Raymond Toy wrote:
> >>>>> "Christophe" == Christophe Rhodes <cs...@ca...> writes:
>
> > The experimentation focuses around the addition of two fixnum vectors
> > into a result vector. Testing the sse2.lisp file shows that the
> > ordinary x86 assembly version (the VECTOR+/SIMPLE-ARRAY-SIGNED-BYTE-30
> > VOP) is four to five times faster than the compiled lisp code (from
> > the BAR function); pushing :sse2 onto sb-c::*backend-subfeatures* and
> > recompiling sse2.lisp yields a further 20% speedup.
>
> Note also that I think because you used loop across, the code checks
> for the end of x and y on every iteration. You might get a better
> comparison by using something like
>
> (dotimes (i 1000)
>   (setf (aref result i) (+ (aref x i) (aref y i))))

Good point.

I'm just back from overseas, so I haven't had the chance to do
extensive testing, but it would seem that this dotimes is essentially
as good as my (amateurish) hand-written assembly VOP; the unaligned
SSE is still 10-20% faster than this, with the possibility of
improvement should the allocator and garbage-collector be taught about
aligning vector data at 16-byte boundaries.

Cheers,

Christophe
From: Raymond T. <to...@rt...> - 2001-10-23 13:23:04
>>>>> "Christophe" == Christophe Rhodes <cs...@ca...> writes:

    Christophe> I'm just back from overseas, so I haven't had the chance to do
    Christophe> extensive testing, but it would seem that this dotimes is essentially
    Christophe> as good as my (amateurish) hand-written assembly VOP; the unaligned
    Christophe> SSE is still 10-20% faster than this, with the possibility of
    Christophe> improvement should the allocator and garbage-collector be taught about
    Christophe> aligning vector data at 16-byte boundaries.

Objects are already allocated on 16-byte boundaries, so getting the
allocator for specialized vectors to allocate the space on a 16-byte
boundary should be straightforward: Just add a junk word between the
header word and the actual data.

Ray
From: Christophe R. <cs...@ca...> - 2001-10-23 13:43:38
On Tue, Oct 23, 2001 at 09:22:59AM -0400, Raymond Toy wrote:
> >>>>> "Christophe" == Christophe Rhodes <cs...@ca...> writes:
>
> > I'm just back from overseas, so I haven't had the chance to do
> > extensive testing, but it would seem that this dotimes is essentially
> > as good as my (amateurish) hand-written assembly VOP; the unaligned
> > SSE is still 10-20% faster than this, with the possibility of
> > improvement should the allocator and garbage-collector be taught about
> > aligning vector data at 16-byte boundaries.
>
> Objects are already allocated on 16-byte boundaries, so getting the
> allocator for specialized vectors to allocate the space on a 16-byte
> boundary should be straightforward: Just add a junk word between the
> header word and the actual data.

Are they really? I couldn't see that...

Using

    (logand (sb-kernel:get-lisp-obj-address #(1 2 3)) 8)

gets me a fairly random sequence of 0 and 8 when executed multiple
times, but it's perfectly possible that that's not doing what I think
it's doing...

Cheers,

Christophe
From: Raymond T. <to...@rt...> - 2001-10-23 14:09:44
>>>>> "Christophe" == Christophe Rhodes <cs...@ca...> writes:

    Christophe> Are they really? I couldn't see that...

    Christophe> Using

    Christophe> (logand (sb-kernel:get-lisp-obj-address #(1 2 3)) 8)

    Christophe> gets me a fairly random sequence of 0 and 8 when executed multiple
    Christophe> times, but it's perfectly possible that that's not doing what I think
    Christophe> it's doing...

Sorry! You're right. 3 tag bits means 8-byte boundaries. In this
case, I'm not sure how to get the allocator to work on 16-byte
boundaries. Maybe use 4 tag bits? :-)

Ray
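The arithmetic behind this exchange can be sketched with a few plain integer helpers. This is a minimal illustration, not SBCL internals: the function names are hypothetical, and the 8-byte data offset used in the usage note (a 4-byte header word plus a 4-byte length word) is an assumption about the vector layout rather than something stated in the thread.

```lisp
;; Hypothetical helpers working on plain integers, not on real Lisp objects.

(defun alignment-from-tag-bits (tag-bits)
  "Byte alignment that must hold if the low TAG-BITS bits of every
object address are reserved for tags."
  (ash 1 tag-bits))

(defun padding-to-boundary (address boundary)
  "Number of bytes to add to ADDRESS to reach the next multiple of BOUNDARY."
  (mod (- address) boundary))

(defun data-aligned-p (object-address data-offset boundary)
  "True if the data starting DATA-OFFSET bytes into the object
falls on a multiple of BOUNDARY."
  (zerop (mod (+ object-address data-offset) boundary)))
```

With 3 tag bits, (alignment-from-tag-bits 3) gives 8, so an object can start at an address that is either 0 or 8 modulo 16, which matches the mix of 0 and 8 Christophe observed. Under the assumed 8-byte data offset, (data-aligned-p #x1000 8 16) is false while (data-aligned-p #x1008 8 16) is true, so a fixed junk word alone cannot guarantee 16-byte-aligned data; (padding-to-boundary 24 16) shows that the padding needed depends on where the object lands.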