From: Bill H. <goo...@go...> - 2011-01-21 01:43:52
|
Hi all, I've been fiddling with SBCL recently and timing various things to
see how fast it is. Obviously it is pretty impressive, but I've found some
unusual performance issues and wanted to ask on the list whether I was
doing something wrong, or if there's a specific reason these issues exist.
Please forgive my lousy Lisp code, I only started writing Common Lisp
recently.

Here are some simple definitions I start with:

(defmacro add-mod64 (x y)
  ;; simulate addition of 64 bit signed integers (the compiler seems to
  ;; know to optimise this, so long as it is a macro and not a function)
  `(ldb (byte 64 0) (+ ,x ,y)))

(defmacro sub-mod64 (x y)
  ;; simulate subtraction of 64 bit signed integers (the compiler seems to
  ;; know to optimise this, so long as it is a macro and not a function)
  `(ldb (byte 64 0) (- ,x ,y)))

(defmacro add-gen (x y)
  ;; define a generic addition which dispatches on type (the compiler
  ;; seems to know to optimise this away, so long as it is a macro and
  ;; not a function)
  `(if (typep ,x '(signed-byte 64))
       (add-mod64 ,x ,y)
       (+ ,x ,y)))

(defmacro sub-gen (x y)
  ;; define a generic subtraction which dispatches on type (the compiler
  ;; seems to know to optimise this away, so long as it is a macro and
  ;; not a function)
  `(if (typep ,x '(signed-byte 64))
       (sub-mod64 ,x ,y)
       (- ,x ,y)))

Now here is the function I'm timing:

(defun sumit (n)
  ;; a function which sets up a simple loop which does an addition
  ;; inside the loop
  (declare (optimize (speed 3) (safety 0)))
  (let ((j 1))
    (declare (type (signed-byte 64) n j))
    (dotimes (i n j)
      (setq j (add-gen j i)))))

Here's my "driver". I just iterate calling the above function a bunch of
times. This is the function I'm timing. It's only a made up example to
time SBCL; it's not meant to do anything interesting.

(let ((j 0))
  (declare (type (signed-byte 64) j))
  (dotimes (k 100000 j)
    (setq j (sumit 100000))))

Now that takes about 40s or thereabouts on my machine (x86_64 AMD).

OK, now here is the odd thing. Let's replace the definition of sumit with
the following:

(defun sumit (n)
  ;; a function which sets up a simple loop which does an addition
  ;; inside the loop
  (declare (optimize (speed 3) (safety 0)))
  (let ((j 1))
    (declare (type (signed-byte 64) n j))
    (dotimes (i n (sub-gen j 1)) ; <----- this is the line I've changed,
                                 ;        with j becoming (sub-gen j 1)
      (setq j (add-gen j i)))))

I time it using my driver function as before. However, this time it takes
8s, i.e. it's 5 times faster!!

I've tried replacing j with (the (signed-byte 64) j) and also with
(ldb (byte 64 0) j), but neither of these improves the time substantially.
But subtracting that 1 from the return value works. Oddly, subtracting 0
from the return value does NOT speed it up!

So what gives?

Also, I've noticed that a function based on "do" or "loop" takes twice as
long as one based on "dotimes".

If I unroll the loops manually, I can get some improvement in these cases,
which seems to imply a missed opportunity for optimisation in the
compiler. A "for loop" similar to the above would be unrolled and made
about 6 times faster in C. Unrolling manually here in SBCL doesn't give
that much of an improvement though, perhaps 2 times faster.

Here is an example of the sumit function defined in terms of "loop"
instead of "dotimes" (again I have to subtract 1 on return from the
function to make it "fast"):

(defun sumit (n)
  ;; a function which sets up a simple loop which does an addition
  ;; inside the loop
  (declare (optimize (speed 3) (safety 0)))
  (let ((j 1))
    (declare (type (signed-byte 64) n j))
    (let ((i 0))
      (declare (type (signed-byte 64) i))
      (loop
        (setq j (add-gen j i))
        (setq i (add-gen i 1))
        (when (not (< i n)) (return))))
    (sub-gen j 1)))

Am I doing something wrong? Or are these behaviours expected? Or are there
missed opportunities for optimisation in the SBCL compiler?

Bill. |
From: Stas B. <sta...@gm...> - 2011-01-21 02:31:29
|
Bill Hart <goo...@go...> writes:
> Hi all, I've been fiddling with SBCL recently and timing various things
> to see how fast it is. [...]
> Am I doing something wrong? Or are these behaviours expected? Or are
> there missed opportunities for optimisation in the SBCL compiler?

First, all your macros don't make any sense and should be just replaced
with + and -. Second, you declare the N variable in the wrong place. So,
we get the following definitions:

(defun sumit (n)
  (declare (optimize speed (safety 0))
           (type (signed-byte 64) n))
  (let ((j 1))
    (declare (type (signed-byte 64) j))
    (dotimes (i n (1- j))
      (setq j (+ j i)))))

(defun sumit-2 (n)
  (declare (optimize speed (safety 0))
           (type (signed-byte 64) n))
  (let ((j 1))
    (declare (type (signed-byte 64) j))
    (dotimes (i n j)
      (setq j (+ j i)))))

The latter checks for bignum allocation every iteration, while the former
checks only at the end. Why that is, I don't know.

-- 
With Best Regards, Stas. |
From: Paul K. <pv...@pv...> - 2011-01-21 08:33:18
|
On 2011-01-20, at 9:31 PM, Stas Boukarev wrote:
> Bill Hart <goo...@go...> writes:
>> [sumit definitions, driver, and timings elided]
[...]
> The latter checks for bignum allocation every iteration, while the
> former checks only at the end. Why that is, I don't know.

DOTIMES expands into DO; you can look at the macroexpansion to see what
SBCL does differently than you.

As for the main question, this is an artefact of the way we perform
analyses pretty much throughout compilation (in this case, representation
selection). SBCL works on variables, instead of values (or variables at
program points), which would be similar to what SSA- or CPS-based
analyses do. This means that the way the variable j is represented must
be constant throughout each function. In SUMIT, j can trivially take the
cheapest representation for arithmetic (an unboxed number), while the
result of (1- j) may end up being boxed. In SUMIT-2, on the other hand,
j is also returned directly (as a boxed value); the representation
assignment pass then gives a boxed representation to j.

You can look at the disassembly to see that SUMIT-2 has moved the
conditional boxing *inside* the loop.

To pinpoint the effect further, you can play with the following:

(in-package "SB-VM")

(defknown sb64-identity ((signed-byte 64)) (signed-byte 64)
    (flushable always-translatable))

(define-vop (sb64-identity)
  (:translate sb64-identity)
  (:policy :fast-safe)
  (:args (x :scs (signed-reg signed-stack) :target r))
  (:arg-types signed-num)
  (:results (r :scs (signed-reg)))
  (:result-types signed-num)
  (:generator 1
    (move r x)))

(defun sb64-identity (x)
  (declare (type (signed-byte 64) x))
  (sb64-identity x))

Then,

(defun sumit-2 (n)
  (declare (optimize speed (safety 0))
           (type (signed-byte 64) n))
  (let ((j 1))
    (declare (type (signed-byte 64) j))
    (dotimes (i n (sb-vm::sb64-identity j))
      (setq j (+ j i)))))

works as well as sumit: only the result of sb64-identity gets boxed,
outside the loop, and, since we inserted this dummy copy node, j itself
can take another representation. I'm starting to think we could do
something like that automatically (and slowly move toward SSA-style
analyses and even representations) in IR2.

Paul Khuong |
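[Editorial sketch, not part of the original thread: Paul's point about representation selection can be pictured in C. The names and the malloc-based "box" are the editor's toy model of SBCL's boxed (heap-allocated bignum-capable) values, not SBCL internals; the two functions mirror the code shapes he describes for SUMIT and SUMIT-2.]

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy model of a boxed value: heap-allocated, standing in for SBCL's
   fixnum/bignum-capable representation. */
typedef struct { int64_t v; } box;

static box *box_it(int64_t v)
{
    box *b = malloc(sizeof *b);
    b->v = v;
    return b;
}

/* Shape of SUMIT: j keeps the cheap unboxed representation (a raw
   machine word) through the loop; boxing happens once, on the way out,
   when the result of (1- j) is produced. */
static box *sumit_shape(int64_t n)
{
    int64_t j = 1;
    for (int64_t i = 0; i < n; i++)
        j += i;
    return box_it(j - 1);            /* one allocation, after the loop */
}

/* Shape of SUMIT-2: because j itself is returned, j's representation
   must be boxed throughout the function, so the (conditional) boxing
   moves inside the loop. */
static box *sumit2_shape(int64_t n)
{
    box *j = box_it(1);
    for (int64_t i = 0; i < n; i++) {
        box *next = box_it(j->v + i);   /* per-iteration allocation */
        free(j);
        j = next;
    }
    return j;
}
```

Paul's sb64-identity VOP works like inserting `box_it` at the very end: it gives the compiler a copy node whose result is the only thing that needs the boxed representation.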
From: Bill H. <goo...@go...> - 2011-01-21 12:07:26
|
On 21 January 2011 08:33, Paul Khuong <pv...@pv...> wrote:
> On 2011-01-20, at 9:31 PM, Stas Boukarev wrote:
>> [quoted definitions and timings elided]
[...]
> As for the main question, this is an artefact of the way we perform
> analyses pretty much throughout compilation (in this case,
> representation selection). [...] In SUMIT-2, on the other hand, j is
> also returned directly (as a boxed value); the representation
> assignment pass then gives a boxed representation to j.
[...]
> works as well as sumit: only the result of sb64-identity gets boxed,
> outside the loop, and, since we inserted this dummy copy node, j
> itself can take another representation. [...]
>
> Paul Khuong

Thank you for the answer. I see that solution provides a workaround for
having to subtract one. A dummy copy certainly seems more manageable.
It's certainly not a solution I was going to think of. (The function
sb64-identity looks like it will just call itself indefinitely.)

I didn't completely understand why returning (the (signed-byte 64) j)
didn't work. But I can accept that these sorts of things occur as the
result of the style of analysis that is currently used.

By the way, is the SB-VM package documented anywhere? Given that I'm
essentially using SBCL as a VM backend for an interpreter, is it worth
delving a bit deeper and using the VM directly? It sounds like it gets
modified a fair bit and may be up for a rewrite from what you were
saying. So I guess it is discouraged to build stuff against the VM?

Bill. |
From: Bill H. <goo...@go...> - 2011-01-21 11:34:27
|
On 21 January 2011 02:31, Stas Boukarev <sta...@gm...> wrote:
> Bill Hart <goo...@go...> writes:
>> [original post elided]
>
> First, all your macros don't make any sense and should be just
> replaced with + and -.

Unfortunately it will just crash if I do that, as the (signed-byte 64)
values will eventually overflow (not in this example, but in other
examples). Doing the arithmetic mod 2^64 is necessary to prevent
overflow!

> Second, you declare the N variable in the wrong place.

Ok, thank you.

> So, we get the following definitions:
> [sumit and sumit-2 elided]
>
> The latter checks for bignum allocation every iteration, while the
> former checks only at the end. Why that is, I don't know.
>
> --
> With Best Regards, Stas.

Bill. |
From: Stas B. <sta...@gm...> - 2011-01-21 12:05:44
|
Bill Hart <goo...@go...> writes:
> On 21 January 2011 02:31, Stas Boukarev <sta...@gm...> wrote:
>> [original post and first reply elided]
>>
>> First, all your macros don't make any sense and should be just
>> replaced with + and -.
>
> Unfortunately it will just crash if I do that, as the (signed-byte 64)
> values will eventually overflow (not in this example, but in other
> examples). Doing the arithmetic mod 2^64 is necessary to prevent
> overflow!

And that's why you shouldn't use SAFETY 0.

-- 
With Best Regards, Stas. |
From: Bill H. <goo...@go...> - 2011-01-21 12:17:36
|
When I remove (safety 0) or replace it with any other safety level, it is no longer fast. Presumably it spends all its time checking the types and for overflows. Besides, the functionality I actually want here is addition and subtraction modulo 2^64. I am trying to simulate machine words! I probably should have explained that in my original post. Perhaps I am missing something, but I haven't been able to see how to get the functionality/performance I am after without (safety 0). Bill. On 21 January 2011 12:05, Stas Boukarev <sta...@gm...> wrote: > Bill Hart <goo...@go...> writes: > >> On 21 January 2011 02:31, Stas Boukarev <sta...@gm...> wrote: >>> Bill Hart <goo...@go...> writes: >>> >>>> Hi all, >>>> >>>> I've been fiddling with SBCL recently and timing various things to see >>>> how fast it is. Obviously it is pretty impressive, but I've found some >>>> unusual performance issues and wanted to ask on the list whether I was >>>> doing something wrong, or if there's a specific reason these issues >>>> exist. Please forgive my lousy Lisp code, I only started writing >>>> Common Lisp recently. 
>>>> >>>> Here are some simple definitions I start with: >>>> >>>> (defmacro add-mod64 (x y) >>>> ;; simulate addition of 64 bit signed integers (the compiler seems to >>>> know to optimise this, so long as it is a macro and not a function) >>>> `(ldb (byte 64 0) (+ ,x ,y))) >>>> >>>> (defmacro sub-mod64 (x y) >>>> ;; simulate addition of 64 bit signed integers (the compiler seems to >>>> know to optimise this, so long as it is a macro and not a function) >>>> `(ldb (byte 64 0) (- ,x ,y))) >>>> >>>> (defmacro add-gen (x y) >>>> ;; define a generic addition which dispatches on type (the compiler >>>> seems to know to optimise this away, so long as it is a macro and not >>>> a function) >>>> `(if (typep ,x '(signed-byte 64)) >>>> (add-mod64 ,x ,y) >>>> (+ ,x ,y))) >>>> >>>> (defmacro sub-gen (x y) >>>> ;; define a generic subtraction which dispatches on type (the compiler >>>> seems to know to optimise this away, so long as it is a macro and not >>>> a function) >>>> `(if (typep ,x '(signed-byte 64)) >>>> (sub-mod64 ,x ,y) >>>> (- ,x ,y))) >>>> >>>> Now here is the function I'm timing: >>>> >>>> (defun sumit (n) >>>> ;; a function which sets up a simple loop which does an addition >>>> inside the loop >>>> (declare (optimize (speed 3) (safety 0))) >>>> (let ((j 1)) >>>> (declare (type (signed-byte 64) n j)) >>>> (dotimes (i n j) >>>> (setq j (add-gen j i))))) >>>> >>>> Here's my "driver". I just iterate calling the above function a bunch >>>> of times. This is the function I'm timing. It's only a made up example >>>> to time SBCL. It's not meant to do anything interesting. >>>> >>>> (let ((j 0)) >>>> (declare (type (signed-byte 64) j)) >>>> (dotimes (k 100000 j) >>>> (setq j (sumit 100000)))) >>>> >>>> Now that takes about 40s or thereabouts on my machine (x86_64 AMD). >>>> >>>> OK, now here is the odd thing. 
Let's replace the definition of sumit >>>> with the following: >>>> >>>> (defun sumit (n) >>>> ;; a function which sets up a simple loop which does an addition >>>> inside the loop >>>> (declare (optimize (speed 3) (safety 0))) >>>> (let ((j 1)) >>>> (declare (type (signed-byte 64) n j)) >>>> (dotimes (i n (sub-gen j 1)) ; <----- this is the line I've >>>> changed, with j becoming (sub-gen j 1) >>>> (setq j (add-gen j i))))) >>>> >>>> I time it using my driver function as before. However, this time it >>>> takes 8s, i.e. it's 5 times faster!! >>>> >>>> I've tried replacing j with (the (signed-byte 64) j) and also with >>>> (ldb (byte 64 0) j), but neither of these improves the time >>>> substantially. But subtracting that 1 from the return value works. >>>> Oddly, subtracting 0 from the return value does NOT speed it up! >>>> >>>> So what gives? >>>> >>>> Also, I've noticed that a function based on "do" or "loop" takes twice >>>> as long as one based on "dotimes". >>>> >>>> If I unroll the loops manually, I can get some improvement in these >>>> cases, which seems to imply a missing opportunity for optimisation in >>>> the compiler. A "for loop" similar to the above would be unrolled and >>>> made about 6 times faster in C. Unrolling manually here in SBCL >>>> doesn't give that much of an improvement though, perhaps 2 times >>>> faster. >>>> >>>> Here is an example of the sumit function defined in terms of "loop" >>>> instead of "dotimes" (again I have to subtract 1 on return from the >>>> function to make it "fast"): >>>> >>>> (defun sumit (n) >>>> ;; a function which sets up a simple loop which does an addition >>>> inside the loop >>>> (declare (optimize (speed 3) (safety 0))) >>>> (let ((j 1)) >>>> (declare (type (signed-byte 64) n j)) >>>> (let ((i 0)) >>>> (declare (type (signed-byte 64) i)) >>>> (loop >>>> (setq j (add-gen j i)) >>>> (setq i (add-gen i 1)) >>>> (when (not (< i n)) (return)))) >>>> (sub-gen j 1))) >>>> >>>> Am I doing something wrong? 
>>>> Or are these behaviours expected? Or are
>>>> there missed opportunities for optimisation in the SBCL compiler?
>>>
>>> First, all your macros don't make any sense and should be just
>>> replaced with + and -.
>>
>> Unfortunately it will just crash if I do that, as the (signed-byte 64)
>> values will eventually overflow (not in this example but in other
>> examples). Doing the arithmetic mod 2^64 is necessary to prevent
>> overflow!
>
> And that's why you shouldn't use SAFETY 0.
>
> --
> With Best Regards, Stas.
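[Editorial note: the hazard Stas alludes to can be sketched as follows. This is an illustrative example, not code from the thread; the function names are hypothetical.]

```lisp
;; With SAFETY 0 the FIXNUM declaration is trusted, not checked:
;; calling UNSAFE-INC with a bignum is undefined behaviour and may
;; silently return garbage.
(defun unsafe-inc (x)
  (declare (optimize (speed 3) (safety 0))
           (type fixnum x))
  (1+ x))

;; At default safety the same declaration is treated as an assertion:
;; a bignum argument signals a TYPE-ERROR instead of misbehaving.
(defun safe-inc (x)
  (declare (type fixnum x))
  (1+ x))
```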
From: Stas B. <sta...@gm...> - 2011-01-21 12:24:26
Bill Hart <goo...@go...> writes:

> When I remove (safety 0) or replace it with any other safety level, it
> is no longer fast. Presumably it spends all its time checking the
> types and for overflows.

Safety 0 may be fast, but it also may be silently incorrect.

> Besides, the functionality I actually want here is addition and
> subtraction modulo 2^64. I am trying to simulate machine words! I
> probably should have explained that in my original post.

Then why do you have LDB (byte 64 0) only if the first argument is 64 bit?

> Perhaps I am missing something, but I haven't been able to see how to
> get the functionality/performance I am after without (safety 0).

--
With Best Regards, Stas.
From: Bill H. <goo...@go...> - 2011-01-21 12:28:33
I don't understand the question.

Does this help:

http://jcsu.jesus.cam.ac.uk/~csr21/papers/modular/modular.pdf

Bill.

On 21 January 2011 12:24, Stas Boukarev <sta...@gm...> wrote:
> [...]
From: Stas B. <sta...@gm...> - 2011-01-21 12:33:41
Bill Hart <goo...@go...> writes:

> I don't understand the question.
>
> Does this help:

Your sub-gen and add-gen macros use modular arithmetic only if the first
argument is of type (signed-byte 64), why is that?

> http://jcsu.jesus.cam.ac.uk/~csr21/papers/modular/modular.pdf

Yes, this is a good paper.

> On 21 January 2011 12:24, Stas Boukarev <sta...@gm...> wrote:
> [...]

--
With Best Regards, Stas.
From: Bill H. <goo...@go...> - 2011-01-21 12:37:49
On 21 January 2011 12:33, Stas Boukarev <sta...@gm...> wrote:
> Bill Hart <goo...@go...> writes:
>
>> I don't understand the question.
>>
>> Does this help:
> Your sub-gen and add-gen macros use modular arithmetic only if the first
> argument is of type (signed-byte 64), why is that?

Oh, I understand the question now. In my application (which I didn't
explain well), addition can only be called if both types are the same.
So I only need to check one of the types here. Effectively, the fact
that the types are the same is checked elsewhere.

But you are right, it would have been more clear had I simply written
the example for you guys to check both types. Sorry about that!

> [...]
>
> --
> With Best Regards, Stas.
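[Editorial note: the both-arguments variant Bill describes might look like the sketch below. The name ADD-GEN2 is hypothetical; only ADD-GEN appears in the thread.]

```lisp
;; A sketch of ADD-GEN with both operands checked (an assumption, not
;; the code from the thread): modular addition is used only when both
;; arguments fit in 64 bits, otherwise ordinary bignum + is used.
(defmacro add-gen2 (x y)
  `(if (and (typep ,x '(signed-byte 64))
            (typep ,y '(signed-byte 64)))
       (ldb (byte 64 0) (+ ,x ,y))   ; two's-complement wraparound
       (+ ,x ,y)))                   ; exact bignum arithmetic
```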
From: Stas B. <sta...@gm...> - 2011-01-21 12:44:19
Bill Hart <goo...@go...> writes:

> On 21 January 2011 12:33, Stas Boukarev <sta...@gm...> wrote:
>> [...]
>
> Oh I understand the question now. In my application (which I didn't
> explain well), addition can only be called if both types are the same.
> So I only need to check one of the types here. Effectively the fact
> that the types are the same is checked elsewhere.
>
> But you are right, it would have been more clear had I simply written
> the example for you guys to check both types. Sorry about that!

Then, why do you have modular arithmetic only when both arguments are of
type (signed-byte 64), and do ordinary arithmetic for larger integers?

--
With Best Regards, Stas.
From: Bill H. <goo...@go...> - 2011-01-21 13:06:20
On 21 January 2011 12:44, Stas Boukarev <sta...@gm...> wrote:
> [...]
>
> Then, why do you have modular arithmetic only when both arguments are of
> type (signed-byte 64), and do ordinary arithmetic for larger integers?

In my interpreter (which is essentially statically typed) I have two
types which I call int64's and int's, the latter being bignums. The
interpreter rewrites the code in the source language into Lisp. In
order to deal with addition in a uniform way, I wrote this "generic"
macro. When both operands are ordinary integers the usual addition is
done, but when they are int64's, the two's complement addition is done.

Unfortunately I did a bad job of cutting the relevant bits of code
from my project. The generic macros are actually a red herring and not
relevant to my original question. Fortunately the SBCL compiler does
recognise that it can shortcut these macros and so there is no
performance loss. Obviously in the case where I actually want bignums
it will be slow. But there probably isn't any way to get faster
bignums (short of using GMP, which is LGPL'd -- fortunately I'm writing
a BSD licensed bignum library at the same time :-)).

Bill.
From: Bill H. <goo...@go...> - 2011-01-21 18:08:24
After fiddling with macroexpand I seem to have a reasonable handle on
how to write loops efficiently with do or even a tagbody instead of
just dotimes. So thanks for that advice!

I'm now fiddling with the following code:

(let ((k 0))
  (declare (type (signed-byte 64) k))
  (dotimes (i 100000)
    (dotimes (j 100000)
      (setq k (add-mod64 k 1)))))

I am surprised to find it runs about 16 times slower than if I put the
inner dotimes inside a function and call the function, as per my
earlier example.

Initially I assumed this was because only functions get the full
compiler treatment in SBCL. But now I'm not sure.

So I guess the question is why this would not be as fast as my earlier
example?

(By the way, I found a page on the compiler internals and I see IR2 is
not a rewrite of IR1 but is one of the intermediate representations
used by SBCL. Sorry for my earlier confusion about this.)

Bill.

On 21 January 2011 13:06, Bill Hart <goo...@go...> wrote:
> [...]
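[Editorial note: one way to probe Bill's hypothesis is to hoist the nested loop into a single compiled function. A minimal sketch under that assumption; the name SUM-NESTED is hypothetical, and ADD-MOD64 is expanded inline to keep the example self-contained.]

```lisp
;; The same nested loop as above, wrapped in a DEFUN so the whole
;; computation is compiled as one unit with the declarations visible.
(defun sum-nested ()
  (declare (optimize (speed 3) (safety 0)))
  (let ((k 0))
    (declare (type (signed-byte 64) k))
    (dotimes (i 100000 k)
      (dotimes (j 100000)
        ;; (add-mod64 k 1) expanded by hand:
        (setq k (ldb (byte 64 0) (+ k 1)))))))
```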
From: Bill H. <goo...@go...> - 2011-01-23 23:28:31
I have some more performance issues that I've noticed.

The first example is this function:

(defun mulit (n)
  (declare (optimize (speed 3) (safety 0)))
  (let ((d 1.23438979438d0))
    (declare (double-float d))
    (dotimes (i n d) (setf d (* d 1.0000000000023d0)))))

The equivalent C function seems to run about 5 times faster. Looking
at the disassembly in SBCL, it is certainly doing much more work.

A second issue is the following. If I run this program:

(defun sumit (n)
  (declare (optimize (speed 3) (safety 0)) (fixnum n))
  (let ((j 1))
    (declare (fixnum j))
    (dotimes (i n)
      (setq j (+ j i)))
    j))

then (sumit 1000000000) takes 1 second.

Now if I do the following:

(defvar j)

(defun sumit (n)
  (declare (optimize (speed 3) (safety 0)) (fixnum n))
  (let ((j 1))
    (declare (fixnum j))
    (dotimes (i n)
      (setq j (+ j i)))
    j))

Here sumit is precisely the same as the former definition.

Then (sumit 1000000000) takes 4.5 seconds.

The first j remains unbound, so I don't quite understand why defining
it should slow down subsequent code by so much.

Finally, here's a curiosity. A program which effectively computes
j * j + 23984729373291487298 * j runs about twice as slow in SBCL as C
(here the big constant is arbitrary). It seems that in this example C
would recognise that the answer is the same as
j * (j + 23984729373291487298), thus performing a single multiplication
and addition instead of two multiplications.

Bill.

On 21 January 2011 18:08, Bill Hart <goo...@go...> wrote:
> [...]
From: Paul K. <pv...@pv...> - 2011-01-24 00:34:14
On 2011-01-23, at 6:28 PM, Bill Hart wrote:

> I have some more performance issues that I've noticed.
>
> The first example is this function:
>
> (defun mulit (n)
>   (declare (optimize (speed 3) (safety 0)))
>   (let ((d 1.23438979438d0))
>     (declare (double-float d))
>     (dotimes (i n d) (setf d (* d 1.0000000000023d0)))))
>
> The equivalent C function seems to run about 5 times faster. Looking
> at the disassembly in SBCL it is certainly doing much more work.

It would certainly help to declare the type of N here (the optimization
notes should have helped you see that). Going through EXPT or pow would
probably be even faster.

> A second issue is the following. If I run this program:
> [...]
> then (sumit 1000000000) takes 1 second.
>
> Now if I do the following:
>
> (defvar j)
> [...]
> Then (sumit 1000000000) takes 4.5 seconds.
>
> The first j remains unbound, so I don't quite understand why defining
> it should slow down subsequent code by so much.

You might want to read the specification for defvar. There's a reason
for the *earmuffs*: after defvar, j is always bound as a dynamic
variable.

> Finally, here's a curiosity. A program which effectively computes
> j * j + 23984729373291487298 * j runs about twice as slow in SBCL as C
> [...]

I would expect most of the difference to come from the fact that CL
mandates exact integer arithmetic, while C only has to let things wrap
around (or, for signed types, can simply assume overflow doesn't
happen). As for the rest, SBCL doesn't perform much common
subexpression elimination, and not at all the sort of transformation
shown above.

I don't know what you're trying to do, but I can't believe that what
you've been doing the last couple of days is the most efficient course
of action to attain that goal. I suggest cutting to the chase and just
asking the questions you want answered.

Paul Khuong
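[Editorial note: Paul's first suggestion amounts to giving the compiler a machine-word type for N. A sketch under that assumption; the exact integer type chosen for N here is an illustration, not from the thread.]

```lisp
;; MULIT with the loop bound declared, so the DOTIMES counter can be
;; kept in a machine register instead of handled generically.
(defun mulit (n)
  (declare (optimize (speed 3) (safety 0))
           (type (unsigned-byte 62) n))   ; assumed bound for N
  (let ((d 1.23438979438d0))
    (declare (double-float d))
    (dotimes (i n d)
      (setf d (* d 1.0000000000023d0)))))
```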
From: Bill H. <goo...@go...> - 2011-01-24 02:31:02
On 24 January 2011 00:34, Paul Khuong <pv...@pv...> wrote:
> On 2011-01-23, at 6:28 PM, Bill Hart wrote:
>> [...]
> It would certainly help to declare the type of N here (the optimization
> notes should have helped you see that).

You are absolutely right. I don't know how I missed this. What is
puzzling is that the code I have been working on handles loops
containing integer arithmetic in precisely the same way as loops
containing floating point arithmetic. I only observed the slowdown in
the case of the doubles. I will have to pull apart my code to see why I
had different behaviour in the two cases.

Also, for some reason optimization notes seem to be off for me. Thus I
didn't get any warnings about this. I will investigate how to turn them
on.

> Going through EXPT or pow would probably be even faster.

Naturally. Here the idea was only to benchmark basic operations, in
this case loops containing floating point operations.

> You might want to read the specification for defvar. There's a reason
> for the *earmuffs*: after defvar, j is always bound as a dynamic
> parameter.

I need to read some better references. There was mention of using the
earmuffs. It was also mentioned that dynamic parameters might be
slower. But I didn't see any mention of interaction with (what I
thought were) lexically scoped variables of the same name.

> I would expect most of the difference to come from the fact that CL
> mandates integer arithmetic, while C only has to let things wrap around
> (or, for signed types, can simply assume overflow doesn't happen).

I was of course making use of the fast machine arithmetic that SBCL
gives access to, as mentioned in an earlier post. So the wraparound
behaviour is the same in both the Lisp and C code I was working with.

> As for the rest, SBCL doesn't perform much common subexpression
> elimination, and not at all the sort of transformation shown above.

OK. I don't know if it is considered undesirable for a Lisp to perform
such optimisations or not. I considered it a curiosity that C even
bothered with this sort of thing.

> I don't know what you're trying to do, but I can't believe that what
> you've been doing the last couple days is the most efficient course of
> action to attain that goal. I suggest cutting to the chase and just ask
> the questions you want answered.

It wasn't my intention to ask a pile of general Lisp questions here.
I'm certain the SBCL devs are very busy.

As I mentioned, I've been writing an interpreter on top of SBCL. Along
the way I've been benchmarking the results to see that I'm not missing
optimisation opportunities. As you can see, a few times I have been
very puzzled by the results. (This is not to say that I expect that
SBCL will be as fast as compiled C in every possible instance.)

Sadly, most of the instances of what I thought might be performance
inconsistencies in SBCL have turned out to be problems with my
understanding of SBCL or of Lisp itself. On the positive side, the
answers I've received have been extremely helpful, and my interpreter
is almost 500 lines of code and already doing not completely trivial
things.

For what it's worth, thanks for the helpful answers.

Bill.
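[Editorial note: the defvar interaction discussed above can be seen directly. Once a symbol is proclaimed globally special, every LET of it creates a dynamic rebinding rather than a lexical variable. A minimal sketch; the names *COUNTER* and BUMP are hypothetical.]

```lisp
;; After DEFVAR, *COUNTER* is globally special: any LET of it rebinds
;; it dynamically, and that binding is visible inside called functions.
(defvar *counter* 0)

(defun bump ()
  (incf *counter*))     ; reads/writes the current dynamic binding

(let ((*counter* 100))  ; dynamic rebinding, seen by BUMP
  (bump)
  *counter*)            ; => 101 inside the LET
;; After the LET exits, the global value of *counter* is 0 again.
```

This is why the undecorated (defvar j) silently turned the LET-bound j in sumit into a slower special-variable binding.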