From: <lutz.euler@fr...>  20060723 13:09:39
Attachments:
patchlogcount
Message as HTML

Hi, I noticed that currently the VOP for LOGCOUNT on x8664 is written such that it counts the 32bit halves of the 64bit word one after the other, each time using the same code as on x86 and adding the results. This looks quite inefficient. Besides changing it to work with 64bit operands the VOP can be improved using some other well known tricks, most of which also apply to x86. So I would like to propose the attached patch that makes LOGCOUNT run 2.5 times as fast on x8664 and around 15 % faster on x86 while the generated code is quite a bit smaller. Please see the comments in the code for the implementation details. Surprisingly, LOGCOUNT on bignums is noticeably slower than COUNT on bit vectors, although both are based on the same LOGCOUNT VOP. The cause of this is that BIGNUMLOGCOUNT iterates over the words of the bignum, negating each word individually before counting the ones in it if the bignum is negative. (LOGCOUNT is specified to count the number of ones in a positive integer and the number of zeroes in a negative integer.) The patch contains a new version of BIGNUMLOGCOUNT that has the conditional expression moved out of the loop: Independent of the sign of the bignum the loop counts the ones in the bignum. Afterwards, for negative bignums the number of zeroes is calculated by subtracting the result from the number of words of the bignum times the number of bits in a word. Since this multiplication is compiled as a shift this is efficient even for small bignums. This speeds up BIGNUMLOGCOUNT by about 25 % on x86 and by nearly 10 % on x8664, in addition to the gains above. I would expect a small gain on the other platforms, too. (The speedup on x86 is so large because the new version needs one local variable less so that they all fit into registers; the current version needs to keep the loop counter in a stack slot.) Finally, the patch adds a test to "arith.pure.lisp" that is more thorough than the one that is already in "arith.impure.lisp". 
From: Juho Snellman <jsnell@ik...>  20060728 01:47:54

lutz.euler@... (Lutz Euler) writes: > So I would like to propose the attached patch that makes LOGCOUNT run > 2.5 times as fast on x8664 and around 15 % faster on x86 while the > generated code is quite a bit smaller. Please see the comments in the > code for the implementation details. Nice! Committed this as 0.9.15.1.  Juho Snellman 
From: Sidney Markowitz <sidney@si...>  20060728 19:41:01

Juho Snellman wrote: > lutz.euler@... (Lutz Euler) writes: >> So I would like to propose the attached patch that makes LOGCOUNT run >> 2.5 times as fast on x8664 and around 15 % faster on x86 while the >> generated code is quite a bit smaller. Please see the comments in the >> code for the implementation details. > > Nice! Committed this as 0.9.15.1. I sent the following to Lutz, Cc'd to the sbcldevel list yesterday, but it doesn't seem to have reached the list. Perhaps the first email when one first joins the list is held for moderation? Anyway, here is what I sent, relevant to Lutz's patch:  Lutz Euler wrote: > So I would like to propose the attached patch that makes LOGCOUNT run > 2.5 times as fast on x8664 and around 15 % faster on x86 while the > generated code is quite a bit smaller. Please see the comments in the > code for the implementation details. By coincidence I just came to this mailing list to submit a patch to speed up x86 LOGCOUNT, after seeing the thread on comp.lang.lisp entitled "Squeezing blood from a turnip". I've got some code that provides some additional speedup and shorter code compared to your patch. Here are times from a test that sums logcount of the integers 0 through 199999999, compiled with (type (unsignedbyte 32)) declarations, run on a 2GHz Intel dual core MacBook: existing LOGCOUNT: 1.92 seconds Lutz Euler's LOGCOUNT: 1.51 seconds My LOGCOUNT: 1.36 seconds Noop loop, summing n instead of (logcount n): 0.3 seconds The code is based on the faster and smaller algorithm at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel It's the one that is referred to as the "best" counting bits algorithm. It uses an integer multiply to do the final shift and adds in parallel. This VOP also gets rid of an extra initial MOV instruction that I saw in both the existing and your versions. To see it, try (defun t1 (n) (declare (type (unsignedbyte 32) n) (logcount n)) (disassemble #'t1) The VOP is short enough that I'm just going to paste it in here rather than attach it as a patch, since presumably it should be merged in with the rest of your patch anyway. I don't have a 64 bit machine to test a 64 bit equivalent, but according to the link I mentioned earlier, this is the best algorithm for counting bits in 32 bit words, even when 64 bit instructions are available. Perhaps there is a similar thing that can be done for counting words that can be up to 64 bits on a 64 bit machine. This code goes in src/compiler/x86/arith.lisp, replacing the existing definevop for logcount: (definevop (unsignedbyte32count) (:translate logcount) (:note "inline (unsignedbyte 32) logcount") (:policy :fastsafe) (:args (arg :scs (unsignedreg) :target w)) (:argtypes unsignednum) (:results (result :scs (unsignedreg))) (:resulttypes positivefixnum) (:temporary (:sc unsignedreg :from (:argument 0)) w) (:generator 16 (move result arg) (move w arg) (inst shr w 1) (inst and w #x55555555) (inst sub result w) (inst mov w result) (inst shr w 2) (inst and result #x33333333) (inst and w #x33333333) (inst add result w) (inst mov w result) (inst shr result 4) (inst add result w) (inst and result #x0f0f0f0f) (inst imul result #x01010101) (inst shr result 24)))  Sidney Markowitz http://www.sidney.com 
From: <lutz.euler@fr...>  20060729 12:25:14

Hi, Sidney Markowitz wrote: > Lutz Euler wrote: > > So I would like to propose the attached patch that makes LOGCOUNT run > > 2.5 times as fast on x8664 and around 15 % faster on x86 while the > > generated code is quite a bit smaller. Please see the comments in the > > code for the implementation details. > > By coincidence I just came to this mailing list to submit a patch to > speed up x86 LOGCOUNT, after seeing the thread on comp.lang.lisp > entitled "Squeezing blood from a turnip". I've got some code that > provides some additional speedup and shorter code compared to your patch. > > Here are times from a test that sums logcount of the integers 0 through > 199999999, compiled with (type (unsignedbyte 32)) declarations, run on > a 2GHz Intel dual core MacBook: > > existing LOGCOUNT: 1.92 seconds > Lutz Euler's LOGCOUNT: 1.51 seconds > My LOGCOUNT: 1.36 seconds > > Noop loop, summing n instead of (logcount n): 0.3 seconds > > The code is based on the faster and smaller algorithm at > > http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel > > It's the one that is referred to as the "best" counting bits algorithm. > It uses an integer multiply to do the final shift and adds in parallel. > > This VOP also gets rid of an extra initial MOV instruction that I saw in > both the existing and your versions. To see it, try > > (defun t1 (n) (declare (type (unsignedbyte 32) n) (logcount n)) > (disassemble #'t1) > > The VOP is short enough that I'm just going to paste it in here rather > than attach it as a patch, since presumably it should be merged in with > the rest of your patch anyway. I don't have a 64 bit machine to test a > 64 bit equivalent, but according to the link I mentioned earlier, this > is the best algorithm for counting bits in 32 bit words, even when 64 > bit instructions are available. Perhaps there is a similar thing that > can be done for counting words that can be up to 64 bits on a 64 bit > machine. > > This code goes in src/compiler/x86/arith.lisp, replacing the existing > definevop for logcount: > > > (definevop (unsignedbyte32count) > (:translate logcount) > (:note "inline (unsignedbyte 32) logcount") > (:policy :fastsafe) > (:args (arg :scs (unsignedreg) :target w)) > (:argtypes unsignednum) > (:results (result :scs (unsignedreg))) > (:resulttypes positivefixnum) > (:temporary (:sc unsignedreg :from (:argument 0)) w) > (:generator 16 > (move result arg) > (move w arg) > > (inst shr w 1) > (inst and w #x55555555) > (inst sub result w) > > (inst mov w result) > (inst shr w 2) > (inst and result #x33333333) > (inst and w #x33333333) > (inst add result w) > > (inst mov w result) > (inst shr result 4) > (inst add result w) > (inst and result #x0f0f0f0f) > (inst imul result #x01010101) > (inst shr result 24))) It is nice to see that someone indeed has a need for a fast logcount! (clbench for instance does not care). I am reading the comp.lang.lisp "turnip" thread, too. May I add some remarks? 1. Concerning the multiplication at the end: As you can see from the x8664 version in my patch I am well aware of this solution. I tested this method on my Athlon 64 in 32bit mode and it was faster than the shiftandadd method. Despite this, I intentionally did not put it in the x86 port because it would be slower than the shiftandaddmethod for lots of existing x86 processors, for example Pentium 3 and 4. About the Athlon I don't know, while the Athlon 64 definitely likes the multiplication method, both in 32bit and in 64bit mode. OTOH, personally I have no stake in x86, so if it is more important to be as fast as possible on new hardware I don't mind hurting older hardware. A backendsubfeature "costofwordsizedimul" or some such may not be warranted? On x8664 I use the multiplication method because it saves two more clocks than in x86 and there is only one implementation of x8664 that has a slow multiplier (Pentium 4) and that processor is already very slow in 64bit mode generally so nobody would be using it with a 64bit SBCL anyway, or so I presume ... 2. Concerning the spared MOV at the beginning: Yes, that's better. According to my experiments (on x8664)the register allocator is not always able to find the allocation that minimizes the number of MOVs, depending on surrounding code. For instance, compare the disassembly of LOGCOUNT and of SBBIGNUM::LOGCOUNT. On x8664, the best way seemed to be to have both temporaries live as long as possible and I carried this conclusion over to the single temporary that x86 needs. 3. You can speed up your code by shifting the source of MOVs and not the destination: You have: > (move w arg) > > (inst shr w 1) and later: > (inst mov w result) > (inst shr w 2) Both have a latency of two clocks (on all superscalar x86 and x8664 processors that I know of). If you instead do (move w arg) (shr arg 1) and so on, the two instructions can be executed in parallel. It is easy to rewrite the x86 code to take this into account, as you did already for the third shift in your version above. I did it in my patch for all shifts (see the comments there). 4. Further optimizations When I wrote the patch, my intentions were to provide improvements both in code size and execution speed. If you don't mind much larger code size and if you have to count bit vectors or bignums of larger size, say at least two or three words, an additional speedup should be possible; I would estimate a factor of two. 4.1. More parallelism Write a VOP that takes two words at the same time and interleave the counting algorithms, additionally sharing the multiplication between them. As you see, the code has approximately 15 instructions but a latency of 12, so the parallelism available in your processor is by far not used. (It may be possible to even do three words at a time advantageously). Of course, you need to adapt the bignum and bitvector arithmetic loops to use this for large operands. Also, on x86, this may be hindered by too few registers being available. See for example "http://aggregate.org/MAGIC";. The entry on "population count" contains: "It is worthwhile noting that the SWAR population count algorithm given above can be improved upon for the case of counting the population of multiword bit sets. How? The last few steps in the reduction are using only a portion of the SWAR width to produce their results; thus, it would be possible to combine these steps across multiple words being reduced." 4.2. Reduction method I can't find the source of this currently (there is a very elaborate paper about in on the net), but you can use a single or a cascaded set of 32reductions (or 73 reductions) to further speed logcount up. A single 32 reduction looks like this: Input: 3 words to be counted Output: 2 words, "sum" and "carry", generated by using a bitwise full adder on the three input words (C notation): sum = x1 ^ x2 ^ x3 carry = (x1 & x2)  (x1 & x3)  (x2 & x3) (It may be faster to use carry = (x1 & x2)  (x3 & (x1 ^ x2)) instead.) Now each bit set in "sum" counts one in the final sum, each bit set in "carry" counts two. (A 73 reduction would be a combination of three full adders to add 7 input words, giving three words with weight 1, 2 and 4.) Repeat this across all your input, doing a logcount on all the "sum"s and in parallel on all the "carry"s. Finally add the sum of the logcounts of the "sum"s and the sum of the logcounts of the "carry"s, the latter shifted left by one. Again, this may be no gain on x86 because the number of registers is too small. 4.3. Better hardware Wait for AMD's K8L that AFAIK is supposed to have a logcount operation in hardware or microcode that should be much faster than anything that can be done in software. Happy hacking! Lutz Euler 
From: Sidney Markowitz <sidney@si...>  20060728 07:24:44

Lutz Euler wrote: > So I would like to propose the attached patch that makes LOGCOUNT run > 2.5 times as fast on x8664 and around 15 % faster on x86 while the > generated code is quite a bit smaller. Please see the comments in the > code for the implementation details. By coincidence I just came to this mailing list to submit a patch to speed up x86 LOGCOUNT, after seeing the thread on comp.lang.lisp entitled "Squeezing blood from a turnip". I've got some code that provides some additional speedup and shorter code compared to your patch. Here are times from a test that sums logcount of the integers 0 through 199999999, compiled with (type (unsignedbyte 32)) declarations, run on a 2GHz Intel dual core MacBook: existing LOGCOUNT: 1.92 seconds Lutz Euler's LOGCOUNT: 1.51 seconds My LOGCOUNT: 1.36 seconds Noop loop, summing n instead of (logcount n): 0.3 seconds The code is based on the faster and smaller algorithm at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel It's the one that is referred to as the "best" counting bits algorithm. It uses an integer multiply to do the final shift and adds in parallel. This VOP also gets rid of an extra initial MOV instruction that I saw in both the existing and your versions. To see it, try (defun t1 (n) (declare (type (unsignedbyte 32) n) (logcount n)) (disassemble #'t1) The VOP is short enough that I'm just going to paste it in here rather than attach it as a patch, since presumably it should be merged in with the rest of your patch anyway. I don't have a 64 bit machine to test a 64 bit equivalent, but according to the link I mentioned earlier, this is the best algorithm for counting bits in 32 bit words, even when 64 bit instructions are available. Perhaps there is a similar thing that can be done for counting words that can be up to 64 bits on a 64 bit machine. This code goes in src/compiler/x86/arith.lisp, replacing the existing definevop for logcount: (definevop (unsignedbyte32count) (:translate logcount) (:note "inline (unsignedbyte 32) logcount") (:policy :fastsafe) (:args (arg :scs (unsignedreg) :target w)) (:argtypes unsignednum) (:results (result :scs (unsignedreg))) (:resulttypes positivefixnum) (:temporary (:sc unsignedreg :from (:argument 0)) w) (:generator 16 (move result arg) (move w arg) (inst shr w 1) (inst and w #x55555555) (inst sub result w) (inst mov w result) (inst shr w 2) (inst and result #x33333333) (inst and w #x33333333) (inst add result w) (inst mov w result) (inst shr result 4) (inst add result w) (inst and result #x0f0f0f0f) (inst imul result #x01010101) (inst shr result 24)))  Sidney Markowitz http://www.sidney.com 
From: Sidney Markowitz <sidney@si...>  20060729 15:26:51

Lutz Euler wrote: > It is nice to see that someone indeed has a need for a fast logcount! > (clbench for instance does not care). I am reading the comp.lang.lisp > "turnip" thread, too. Thanks for the extra hints. I only had my Intel dual core machine to test on, and I had not yet read the 64 bit version of your patch when I posted. I defer to your knowledge of the other architectures and your more detailed knowledge about clock times and pipelining. I don't have a position on whether it is better to use the multiplication or the shift and add method if the former is slightly slower on older Pentiums... I haven't been involved with sbcl before now, so that's not for me to choose. I have a question about taking advantage of parallelism on the x86: When you say that (move w arg) (shr arg 1) allows the instructions to be executed in parallel, is that also true if I set up the order of something like (inst shr w 2) (inst and result #x33333333) (inst and w #x33333333) instead of (inst shr w 2) (inst and w #x33333333) (inst and result #x33333333) In other words, is there some parallelism in the instructions with immediates by having two instructions with different destinations?  Sidney Markowitz http://www.sidney.com 
From: <lutz.euler@fr...>  20060730 17:20:47

Hi Sidney, > Thanks for the extra hints. You're welcome. > I have a question about taking advantage of parallelism on the x86: just a short answer, see below, as I think this is getting offtopic for SBCL development. I will send you a mail offlist with some more details and/or links  please bear with me while I prepare that. > When you say that > > (move w arg) > (shr arg 1) > > allows the instructions to be executed in parallel, is that also true > if I set up the order of something like > > (inst shr w 2) > (inst and result #x33333333) > (inst and w #x33333333) > > instead of > > (inst shr w 2) > (inst and w #x33333333) > (inst and result #x33333333) > > In other words, is there some parallelism in the instructions with > immediates by having two instructions with different destinations? no and yes, that is: No, reordering instructions very often makes no difference: As the processor works not only "superscalar" but also "outoforder" I would expect the two sequences to execute at the same speed. Basically: Instructions will be executed, independent of the order in which they are given, as soon as the operands and execution resources they need are available. Yes, there is parallelism: Current x86 processors have at least two integer ALUs (I don't know the exact number for the Intel chips, Athlons have three), so for example two ANDs or an AND and a SHR can execute in the same clock cycle. Yours, Lutz Euler 