From: <lut...@fr...> - 2006-07-29 12:25:14
|
Hi, Sidney Markowitz wrote: > Lutz Euler wrote: > > So I would like to propose the attached patch that makes LOGCOUNT run > > 2.5 times as fast on x86-64 and around 15 % faster on x86 while the > > generated code is quite a bit smaller. Please see the comments in the > > code for the implementation details. > > By coincidence I just came to this mailing list to submit a patch to > speed up x86 LOGCOUNT, after seeing the thread on comp.lang.lisp > entitled "Squeezing blood from a turnip". I've got some code that > provides some additional speedup and shorter code compared to your patch. > > Here are times from a test that sums logcount of the integers 0 through > 199999999, compiled with (type (unsigned-byte 32)) declarations, run on > a 2GHz Intel dual core MacBook: > > existing LOGCOUNT: 1.92 seconds > Lutz Euler's LOGCOUNT: 1.51 seconds > My LOGCOUNT: 1.36 seconds > > No-op loop, summing n instead of (logcount n): 0.3 seconds > > The code is based on the faster and smaller algorithm at > > http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel > > It's the one that is referred to as the "best" counting bits algorithm. > It uses an integer multiply to do the final shift and adds in parallel. > > This VOP also gets rid of an extra initial MOV instruction that I saw in > both the existing and your versions. To see it, try > > (defun t1 (n) (declare (type (unsigned-byte 32) n) (logcount n)) > (disassemble #'t1) > > The VOP is short enough that I'm just going to paste it in here rather > than attach it as a patch, since presumably it should be merged in with > the rest of your patch anyway. I don't have a 64 bit machine to test a > 64 bit equivalent, but according to the link I mentioned earlier, this > is the best algorithm for counting bits in 32 bit words, even when 64 > bit instructions are available. Perhaps there is a similar thing that > can be done for counting words that can be up to 64 bits on a 64 bit > machine. > > This code goes in src/compiler/x86/arith.lisp, replacing the existing > define-vop for logcount: > > > (define-vop (unsigned-byte-32-count) > (:translate logcount) > (:note "inline (unsigned-byte 32) logcount") > (:policy :fast-safe) > (:args (arg :scs (unsigned-reg) :target w)) > (:arg-types unsigned-num) > (:results (result :scs (unsigned-reg))) > (:result-types positive-fixnum) > (:temporary (:sc unsigned-reg :from (:argument 0)) w) > (:generator 16 > (move result arg) > (move w arg) > > (inst shr w 1) > (inst and w #x55555555) > (inst sub result w) > > (inst mov w result) > (inst shr w 2) > (inst and result #x33333333) > (inst and w #x33333333) > (inst add result w) > > (inst mov w result) > (inst shr result 4) > (inst add result w) > (inst and result #x0f0f0f0f) > (inst imul result #x01010101) > (inst shr result 24))) It is nice to see that someone indeed has a need for a fast logcount! (cl-bench for instance does not care). I am reading the comp.lang.lisp "turnip" thread, too. May I add some remarks? 1. Concerning the multiplication at the end: As you can see from the x86-64 version in my patch I am well aware of this solution. I tested this method on my Athlon 64 in 32-bit mode and it was faster than the shift-and-add method. Despite this, I intentionally did not put it in the x86 port because it would be slower than the shift-and-add-method for lots of existing x86 processors, for example Pentium 3 and 4. About the Athlon I don't know, while the Athlon 64 definitely likes the multiplication method, both in 32-bit and in 64-bit mode. OTOH, personally I have no stake in x86, so if it is more important to be as fast as possible on new hardware I don't mind hurting older hardware. A backend-subfeature "cost-of-word-sized-imul" or some such may not be warranted? On x86-64 I use the multiplication method because it saves two more clocks than in x86 and there is only one implementation of x86-64 that has a slow multiplier (Pentium 4) and that processor is already very slow in 64-bit mode generally so nobody would be using it with a 64-bit SBCL anyway, or so I presume ... 2. Concerning the spared MOV at the beginning: Yes, that's better. According to my experiments (on x86-64)the register allocator is not always able to find the allocation that minimizes the number of MOVs, depending on surrounding code. For instance, compare the disassembly of LOGCOUNT and of SB-BIGNUM::LOGCOUNT. On x86-64, the best way seemed to be to have both temporaries live as long as possible and I carried this conclusion over to the single temporary that x86 needs. 3. You can speed up your code by shifting the source of MOVs and not the destination: You have: > (move w arg) > > (inst shr w 1) and later: > (inst mov w result) > (inst shr w 2) Both have a latency of two clocks (on all superscalar x86 and x86-64 processors that I know of). If you instead do (move w arg) (shr arg 1) and so on, the two instructions can be executed in parallel. It is easy to rewrite the x86 code to take this into account, as you did already for the third shift in your version above. I did it in my patch for all shifts (see the comments there). 4. Further optimizations When I wrote the patch, my intentions were to provide improvements both in code size and execution speed. If you don't mind much larger code size and if you have to count bit vectors or bignums of larger size, say at least two or three words, an additional speedup should be possible; I would estimate a factor of two. 4.1. More parallelism Write a VOP that takes two words at the same time and interleave the counting algorithms, additionally sharing the multiplication between them. As you see, the code has approximately 15 instructions but a latency of 12, so the parallelism available in your processor is by far not used. (It may be possible to even do three words at a time advantageously). Of course, you need to adapt the bignum and bit-vector arithmetic loops to use this for large operands. Also, on x86, this may be hindered by too few registers being available. See for example "http://aggregate.org/MAGIC". The entry on "population count" contains: "It is worthwhile noting that the SWAR population count algorithm given above can be improved upon for the case of counting the population of multi-word bit sets. How? The last few steps in the reduction are using only a portion of the SWAR width to produce their results; thus, it would be possible to combine these steps across multiple words being reduced." 4.2. Reduction method I can't find the source of this currently (there is a very elaborate paper about in on the net), but you can use a single or a cascaded set of 3-2-reductions (or 7-3 reductions) to further speed logcount up. A single 3-2 reduction looks like this: Input: 3 words to be counted Output: 2 words, "sum" and "carry", generated by using a bitwise full adder on the three input words (C notation): sum = x1 ^ x2 ^ x3 carry = (x1 & x2) | (x1 & x3) | (x2 & x3) (It may be faster to use carry = (x1 & x2) | (x3 & (x1 ^ x2)) instead.) Now each bit set in "sum" counts one in the final sum, each bit set in "carry" counts two. (A 7-3 reduction would be a combination of three full adders to add 7 input words, giving three words with weight 1, 2 and 4.) Repeat this across all your input, doing a logcount on all the "sum"s and in parallel on all the "carry"s. Finally add the sum of the logcounts of the "sum"s and the sum of the logcounts of the "carry"s, the latter shifted left by one. Again, this may be no gain on x86 because the number of registers is too small. 4.3. Better hardware Wait for AMD's K8L that AFAIK is supposed to have a logcount operation in hardware or microcode that should be much faster than anything that can be done in software. Happy hacking! Lutz Euler |