Hi,
Sidney Markowitz wrote:
> Lutz Euler wrote:
> > So I would like to propose the attached patch that makes LOGCOUNT run
> > 2.5 times as fast on x8664 and around 15 % faster on x86 while the
> > generated code is quite a bit smaller. Please see the comments in the
> > code for the implementation details.
>
> By coincidence I just came to this mailing list to submit a patch to
> speed up x86 LOGCOUNT, after seeing the thread on comp.lang.lisp
> entitled "Squeezing blood from a turnip". I've got some code that
> provides some additional speedup and shorter code compared to your patch.
>
> Here are times from a test that sums logcount of the integers 0 through
> 199999999, compiled with (type (unsignedbyte 32)) declarations, run on
> a 2GHz Intel dual core MacBook:
>
> existing LOGCOUNT: 1.92 seconds
> Lutz Euler's LOGCOUNT: 1.51 seconds
> My LOGCOUNT: 1.36 seconds
>
> Noop loop, summing n instead of (logcount n): 0.3 seconds
>
> The code is based on the faster and smaller algorithm at
>
> http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
>
> It's the one that is referred to as the "best" counting bits algorithm.
> It uses an integer multiply to do the final shift and adds in parallel.
>
> This VOP also gets rid of an extra initial MOV instruction that I saw in
> both the existing and your versions. To see it, try
>
> (defun t1 (n) (declare (type (unsignedbyte 32) n) (logcount n))
> (disassemble #'t1)
>
> The VOP is short enough that I'm just going to paste it in here rather
> than attach it as a patch, since presumably it should be merged in with
> the rest of your patch anyway. I don't have a 64 bit machine to test a
> 64 bit equivalent, but according to the link I mentioned earlier, this
> is the best algorithm for counting bits in 32 bit words, even when 64
> bit instructions are available. Perhaps there is a similar thing that
> can be done for counting words that can be up to 64 bits on a 64 bit
> machine.
>
> This code goes in src/compiler/x86/arith.lisp, replacing the existing
> definevop for logcount:
>
>
> (definevop (unsignedbyte32count)
> (:translate logcount)
> (:note "inline (unsignedbyte 32) logcount")
> (:policy :fastsafe)
> (:args (arg :scs (unsignedreg) :target w))
> (:argtypes unsignednum)
> (:results (result :scs (unsignedreg)))
> (:resulttypes positivefixnum)
> (:temporary (:sc unsignedreg :from (:argument 0)) w)
> (:generator 16
> (move result arg)
> (move w arg)
>
> (inst shr w 1)
> (inst and w #x55555555)
> (inst sub result w)
>
> (inst mov w result)
> (inst shr w 2)
> (inst and result #x33333333)
> (inst and w #x33333333)
> (inst add result w)
>
> (inst mov w result)
> (inst shr result 4)
> (inst add result w)
> (inst and result #x0f0f0f0f)
> (inst imul result #x01010101)
> (inst shr result 24)))
It is nice to see that someone indeed has a need for a fast logcount!
(clbench for instance does not care). I am reading the comp.lang.lisp
"turnip" thread, too.
May I add some remarks?
1. Concerning the multiplication at the end:
As you can see from the x8664 version in my patch I am well aware of
this solution. I tested this method on my Athlon 64 in 32bit mode and
it was faster than the shiftandadd method.
Despite this, I intentionally did not put it in the x86 port because it
would be slower than the shiftandaddmethod for lots of existing x86
processors, for example Pentium 3 and 4. About the Athlon I don't know,
while the Athlon 64 definitely likes the multiplication method, both in
32bit and in 64bit mode.
OTOH, personally I have no stake in x86, so if it is more important to
be as fast as possible on new hardware I don't mind hurting older
hardware. A backendsubfeature "costofwordsizedimul" or some such
may not be warranted?
On x8664 I use the multiplication method because it saves two more
clocks than in x86 and there is only one implementation of x8664 that
has a slow multiplier (Pentium 4) and that processor is already very
slow in 64bit mode generally so nobody would be using it with a 64bit
SBCL anyway, or so I presume ...
2. Concerning the spared MOV at the beginning:
Yes, that's better. According to my experiments (on x8664)the register
allocator is not always able to find the allocation that minimizes the
number of MOVs, depending on surrounding code. For instance, compare the
disassembly of LOGCOUNT and of SBBIGNUM::LOGCOUNT. On x8664, the best
way seemed to be to have both temporaries live as long as possible and I
carried this conclusion over to the single temporary that x86 needs.
3. You can speed up your code by shifting the source of MOVs and not the
destination:
You have:
> (move w arg)
>
> (inst shr w 1)
and later:
> (inst mov w result)
> (inst shr w 2)
Both have a latency of two clocks (on all superscalar x86 and x8664
processors that I know of). If you instead do
(move w arg)
(shr arg 1)
and so on, the two instructions can be executed in parallel.
It is easy to rewrite the x86 code to take this into account, as you did
already for the third shift in your version above. I did it in my patch
for all shifts (see the comments there).
4. Further optimizations
When I wrote the patch, my intentions were to provide improvements both
in code size and execution speed. If you don't mind much larger code
size and if you have to count bit vectors or bignums of larger size, say
at least two or three words, an additional speedup should be possible;
I would estimate a factor of two.
4.1. More parallelism
Write a VOP that takes two words at the same time and interleave the
counting algorithms, additionally sharing the multiplication between
them. As you see, the code has approximately 15 instructions but a
latency of 12, so the parallelism available in your processor is by far
not used. (It may be possible to even do three words at a time
advantageously).
Of course, you need to adapt the bignum and bitvector arithmetic loops
to use this for large operands.
Also, on x86, this may be hindered by too few registers being available.
See for example "http://aggregate.org/MAGIC";. The entry on "population
count" contains:
"It is worthwhile noting that the SWAR population count algorithm given
above can be improved upon for the case of counting the population of
multiword bit sets. How? The last few steps in the reduction are using
only a portion of the SWAR width to produce their results; thus, it
would be possible to combine these steps across multiple words being
reduced."
4.2. Reduction method
I can't find the source of this currently (there is a very elaborate
paper about in on the net), but you can use a single or a cascaded set
of 32reductions (or 73 reductions) to further speed logcount up.
A single 32 reduction looks like this:
Input: 3 words to be counted
Output: 2 words, "sum" and "carry", generated by using a bitwise full
adder on the three input words (C notation):
sum = x1 ^ x2 ^ x3
carry = (x1 & x2)  (x1 & x3)  (x2 & x3)
(It may be faster to use carry = (x1 & x2)  (x3 & (x1 ^ x2)) instead.)
Now each bit set in "sum" counts one in the final sum, each bit set in
"carry" counts two.
(A 73 reduction would be a combination of three full adders to add 7
input words, giving three words with weight 1, 2 and 4.)
Repeat this across all your input, doing a logcount on all the "sum"s
and in parallel on all the "carry"s. Finally add the sum of the
logcounts of the "sum"s and the sum of the logcounts of the "carry"s,
the latter shifted left by one.
Again, this may be no gain on x86 because the number of registers is too
small.
4.3. Better hardware
Wait for AMD's K8L that AFAIK is supposed to have a logcount operation
in hardware or microcode that should be much faster than anything that
can be done in software.
Happy hacking!
Lutz Euler
