From: David H. <dmh...@ma...> - 2007-04-22 13:28:25
I've now written a decent enough version of ssfft_fft_iterative_limbs() to be worth profiling, and I profiled it on all the same machines as before; the results are very promising.

This function is the main building block of larger FFTs; it probably won't be used often for a whole FFT by itself. It requires that n (the number of limbs per coefficient) is divisible by M/2 (where M is the transform length); in other words, that the Mth roots of unity can all be achieved by full limb shifts alone, with no bitshifts. This seems quite restrictive, but in fact it's a perfectly reasonable assumption when the function is used as a piece of a large transform.

For example, consider a transform length of 8192, with the smallest possible coefficients using sqrt2, namely 2048 bits = 32 limbs (assuming a 64-bit machine). The total data size is 8192 * 2048 bits = about 2MB. The FFT factoring code will split this into transforms of length 64 and 128. In the former case, ssfft_fft_iterative_limbs() is directly applicable, since 32 is divisible by 64/2. In the latter case, the factoring code will split again, even though the transform already fits easily into L1, since it can't use ssfft_fft_iterative_limbs() on a length 128 transform (32 is not divisible by 128/2). So in the end it will be running transforms of lengths 64, 16, and 8. (Hmmm, in this case it would be better for the factoring code to split three ways into 32 * 16 * 16; I need to think about that a bit more.)

So what's the point of all this.... The point is that ssfft_fft_iterative_limbs() still allows a twist parameter (psi), which can be any bitshift or even a sqrt2 rotation, and it arranges things so that all the sqrt2/bitshifts are done in the *last layer only*. Basically the trick is kind of like Bailey's N-step algorithm (N=?): if you want to do a transform of length A = B*C, you do transforms of lengths B and C using Bth and Cth roots of unity, and in between you have a twisting step where you need to use Ath roots of unity. I'm basically taking advantage of the fact that the higher-order roots of unity (those involving bitshifts) are more expensive than the others, so I push them all into an intermediate twisting layer. The nice thing in the SS case is that you don't have to do the twisting layer "separately"; you can merge it with all the other work.

So in the above example: under either a naive iterative regime or the current trunk code, most of the butterflies in the first layer require sqrt2 rotations and bitshifts, most of the butterflies in the next six layers require bitshifts, and the remaining six layers do not need any bitshifts. So altogether there are seven layers that have to think about bitshifts. Under the new code it's quite different. In the length 64 subtransforms, only the last of the six layers does sqrt2/bitshifts. In the length 16 subtransforms, again only the last of four layers has to do bitshifts. In the length 8 subtransforms, there are in fact no bitshifts left. So there are only two layers involving bitshifts, instead of seven.

Not only does this reduce the number of bitshifts, it means the layers that don't do bitshifts have less overhead to worry about, since they measure everything in limbs. Furthermore, the new code works much harder to do that last layer of merged shifts efficiently. There are a number of new operations required, like ssfft_add_rotate_bits(), and specialised code for doing sqrt2 rotations when n is even.
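To make the limb-shift condition concrete: the coefficients live in Z/(2^(n*B) + 1), where B is the number of bits per limb, so 2 is a root of unity of order 2*n*B, and a primitive Mth root of unity is a shift by 2*n*B/M bits. That's a whole number of limbs exactly when M/2 divides n. A quick Python sketch (illustrative only, not the ssfft code) checking the 8192 example:

    B = 64  # bits per limb, assuming a 64-bit machine

    def root_shift_bits(n, M):
        # In Z/(2^(n*B) + 1), 2 has order 2*n*B, so a primitive
        # Mth root of unity is 2^(2*n*B/M): a shift by 2*n*B/M bits.
        assert (2 * n * B) % M == 0
        return 2 * n * B // M

    for M in (64, 128):
        s = root_shift_bits(32, M)  # n = 32 limbs, as in the example
        kind = "whole limbs" if s % B == 0 else "needs bitshifts"
        print("M = %3d: root = 2^%d  (%s)" % (M, s, kind))

This prints a one-limb shift (2^64) for M = 64 and a half-limb shift (2^32) for M = 128, which is exactly why the factoring code can't hand the length 128 piece to ssfft_fft_iterative_limbs().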
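And here is the N-step pattern itself, as a minimal numpy sketch over plain complex numbers (nothing ssfft-specific; in the real thing the twiddles are shifts in Z/(2^N + 1), but the shape of the computation is the same):

    import numpy as np

    def n_step_fft(x, B, C):
        # Length A = B*C transform via transforms of lengths C and B,
        # with one twisting step using Ath roots of unity in between.
        A = B * C
        X = x.reshape(C, B)        # X[j2, j1] = x[j1 + B*j2]
        T = np.fft.fft(X, axis=0)  # B transforms of length C
        k2 = np.arange(C).reshape(C, 1)
        j1 = np.arange(B).reshape(1, B)
        T = T * np.exp(-2j * np.pi * k2 * j1 / A)  # twist by Ath roots
        Y = np.fft.fft(T, axis=1)  # C transforms of length B
        return Y.T.reshape(A)      # output index is C*k1 + k2

    x = np.random.rand(8192) + 1j * np.random.rand(8192)
    assert np.allclose(n_step_fft(x, 64, 128), np.fft.fft(x))

The expensive Ath roots appear only in that single middle step; the two smaller transforms get by with the cheap Bth and Cth roots, which in the SS setting are the limb shifts.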
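For reference, the "sqrt2" business is the usual trick that in Z/(2^N + 1), when N is divisible by 4, the element 2^(3N/4) - 2^(N/4) is a square root of 2, so a sqrt2 rotation costs just two shifts and a subtraction. Easy to check:

    N = 2048                # e.g. 32 limbs of 64 bits
    R = 2**N + 1
    s = (pow(2, 3 * N // 4, R) - pow(2, N // 4, R)) % R
    assert s * s % R == 2   # s is a square root of 2 mod 2^N + 1

Since 2 has order 2N, this s has order 4N = 8192, which is presumably where the 2048-bit minimum coefficient size in the example above comes from.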
==================================

Enough talk.... some profiling results.

I ran the profile on the same machines as before (except my laptop hasn't finished yet), with:

* transform lengths between 8 and 64
* various allowable coefficient lengths
* various truncation parameters
* psi = 0 (no twisting), 1 (sqrt2), 2 (bitshifts), 3 (sqrt2) and 4 (bitshifts)

This doesn't cover everything that will eventually be important, but it's a good start. In particular I didn't try nonzero values of eta, since the old FFT doesn't support that kind of twist, so I couldn't make a comparison with the old code. But I expect nonzero values of eta to have effects on the running time similar to the nonzero values of psi above.

Here are the *median* speedups of the test cases on the various machines:

G5: 1.35x
martinj: 1.2x
sage.math: 1.3x
bsd: 1.2x

So it's a pretty bloody decent speedup. Note that the final speedup when applied to the whole FFT will be somewhat less, because the cases with the biggest speedups will only apply to some of the layers of the transform. I'm guessing we'll see a speedup of maybe 10-15%, depending on the architecture. (Heh, this is a lot of effort for 10%, isn't it..... :-))

However, the new code was not always faster. The main black spot is when M = 8 and z = 1 (i.e. there is only one nonzero input coefficient). In fact the cases M = 8 and z = 1 are not great separately, and they are lethal in combination. I think I understand why the z = 1 case is slow: basically the bitshift factoring trick is not efficient when z = 1. I think I will need to write special-case code to handle z = 1, which shouldn't be hard. Another thing I might consider is writing a specialised version for the non-truncated case (i.e. z = g = M). The trunk code has something like that, and it shows.

==================================

I've been hacking pretty hard on this for the last few days, and I need a break. I'll come back and work on this function a bit more in a little while. Then the last piece of the puzzle is optimising the new FFT factoring code, which puts all of this to work for larger transforms.

David