From: David H. <dmh...@ma...> - 2007-04-22 13:28:25
I've now written a decent enough version of ssfft_fft_iterative_limbs() to be worth profiling, and I profiled it on all the same machines as before; the results are very promising.

This function is the main building block of larger FFTs; it probably won't be used often for a whole FFT by itself. It requires that n (the number of limbs per coefficient) is divisible by M/2 (where M is the transform length); in other words, that the Mth roots of unity can all be achieved by full limb shifts alone, with no bitshifts. This seems quite restrictive, but in fact it's a perfectly reasonable assumption when the function is used as a piece of a large transform.

For example, consider a transform length of 8192, with the smallest possible coefficients using sqrt2, namely 2048 bits = 32 limbs (assuming a 64-bit machine). The total data size is 8192 * 2048 bits = about 2MB. The FFT factoring code will split this into transforms of length 64 and 128. In the former case, ssfft_fft_iterative_limbs() is directly applicable, since 32 is divisible by 64/2. In the latter case, the factoring code will split again, even though the transform already fits easily into L1, since it can't use ssfft_fft_iterative_limbs() on a length 128 transform (32 is not divisible by 128/2). So in the end it will be running transforms of lengths 64, 16, and 8. (Hmmm, in this case it would be better for the factoring code to split three ways into 32 * 16 * 16; I need to think about that a bit more.)

So what's the point of all this.... The point is that ssfft_fft_iterative_limbs() still allows a twist parameter (psi), which can be any bitshift or even a sqrt2 rotation, and it arranges things so that all the sqrt2/bitshifts are done in the *last layer only*. Basically the trick is kind of like Bailey's N-step algorithm (N=?): if you want to do a transform of length A = B*C, you do transforms of lengths B and C using Bth and Cth roots of unity, and in between you have a twisting step where you need to use Ath roots of unity. I'm basically taking advantage of the fact that the higher-order roots of unity (those involving bitshifts) are more expensive than the others, so I push them all into an intermediate twisting layer. The nice thing in the SS case is that you don't have to do the twisting layer "separately"; you can merge it with all the other work.

So in the above example: under either a naive iterative regime or the current trunk code, most of the butterflies in the first layer require sqrt2 rotations and bitshifts, most of the butterflies in the next six layers require bitshifts, and the remaining six layers do not need any bitshifts. So altogether there are seven layers that have to think about bitshifts. Under the new code it's quite different. In the length 64 subtransforms, only the last of the six layers does sqrt2/bitshifts. In the length 16 subtransforms, again only the last of four layers has to do bitshifts. In the length 8 subtransforms, there are in fact no bitshifts left. So there are only two layers involving bitshifts, instead of seven.

Not only does this reduce the number of bitshifts, it means the layers that don't do bitshifts have less overhead to worry about, since they measure everything in limbs. Furthermore, the new code works much harder to do that last layer of merged shifts efficiently. There are a number of new operations required, like ssfft_add_rotate_bits(), and specialised code for doing sqrt2 rotations when n is even.
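To make the limb-shift condition concrete: the coefficients live in Z/(2^(n*B) + 1), where B is the number of bits per limb, so 2 is a root of unity of order 2*n*B, and a primitive Mth root of unity is a shift by 2*n*B/M bits. That's a whole number of limbs exactly when M/2 divides n. A quick Python sketch (illustrative only, not the ssfft code) checking the 8192 example:

    B = 64  # bits per limb, assuming a 64-bit machine

    def root_shift_bits(n, M):
        # In Z/(2^(n*B) + 1), 2 has order 2*n*B, so a primitive
        # Mth root of unity is 2^(2*n*B/M): a shift by 2*n*B/M bits.
        assert (2 * n * B) % M == 0
        return 2 * n * B // M

    for M in (64, 128):
        s = root_shift_bits(32, M)  # n = 32 limbs, as in the example
        kind = "whole limbs" if s % B == 0 else "needs bitshifts"
        print("M = %3d: root = 2^%d  (%s)" % (M, s, kind))

This prints a one-limb shift (2^64) for M = 64 and a half-limb shift (2^32) for M = 128, which is exactly why the factoring code can't hand the length 128 piece to ssfft_fft_iterative_limbs().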
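And here is the N-step pattern itself, as a minimal numpy sketch over plain complex numbers (nothing ssfft-specific; in the real thing the twiddles are shifts in Z/(2^N + 1), but the shape of the computation is the same):

    import numpy as np

    def n_step_fft(x, B, C):
        # Length A = B*C transform via transforms of lengths C and B,
        # with one twisting step using Ath roots of unity in between.
        A = B * C
        X = x.reshape(C, B)        # X[j2, j1] = x[j1 + B*j2]
        T = np.fft.fft(X, axis=0)  # B transforms of length C
        k2 = np.arange(C).reshape(C, 1)
        j1 = np.arange(B).reshape(1, B)
        T = T * np.exp(-2j * np.pi * k2 * j1 / A)  # twist by Ath roots
        Y = np.fft.fft(T, axis=1)  # C transforms of length B
        return Y.T.reshape(A)      # output index is C*k1 + k2

    x = np.random.rand(8192) + 1j * np.random.rand(8192)
    assert np.allclose(n_step_fft(x, 64, 128), np.fft.fft(x))

The expensive Ath roots appear only in that single middle step; the two smaller transforms get by with the cheap Bth and Cth roots, which in the SS setting are the limb shifts.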
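For reference, the "sqrt2" business is the usual trick that in Z/(2^N + 1), when N is divisible by 4, the element 2^(3N/4) - 2^(N/4) is a square root of 2, so a sqrt2 rotation costs just two shifts and a subtraction. Easy to check:

    N = 2048                # e.g. 32 limbs of 64 bits
    R = 2**N + 1
    s = (pow(2, 3 * N // 4, R) - pow(2, N // 4, R)) % R
    assert s * s % R == 2   # s is a square root of 2 mod 2^N + 1

Since 2 has order 2N, this s has order 4N = 8192, which is presumably where the 2048-bit minimum coefficient size in the example above comes from.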
==================================

Enough talk.... some profiling results.

I ran the profile on the same machines as before (except my laptop hasn't finished yet), with:

* transform lengths between 8 and 64
* various allowable coefficient lengths
* various truncation parameters
* psi = 0 (no twisting), 1 (sqrt2), 2 (bitshifts), 3 (sqrt2) and 4 (bitshifts)

This doesn't cover everything that will eventually be important, but it's a good start. In particular I didn't try nonzero values of eta, since the old FFT doesn't support that kind of twist, so I couldn't make a comparison with the old code. But I expect nonzero values of eta to have effects on the running time similar to the nonzero values of psi above.

Here are the *median* speedups of the test cases on the various machines:

G5: 1.35x
martinj: 1.2x
sage.math: 1.3x
bsd: 1.2x

So it's a pretty bloody decent speedup. Note that the final speedup when applied to the whole FFT will be somewhat less, because the cases with the biggest speedups will only apply to some of the layers of the transform. I'm guessing we'll see a speedup of maybe 10-15%, depending on the architecture. (Heh, this is a lot of effort for 10%, isn't it..... :-))

However, the new code was not always faster. The main black spot is when M = 8 and z = 1 (i.e. there is only one nonzero input coefficient). In fact the cases M = 8 and z = 1 are not great separately, and they are lethal in combination. I think I understand why the z = 1 case is slow: basically the bitshift factoring trick is not efficient when z = 1. I think I will need to write special-case code to handle z = 1, which shouldn't be hard. Another thing I might consider is writing a specialised version for the non-truncated case (i.e. z = g = M). The trunk code has something like that, and it shows.

==================================

I've been hacking pretty hard on this for the last few days, and I need a break. I'll come back and work on this function a bit more in a little while. Then the last piece of the puzzle is optimising the new FFT factoring code, which puts all of this to work for larger transforms.

David