Re: [Fastlibnt-devel] ssfft progress report

On Apr 17, 2007, at 1:57 PM, Development list for FLINT wrote:

> I just had a look at your code and one big performance
> hit is the lack of static inline functions.

I have some "inline" functions which are not "static inline". What is
the difference between "static inline" and just "inline"? I thought it
was only a linkage issue. What does plain "inline" do?

Of the remaining functions, I think the only candidates for inlining, 
which are not already marked inline, are the ones starting with 
"coeff". Some of these look a bit long to be inlined (code bloat), but 
I agree I should try inlining some of the shorter ones.

But this doesn't answer the basic question about the slowness of the 
code on the G5.

Perhaps I need to explain what's going on with the new code, so you can 
see why I am perplexed.

The ssfft code has basically three layers. The bottom layer consists of
the functions starting with "basic". These are very low-level coefficient
operations on raw blocks of memory, like rotations and bitshifts with
carry handling. The middle layer consists of the functions starting with
"coeff". These are allowed to do things like swap buffers, and they make
decisions about how to decompose large rotations into bitshifts and
limbshifts, etc. Finally, the top layer consists of functions that call
the coefficient operations in some appropriate order to carry out FFTs.
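
To make the layering concrete, here is a rough sketch in C. It is not the actual ssfft code: every name, type and signature below is hypothetical, and the middle-layer routine does a plain left shift that discards overflow rather than a real rotation with wraparound. It only illustrates how a "coeff" function might split a large shift into a limbshift plus a bitshift and hand each piece to a "basic" function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;          /* hypothetical limb type */
#define LIMB_BITS 64

/* Bottom layer ("basic"): raw operations on blocks of memory. */

/* Shift an n-limb block left by 0 < bits < LIMB_BITS, propagating the
   carry between limbs. Returns the bits shifted out of the top limb. */
static inline limb_t basic_shift_left_bits(limb_t *dst, const limb_t *src,
                                           size_t n, unsigned bits)
{
    assert(bits > 0 && bits < LIMB_BITS);
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t x = src[i];
        dst[i] = (x << bits) | carry;
        carry = x >> (LIMB_BITS - bits);
    }
    return carry;
}

/* Move an n-limb block up by whole limbs, filling the bottom with zeros.
   Runs from the top down so that dst == src is safe. */
static inline void basic_shift_left_limbs(limb_t *dst, const limb_t *src,
                                          size_t n, size_t limbs)
{
    for (size_t i = n; i-- > 0; )
        dst[i] = (i >= limbs) ? src[i - limbs] : 0;
}

/* Middle layer ("coeff"): decides how to decompose a shift by s bits
   into a limbshift followed by a bitshift. */
static inline void coeff_shift_left(limb_t *x, size_t n, size_t s)
{
    size_t limbs = s / LIMB_BITS;
    unsigned bits = (unsigned)(s % LIMB_BITS);
    if (limbs)
        basic_shift_left_limbs(x, x, n, limbs);
    if (bits)
        (void)basic_shift_left_bits(x, x, n, bits);
}

/* Top layer: calls the coefficient operations in some order. This loop
   is only a stand-in for an FFT pass; it is not an FFT. */
void top_shift_all(limb_t **coeffs, size_t count, size_t n, size_t s)
{
    for (size_t i = 0; i < count; i++)
        coeff_shift_left(coeffs[i], n, i * s);
}
```

In this sketch the two lower layers are "static inline", so a top-layer loop that calls them directly gives the compiler a chance to inline the whole chain.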

Now the bottom and middle layers have NOT changed between the trunk 
version and my new version. I am only fiddling with the top layer for 
this new code. (There is one minor change I want to make to some middle 
layer code at some point, but I haven't got to that yet.)

In particular the old code has just as much inlining going on as the 
new code. In fact I would argue the new code is *better* inlined, for 
the following reason. The old code used a table lookup to decide which 
of the 16 variants of the radix-4 transform to call on each block. So 
it was using function pointers all over the place. Surely function 
pointers are the arch-nemesis of inlining.
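
As a toy illustration of the point about function pointers (not the trunk code: the two variants, the table and all names here are made up, and the real table had 16 radix-4 variants), compare dispatch through a table with direct calls. With the table, the callee is selected at run time through a pointer, so the compiler typically has to emit an indirect call. With direct calls, each call site names its callee and the compiler is free to inline it.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*block_fn)(long *x);   /* hypothetical variant signature */

/* Two toy "variants" acting on a block of two values. */
static inline void variant_a(long *x)
{
    long t = x[0];
    x[0] = t + x[1];
    x[1] = t - x[1];
}

static inline void variant_b(long *x)
{
    long t = x[0];
    x[0] = t - x[1];
    x[1] = t + x[1];
}

static const block_fn variant_table[2] = { variant_a, variant_b };

/* Table dispatch: the variant is looked up per block and called
   through a function pointer. */
void transform_via_table(long *data, size_t nblocks,
                         const unsigned char *which)
{
    assert(data != NULL || nblocks == 0);
    for (size_t i = 0; i < nblocks; i++)
        variant_table[which[i] & 1](data + 2 * i);
}

/* Direct calls: same result, but every call site names its callee,
   so the bodies can be inlined into the loop. */
void transform_direct(long *data, size_t nblocks,
                      const unsigned char *which)
{
    assert(data != NULL || nblocks == 0);
    for (size_t i = 0; i < nblocks; i++) {
        if (which[i] & 1)
            variant_b(data + 2 * i);
        else
            variant_a(data + 2 * i);
    }
}
```

Both functions compute the same thing; the difference is only in what the optimizer can see at the call site. Whether that matters in practice depends on the compiler and on how long the variant bodies are.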

The new code being profiled is basically all in one function:
ssfft_fft_iterative(). It doesn't call any other FFT functions; it
calls directly into the middle and bottom layers to do everything.
There should be much less function call overhead than before.

david