Eric Heien has checked Astropulse (a SETI@home application currently in beta) into the CVS repository. Under the client directory you will find fftgpu, which contains experimental versions of FFT code running on GPUs.
Check out this optimization document. It is aimed at Pentium processors, but it has useful information for all CPUs:
organizing data for best cache usage, avoiding dependency chains, cache associativity and what it means for memory addresses, and more.
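To make the "avoiding dependency chains" point concrete, here is a minimal sketch. It is not taken from the optimization document or from the SETI@home sources, and the function names are made up for illustration. With a single accumulator every addition has to wait for the previous one to finish; with several independent accumulators the CPU can overlap the additions.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each addition depends on the result of the one before it,
// so the loop runs at the latency of a single floating-point add.
float sum_chained(const std::vector<float>& v) {
    float s = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}

// Four independent accumulators: the four additions in the loop body do not
// depend on each other, so they can be in flight at the same time.
float sum_unrolled(const std::vector<float>& v) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        s0 += v[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add the values in a different order, so for general floating-point data the results can differ in the last bits. Whether the unrolled version is actually faster depends on the compiler and CPU, so measure before relying on it.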
Eric's SIMD template can be viewed at:
http://cvs.sourceforge.net/viewcvs.py/setiboinc/setiboinc/vector_lib/
To see it, use the online CVS viewer or do a checkout.
The branch is "runtime_SIMD_select".
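The branch name suggests the SIMD code path is picked at run time. The following is only a rough sketch of that general idea, not the code in vector_lib: all names (ScaleFn, scale_scalar, scale_wide, cpu_has_sse2, select_scale) are hypothetical, and scale_wide is a plain unrolled loop standing in for a real SIMD routine. The CPU check assumes the GCC/Clang builtin __builtin_cpu_supports on x86 and falls back to the scalar path everywhere else.

```cpp
#include <cstddef>

// Hypothetical function-pointer type shared by all variants.
using ScaleFn = void (*)(float*, std::size_t, float);

// Portable fallback: one element per iteration.
void scale_scalar(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= k;
    }
}

// Stand-in for a SIMD variant (assumption: a real one would use intrinsics).
// It processes four elements per iteration, then the remainder.
void scale_wide(float* data, std::size_t n, float k) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i] *= k;
        data[i + 1] *= k;
        data[i + 2] *= k;
        data[i + 3] *= k;
    }
    for (; i < n; ++i) {
        data[i] *= k;
    }
}

// Ask the CPU at run time whether SSE2 is available.
// Assumes GCC or Clang on x86; any other target reports "no".
bool cpu_has_sse2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// Pick the implementation once; callers then go through the pointer.
ScaleFn select_scale() {
    return cpu_has_sse2() ? scale_wide : scale_scalar;
}
```

The point of the pattern is that one binary can run on older CPUs and still use the faster path where it exists: call select_scale() once at startup and keep the returned pointer.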