From: A. M. A. <per...@gm...> - 2006-10-04 01:46:22
On 02/10/06, Travis Oliphant <oli...@ee...> wrote:
> Perhaps those inner 1-d loops could be optimized (using prefetch or
> something) to reduce the number of cache misses on the inner
> computation, and the concept of looping over the largest dimension
> (instead of the last dimension) should be re-considered.

Cache control seems to be the main factor deciding the speed of many algorithms. Prefetching could make a huge difference, particularly on NUMA machines (like a dual Opteron). I think GCC has a moderately portable way to request it (though it may be only in beta versions as yet).

More generally, all the tricks that ATLAS uses to accelerate BLAS routines would (in principle) be applicable here. The implementation would be extremely difficult, though, even if all the basic loops could be expressed in a few primitives.

A. M. Archibald
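
For concreteness, here is a rough sketch of what a prefetch hint in an inner 1-d loop might look like, assuming GCC's `__builtin_prefetch` builtin is available. The function name `strided_sum` and the look-ahead distance `PREFETCH_AHEAD` are made up for illustration and are not taken from the NumPy sources. Whether this helps at all depends on the machine and the stride, so it would need to be measured.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance, in elements. The best value is
   machine-dependent and would have to be found by measurement. */
#define PREFETCH_AHEAD 8

/* Sum n doubles spaced `stride` elements apart, starting at base.
   This stands in for an inner 1-d loop over a non-contiguous axis. */
double strided_sum(const double *base, size_t n, size_t stride)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Hint only: request the element PREFETCH_AHEAD iterations ahead.
           Arguments: address, 0 = read access, 1 = low temporal locality.
           The bounds check keeps the computed address inside the array. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&base[(i + PREFETCH_AHEAD) * stride], 0, 1);
#endif
        total += base[i * stride];
    }
    return total;
}
```

The hint does not change the result; on compilers without the builtin the loop falls back to a plain strided sum.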