[atlas-devel] 3.11.15: complex AMMM! a little help?

Guys,

By not leaving the keyboard or reading e-mail for a week, I have finally 
been able to extend the new GEMM tuning framework to all types and 
precisions.  This should result in substantial speedups for complex 
arithmetic on most machines.

This is a big milestone, and means that once I've added some support for 
things like low-memory operation, I can look at deprecating the old 
block-major tuning framework.

In my present framework, I just reuse the real kernels for complex, so 
this release shouldn't add to the install time over 3.11.14, even though 
it is optimizing the complex types along with the real.  On some 
machines, the largest real kernels will probably overwhelm the cache 
when used for complex, so I hope to later add some tuning that looks at 
decreasing NB when using the real kernels for complex.

However, it would be handy if I knew of a machine where this happened, 
and on my present machine complex performance without this tuning is 
outstanding (beating the real).

So, I would like to ask for your help.  I'd like anybody who 
installs 3.11.15 to post the following as a reply to this message:
    (1) Post your ARCH & ARCHDEFS macros from Make.inc
    (2) Indicate whether "make check" and "make ptcheck" succeed
    (3) Assuming they do, post the results of:
        cd BLDdir/bin
        make xdmmtst_amm2 xsmmtst_amm2 xcmmtst_amm2 xzmmtst_amm2
        ./xdmmtst_amm2 -N 2000 8000 2000
        ./xzmmtst_amm2 -N 2000 8000 2000
        ./xsmmtst_amm2 -N 2000 8000 2000
        ./xcmmtst_amm2 -N 2000 8000 2000
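
For anyone who would rather script step (3), here is a minimal sketch of a 
wrapper, not part of ATLAS.  The function name amm_cmds, the "run" argument, 
and the output file amm2_results.txt are all made up for illustration.  It 
assumes the three -N values are first size, max size, and step; the only 
part the message confirms is that the middle value is the max problem size.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ATLAS): print the four timing
# commands from step (3), one per line, in the same d, z, s, c order.
# Assumption: the -N arguments are first size, max size, and step.
amm_cmds() {
    first=${1:-2000}; max=${2:-8000}; step=${3:-2000}
    for pre in d z s c; do
        echo "./x${pre}mmtst_amm2 -N $first $max $step"
    done
}

# When called as "sh script.sh run [first max step]" from BLDdir/bin,
# run each tester and keep all output in one file to paste into a reply.
if [ "${1:-}" = "run" ]; then
    shift
    amm_cmds "$@" | while read -r cmd; do
        echo "== $cmd"
        sh -c "$cmd"
    done 2>&1 | tee amm2_results.txt
fi
```

With no size arguments this runs the same four commands as above; passing 
something like "run 2000 4000 2000" lowers the max problem size on a slow 
or small-memory machine.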

I'm looking for three things here: (1) platforms where the new framework 
doesn't work, (2) architectures where the new access-major framework 
isn't performing as well as the old block-major, and (3) places where the 
complex performance is worse than the real.

I'm probably going to be busy on other stuff for a while, but anything 
folks post will help me target my efforts when I get time to return.

You may want to reduce the max problem size (8000) if you have a slow 
machine, or one with a small amount of memory.  I have special cases for 
very large problems, cases with a degenerate dimension (e.g., N=1 or K=1), 
and rank-K update.  Other matrix shapes (e.g., inner product) may benefit 
from further tuning, which is why I want to scope asymptotic square 
problems first.

If you have a particular machine you use a lot, bring it to my 
attention.  Particularly old machines may be slower under the new system 
due to inadequate kernel support, and if I don't know about a 
performance shortfall on a particular platform, I can't put it on the 
"improve it" list.

If you use non-square matrix shapes and want to post data on those as 
well, please do.  It might affect which special cases I scope next.

I will post the results of the install on my home machine as a reply to 
this e-mail.

Cheers,
Clint

ATLAS 3.11.15 released 10/26/13, highlights of changes from 3.11.14:
    * Got access-major framework working for complex types!
    * Rewrote ATL_ammm_rkK for bug fixes and clarity
    * Got complex m-vectorized access-major copy routines working
    * Got complex k-vectorized access-major copy routines working
    * Added support for recognizing some Haswell chips sent in by:
        https://sourceforge.net/p/math-atlas/support-requests/913/

-- 
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************