[atlas-devel] 3.11.15: complex AMMM! a little help?
From: R. C. W. <rcw...@ls...> - 2013-10-27 01:34:14
Guys,

By not leaving the keyboard or reading e-mail for a week, I have finally been able to extend the new GEMM tuning framework to all types and precisions. This should result in substantial speedups for complex arithmetic on most machines. This is a big milestone, and means that once I've added some support for things like low-memory operation, I can look at deprecating the old block-major tuning framework.

In my present framework, I just reuse the real kernels for complex, so this release shouldn't add to the install time over 3.11.14, even though it is optimizing the complex types along with the real. I expect that on some machines the largest real kernels will overwhelm the cache when used for complex, and so I hope to later add some tuning that looks at decreasing NB when using the real kernels. However, it would be handy if I knew of a machine where this happened, and on my present machine the complex performance without this tuning is outstanding (beating the real).

So, I would like to ask for your help. I'd like anybody who installs 3.11.15 to post the following as a reply to this message:

(1) Post your ARCH & ARCHDEFS macros from Make.inc
(2) Indicate whether "make check" and "make ptcheck" succeed
(3) Assuming they do, post the results of:
       cd BLDdir/bin
       make xdmmtst_amm2 xsmmtst_amm2 xcmmtst_amm2 xzmmtst_amm2
       ./xdmmtst_amm2 -N 2000 8000 2000
       ./xzmmtst_amm2 -N 2000 8000 2000
       ./xsmmtst_amm2 -N 2000 8000 2000
       ./xcmmtst_amm2 -N 2000 8000 2000

I'm looking for three things here: (1) platforms where the new framework doesn't work, (2) architectures where the new access-major framework isn't performing as well as the old block-major, and (3) places where the complex performance is worse than the real. I'm probably going to be busy on other stuff for a while, but anything folks post will help me target my efforts when I get time to return.

You may want to reduce the max problem size (8000) if you have a slow machine, or one with a small amount of memory.
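
As a rough sketch (not part of the ATLAS distribution), the four timing commands above could be wrapped in a small shell function so that the max problem size is changed in one place. The names run_amm_timers and gemm_mib are hypothetical. The memory figure assumes only three square N x N operands (A, B and C); what the timers actually allocate is not stated in this message and may well be larger, so treat it as a lower bound when picking a smaller max size.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with ATLAS.

# gemm_mib N ELEMSIZE
# Rough size in MiB of three N x N matrices with ELEMSIZE bytes per element
# (4 = single real, 8 = double real or single complex, 16 = double complex).
# Assumption: the timer holds just A, B and C; it may allocate more.
gemm_mib() {
    echo $(( 3 * $1 * $1 * $2 / 1048576 ))
}

# run_amm_timers BINDIR [MAXN]
# Runs the four timers in the order used in the message (d, z, s, c),
# replacing only the max problem size (default 8000).
run_amm_timers() {
    bindir=$1
    maxn=${2:-8000}
    for pre in d z s c; do
        exe="$bindir/x${pre}mmtst_amm2"
        if [ -x "$exe" ]; then
            echo "== $exe -N 2000 $maxn 2000"
            "$exe" -N 2000 "$maxn" 2000
        else
            # Nothing to run yet: the timer has not been built.
            echo "== $exe not found; run 'make x${pre}mmtst_amm2' first"
        fi
    done
}
```

After sourcing the file, something like "run_amm_timers BLDdir/bin 4000 > amm_results.txt" would collect all four outputs in one file for posting, and "gemm_mib 8000 16" gives a quick feel for whether the double complex run at the default max size is likely to fit in memory.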

I have special cases for very large problems, cases with a degenerate dimension (e.g., N=1 or K=1), and rank-K update. Other matrix shapes (e.g., inner product) may benefit from further tuning, which is why I want to initially scope asymptotic square.

If you have a particular machine you use a lot, bring it to my attention. Particularly old machines may be slower under the new system due to inadequate kernel support, and if I don't know about a performance shortfall on a particular platform, I can't put it on the "improve it" list.

If you use non-square matrix shapes and want to post data on those as well, please do. It might affect which special cases I scope next.

I will post the results of the install on my home machine as a reply to this e-mail.

Cheers,
Clint

ATLAS 3.11.15 released 10/26/13, highlights of changes from 3.11.14:
 * Got access-major framework working for complex types!
 * Rewrote ATL_ammm_rkK for bug fixes and clarity
 * Got complex m-vectorized access-major copy routines working
 * Got complex k-vectorized access-major copy routines working
 * Added support for recognizing some Haswell chips sent in by:
   https://sourceforge.net/p/math-atlas/support-requests/913/

--
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************