[atlas-devel] 3.11.35 - simd vectorization & Power8
From: R. C. W. <cw...@cc...> - 2015-07-30 00:47:10
Guys,

I am happy to finally release 3.11.35. I've been working on a bunch of stuff, but the reason for the release is that I've just finished rewriting the way ATLAS handles SIMD vectorization, along with a related rewrite of the GEMM code generator. This has touched the entire framework, and more needs to be done there, but I wanted a release before rebreaking things.

In the past, I've used intrinsics as a lighter-weight way to write vector code (compared to assembly), but the problem is that they don't port: when Intel lengthens the vector, the code must be rewritten, and my Intel code does not work on ARM. So, I've added a file, atlas_simd.h, that abstracts all this stuff. Code written in terms of these fake "extrinsics" will not need to change when the vector lengthens, and if I update this file for a completely new vectorization ISA, all codes should just work. The idea is to keep ATLAS from having x86 be the only well-supported architecture. Right now, this code is mainly used for GEMM, but it should eventually get into all the code generators, making all archs more equal.

There's also a dependent "atlas_cplxsimd.h" that helps in doing complex arithmetic in a portable way. Both of these guys are still evolving (I only support operations I currently need), but I have tested them on: AVX2, AVX, SSE3, SSE2, SSE1, VSX, ARM64. If gcc knows how to vectorize, then this file will work even before I support the new vectorization scheme: if nothing else works, atlas_simd.h expresses everything in terms of gcc's vector builtins, which work the same on all platforms.

I've been wanting to do this for a **long** time, but the impetus here was getting ATLAS to work on the Power8 architecture. IBM has started OpenPOWER to allow others to license their chips, and you can now buy some affordable Power8 systems from China. Hopefully, they get more people on board, the price comes down, and we can see some real competition with x86, ARM64, and Power all going head-to-head!
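To make the "gcc's vector builtins" fallback a bit more concrete, here is a minimal sketch of vector-length-independent code written on top of gcc's generic vector extension. Every name in it (VEC_BYTES, vec_t, VLEN, vec_load, vec_store, vec_bcast, axpy_vec) is hypothetical and chosen only for illustration; none of them is taken from atlas_simd.h, whose actual macros may look quite different.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; NOT the macros from atlas_simd.h. */
#ifndef VEC_BYTES
   #define VEC_BYTES 16   /* vector width in bytes; override with -DVEC_BYTES=32 */
#endif

/* gcc generic vector type: the compiler maps it to SSE/AVX/VSX/NEON as available */
typedef double vec_t __attribute__((vector_size(VEC_BYTES)));
#define VLEN ((int)(sizeof(vec_t) / sizeof(double)))

/* Load VLEN doubles from a possibly unaligned address */
static inline vec_t vec_load(const double *p)
{
   vec_t v;
   memcpy(&v, p, sizeof(v));
   return v;
}

/* Store VLEN doubles to a possibly unaligned address */
static inline void vec_store(double *p, vec_t v)
{
   memcpy(p, &v, sizeof(v));
}

/* Replicate a scalar into every vector element */
static inline vec_t vec_bcast(double x)
{
   double tmp[VLEN];
   int i;
   for (i = 0; i < VLEN; i++)
      tmp[i] = x;
   return vec_load(tmp);
}

/* y <- alpha*x + y; the loop body never mentions the vector length */
void axpy_vec(int n, double alpha, const double *x, double *y)
{
   const vec_t va = vec_bcast(alpha);
   int i;

   assert(n >= 0);
   for (i = 0; i + VLEN <= n; i += VLEN)          /* full vectors */
      vec_store(y + i, va * vec_load(x + i) + vec_load(y + i));
   for (; i < n; i++)                             /* scalar cleanup */
      y[i] += alpha * x[i];
}
```

The point of the sketch is that axpy_vec does not change when VEC_BYTES does: only the typedef line knows the vector width, which is the same property described above for code written against the "extrinsics".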
Anyway, this is the first ATLAS release to explicitly support both the Power8 on Linux (little endian) and VSX (using atlas_simd.h). If you install on a Power8 machine, be sure to tell ATLAS not to use all those virtual processors, since this will kill performance. You use the -tl configure argument for this. For my 4-core Power8, the command was:

   ../configure -b 64 -tl 4 0 8 16 24

(the "4" is the number of physical cores, and the four following numbers are the thread IDs to use).

I have not done a lot of hand-tuning for Power8. The new code generator gets around 87% of peak for double, and 66% for single (I think the single slowdown may be related to how intrinsics are handled due to endian issues, but that may not be it).

ARM64's "Advanced SIMD" is also supported in atlas_simd.h. However, you won't see huge speedups, because David Nuechterlein's hand-tuned kernels are about 5% better. As a matter of fact, I saw a very slight slowdown for serial LU; this may be due to the huge number of other changes since 3.11.34 (I don't see how cleanup code getting vectorized could slow things down even slightly).

The present code generator is still quite limited (no prefetch, no K unrolling). However, changing things like this shakes the entire framework, so I want to do a release with Power8 support before I look at the extensions that this work has suggested. Once I get prefetch & K unrolling in, I'll need to rework the search yet again, and then I think older AMD/Intel machines will be much better served by the new framework than the old.

Sourceforge file servers are currently broken. I'll upload 3.11.35 as soon as they work again. In the meantime, I've hung it off my homepage:

   wget http://www.csc.lsu.edu/~whaley/dload/atlas3.11.35.tar.bz2

should work until sourceforge comes back up . . .
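For the -tl argument shown earlier, the thread IDs 0 8 16 24 look like one hardware thread per physical core with a stride of 8. The sketch below builds such a list under that assumption, namely that hardware thread IDs are numbered consecutively within each core and that each core exposes 8 of them. That numbering is inferred from the example and is not something ATLAS guarantees, so check your own machine's topology first. Both function names are hypothetical helpers and are not part of ATLAS.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of ATLAS): pick the first hardware thread
 * of each physical core, ASSUMING core c owns IDs c*smt .. c*smt+smt-1.
 * Returns the number of IDs written to tids[].
 */
int first_tid_per_core(int ncores, int smt, int *tids)
{
   int c;

   assert(ncores > 0 && smt > 0);
   for (c = 0; c < ncores; c++)
      tids[c] = c * smt;
   return ncores;
}

/*
 * Hypothetical helper: format "-tl <ncores> <tid0> <tid1> ..." into buf.
 * Returns the string length, or -1 if buf is too small.
 */
int format_tl_arg(int ncores, const int *tids, char *buf, size_t len)
{
   size_t used;
   int c, n;

   n = snprintf(buf, len, "-tl %d", ncores);
   if (n < 0 || (size_t)n >= len)
      return -1;
   used = (size_t)n;
   for (c = 0; c < ncores; c++) {
      n = snprintf(buf + used, len - used, " %d", tids[c]);
      if (n < 0 || (size_t)n >= len - used)
         return -1;
      used += (size_t)n;
   }
   return (int)used;
}
```

With ncores = 4 and smt = 8, the two helpers together produce the string "-tl 4 0 8 16 24", matching the argument in the configure command for the 4-core machine.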
Cheers,
Clint

ATLAS 3.11.35 released 07/29/15, highlights of changes from 3.11.34:
   * Added basic configure support for Linux/Power8
   * Addition of atlas_simd.h to provide SIMD support that is independent
     of architecture and vector length.
     + gnuvec, VSX, ARM64 (Advanced SIMD), AVX2, AVX, SSE3, SSE2, SSE1
   * Complete rewrite of gemm code generator to target atlas_simd.h.
     + Supports vectorizing K dim as well as M
     + No support for explicit K unrolling yet (required for old x86)
     + uammsearch.c completely broken until finished evolving code gens
   * Addition of atlas_cplxsimd.h to provide support for complex for L1/L2
     + axpy/dot written as test cases, but not tuned or put in archdefs
   * Rewrite of src/blas/ammm for higher abstraction & maintainability
   * Fixed bug in ammm kernel generation where different kerns had the
     same filename, resulting in kern collisions, and thus errors.
   * Add ATL_ammm_syrk to provide 15-3% (small-asymp) serial Cholesky speedup
   * iFKO added to tarfile, but not yet hooked up to ATLAS install
   * Got rid of some race conditions in timers