[atlas-devel] 3.11.40: I am not dead
From: R. C. W. <rcw...@iu...> - 2018-10-03 00:53:40
Guys,
Sorry to spam both lists and any dups that causes, but since it has
looked like I've retired, I'm sending this to atlas-devel & announce.
3.11.40 has finally been released. I have actually been working on it
for most of this time, but, with the move to Indiana factored in, it has
taken me this long to get the framework working again!
The reason is that we have essentially rewritten the entire way
microkernels are tuned and accessed in the library. Therefore, the
majority of tuning code has been touched or rewritten, and since this
includes all the generation, etc, it took a long while to get things at
all reliable.
The end goal is that increased microkernel specialization should greatly
increase our weird-shape and parallel scaling performance.
Right now, you will hopefully see much better serial non-GEMM BLAS
performance (e.g., small-triangle TRSM or TRMM). Very large problems
aren't likely to show a huge difference if prior releases supported your
architecture well. Where they did not, the gain can be large (e.g.,
we've added AVX-512 to the code generators, which obviously will hugely
improve SkylakeX asymptotic performance).
The installs have gone from long to endless, unfortunately. I will fix
this before stable, but right now searches are all brute-force and
ignorance while we concentrate on getting the last of the microkernel
handling solidified. I will attempt to speed up search later, and allow
for a "no-timing" install from archdefs, so that people on
already-supported platforms can skip most or all of the tuning (a
feature many maintainers have long wanted).
For now, terrible install times will just be a feature until we finish
debugging and publish the new BLAS approach.
The major weakness in the install when run on arbitrary machines right
now (other than time) is in some new cache detection code that creates a
file called atlas_cache.h. This code dies on several machines, and I
haven't had time to track down details. However, if it fails for you,
open up a tracker item and I can tell you how to proceed beyond it even
before fixing the code in question.
Hopefully, this release should be strictly faster than any that came
before, but if you spot performance regressions, please let us know. We
are not yet always using the correct microkernel (even when the library
has built it), because work on our selection algorithm is waiting on
completion of the new tuning strategy.
Eventually, ATLAS will be able to tune microkernels not only to build
the BLAS/LAPACK, but also to build specialized operations for people
wanting to avoid BLAS overheads (at the cost of calling messy
microkernels; think of things like tensor algebra with very small shapes
that need to scale, perhaps machine learning, etc.). This gives you the
detailed cache control necessary to scale when the problem size isn't
large enough to dominate low-order terms and thus make the overhead of
the BLAS API acceptable.
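As a rough illustration of the low-order-term argument (a sketch only, not ATLAS code; the function name gemm_flops_per_element is hypothetical): a GEMM with an MxK and a KxN operand performs 2*M*N*K flops but touches only M*K + K*N + M*N matrix elements. The flops available to amortize each element's data movement therefore grow with the problem size and are small for small shapes.

```c
/* Hypothetical helper, not part of ATLAS: flops per matrix element
 * touched by C = A*B, where A is MxK, B is KxN and C is MxN.
 * Each element is counted once, ignoring cache effects and copies. */
static double gemm_flops_per_element(double M, double N, double K)
{
    double flops = 2.0 * M * N * K;        /* one multiply and one add per (i,j,k) */
    double elems = M * K + K * N + M * N;  /* elements of A, B and C */
    return flops / elems;
}
```

For square problems this ratio is 2N/3: a 3x3x3 product gets 2 flops per element, while a 3000x3000x3000 product gets 2000, which is why per-call overheads that vanish for large GEMM can dominate for tiny shapes.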
ChangeLog (which has almost no detail on massive changes) is below.
Cheers,
Clint
ATLAS 3.11.40 released 10/02/18, highlights of changes from 3.11.39:
* Basically a rewrite of the entire L3BLAS and LAPACK tuning framework:
  + Complete rewrite of all searches to allow different "views" of
    kernels for maximum performance for all-BLAS usage; present
    implementation very slow even with archdefs, will need to be sped
    up before stable
  + Complete rewrite of gemm kernel choice mechanism
  + Complete rewrite of all BLAS handling for much improved
    small/medium perf via greater use of microkernels
* Addition of core count to archdefs, because this usually increases
  block factors when maximizing performance
* Addition of -ansi flag to avoid C changes borking include files
* Archdef support for host of modern Intel/AMD + POWER9:
  - Corei264AVXp16, Corei3EP64AVXMACp36, Corei4X64AVXZp18,
  - AMD64K10h64SSE3p32, AMDRyzen64AVXMACp[8,16,64]
  - ARM64xgene164p8, ARM64thund64p48
  - POWER964LEVSXp8
* Addition of cpuid-based cache detection for Intel & AMD x86 machines
  - Presently gets wrong answer on some machines, where shared caches
    are either multiplied or divided by P inappropriately
* Beginning of rewrite of generic cache detection
* Fixed bug where names like "c99-gcc" preferred over "gcc"
* Added -Si indthr 1 option to autoprobe for aliased thread IDs
  + Presently, only supported on ARM64 & x86 with at least SSE2
* Complete rewrite of gemm kernel indexing to compact data structures
  and minimize cache pollution
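
For readers wondering how the cpuid-based cache detection item can end up off by a factor of P, here is a minimal sketch, assuming Intel's CPUID leaf 4 ("deterministic cache parameters") register layout. It is not ATLAS's detection code, and the function names cache_bytes and cache_sharers are hypothetical. The register values are passed in as plain integers, so the arithmetic can be checked on any machine without executing cpuid.

```c
#include <stdint.h>

/* Hypothetical helper, not ATLAS code: size in bytes of one cache
 * instance, decoded from the EBX and ECX values of CPUID leaf 4. */
static uint64_t cache_bytes(uint32_t ebx, uint32_t ecx)
{
    uint64_t line  = (ebx & 0xFFFu) + 1;          /* EBX bits 11:0, line size - 1 */
    uint64_t parts = ((ebx >> 12) & 0x3FFu) + 1;  /* EBX bits 21:12, partitions - 1 */
    uint64_t ways  = ((ebx >> 22) & 0x3FFu) + 1;  /* EBX bits 31:22, ways - 1 */
    uint64_t sets  = (uint64_t)ecx + 1;           /* ECX, number of sets - 1 */
    return ways * parts * line * sets;
}

/* Hypothetical helper: upper bound on the number of logical processors
 * sharing this cache, from EAX bits 25:14 (stored as value - 1). */
static uint32_t cache_sharers(uint32_t eax)
{
    return ((eax >> 14) & 0xFFFu) + 1;
}
```

With EBX = 0x01C0003F and ECX = 0x3F (8 ways, 1 partition, 64-byte lines, 64 sets), cache_bytes returns 32768, i.e. a 32KB cache. The size is that of one cache instance, while the sharer count says how many threads may compete for it; presumably it is in combining these two numbers with the core count that a multiply or divide by P can go wrong.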