[atlas-devel] 3.11.40: I am not dead

Guys,

Sorry to spam both lists and any dups that causes, but since it has 
looked like I've retired, I'm sending this to atlas-devel & announce.

3.11.40 has finally been released.  I have actually been working on it 
for most of this time, but, with the move to Indiana factored in, it has 
taken me this long to get the framework working again!

The reason is that we have essentially rewritten the entire way 
microkernels are tuned and accessed in the library.  Therefore, the 
majority of tuning code has been touched or rewritten, and since this 
includes all the generation, etc, it took a long while to get things at 
all reliable.

The end goal is that increased microkernel specialization should greatly 
increase our weird-shape and parallel scaling performance.

Right now, you will hopefully see much better serial non-GEMM BLAS 
performance (e.g., small-triangle TRSM or TRMM).  Very large problems 
aren't likely to see a huge difference if prior releases supported your 
architecture well.  Where they did not, the gain can be large: we've 
added AVX-512 to the code generators, which obviously will hugely 
improve SkylakeX asymptotic performance.

The installs have gone from long to endless, unfortunately.  I will fix 
this before stable, but right now searches are all brute-force and 
ignorance while we concentrate on getting the last of the microkernel 
handling solidified.  I will attempt to speed up search later, and allow 
for a "no-timing" install from archdefs, so that people on 
already-supported platforms can skip most or all of the tuning (a 
feature many maintainers have long wanted).

For now, terrible install times will just be a feature until we finish 
debugging and publish the new BLAS approach.

The major weakness in the install when run on arbitrary machines right 
now (other than time) is in some new cache detection code that creates a 
file called atlas_cache.h.  This code dies on several machines, and I 
haven't had time to track down details.  However, if it fails for you, 
open up a tracker item and I can tell you how to proceed beyond it even 
before fixing the code in question.

Hopefully, this release should be purely faster than any other that came 
before, but if you spot performance regressions, please let us know.  We 
are not yet always using the correct microkernel (even when the library 
has built it), because our selection algorithm work is awaiting the 
finishing of the new tuning strategy.

Eventually, ATLAS will be able to tune microkernels not only to build 
the BLAS/LAPACK, but also to build specialized operations for people 
wanting to avoid BLAS overheads (at the cost of calling messy 
microkernels; think of things like tensor algebra with very small shapes 
that need to scale, perhaps machine learning, etc.).  This gives you the 
detailed cache control necessary to scale when the problem size isn't 
large enough to dominate the low-order terms; it is only for problems 
that large that the BLAS API is OK.

ChangeLog (which has almost no detail on massive changes) is below.

Cheers,
Clint

ATLAS 3.11.40 released 10/02/18, highlights of changes from 3.11.39:
    * Basically a rewrite of all L3BLAS and LAPACK tuning framework:
      + Complete rewrite of all searches to allow different "views" of kernels
        for maximum performance for all-BLAS usage;  present implementation
        very slow even with archdefs, will need to be sped up before stable
      + Complete rewrite of gemm kernel choice mechanism
      + Complete rewrite of all BLAS handling for much improved small/medium
        perf via greater use of microkernels
    * Addition of core count to archdefs, because this usually increases block
      factors when maximizing performance
    * Addition of -ansi flag to avoid C changes borking include files
    * Archdef support for host of modern Intel/AMD + POWER9:
      - Corei264AVXp16, Corei3EP64AVXMACp36, Corei4X64AVXZp18,
      - AMD64K10h64SSE3p32, AMDRyzen64AVXMACp[8,16,64]
      - ARM64xgene164p8, ARM64thund64p48
      - POWER964LEVSXp8
    * Addition of cpuid-based cache detection for Intel & AMD x86 machines
      - Presently gets wrong answer on some machines, where shared caches
        are either multiplied or divided by P inappropriately
    * Beginning of rewrite of generic cache detection
    * Fixed bug where names like "c99-gcc" preferred over "gcc"
    * Added -Si indthr 1 option to autoprobe for aliased thread IDs
      + Presently, only supported on ARM64 & x86 with at least SSE2
    * Complete rewrite of gemm kernel indexing to compact data structures
      and minimize cache pollution
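
On the cpuid-based cache detection item above: what follows is not the 
ATLAS detection code, just a minimal sketch of how cache sizes can be 
read from Intel's cpuid leaf 4, assuming GCC or Clang and their 
<cpuid.h> helper __get_cpuid_count.  The names cache_info, decode_leaf4 
and print_caches are made up for illustration.  The "sharing" field is 
the maximum number of logical processors sharing a cache, which is 
presumably the kind of quantity a detector has to apply carefully when 
working out how much of a shared cache each core gets.

```c
#include <stdint.h>
#include <stdio.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header */
#endif

/* Decoded fields of one cpuid leaf-4 subleaf (hypothetical struct). */
typedef struct {
    unsigned type;     /* 0 = no more caches, 1 = data, 2 = instr, 3 = unified */
    unsigned level;    /* 1, 2, 3, ... */
    unsigned sharing;  /* max logical processors sharing this cache */
    uint64_t bytes;    /* total size of this cache in bytes */
} cache_info;

/* Pure decode of the leaf-4 register layout; each size field is stored
 * as (value - 1), so 1 is added back before multiplying. */
cache_info decode_leaf4(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    cache_info c;
    uint64_t line  = (ebx & 0xFFF) + 1;          /* EBX[11:0]  line size   */
    uint64_t parts = ((ebx >> 12) & 0x3FF) + 1;  /* EBX[21:12] partitions  */
    uint64_t ways  = ((ebx >> 22) & 0x3FF) + 1;  /* EBX[31:22] ways        */
    uint64_t sets  = (uint64_t)ecx + 1;          /* ECX        sets        */

    c.type    = eax & 0x1F;                      /* EAX[4:0]   */
    c.level   = (eax >> 5) & 0x7;                /* EAX[7:5]   */
    c.sharing = ((eax >> 14) & 0xFFF) + 1;       /* EAX[25:14] */
    c.bytes   = ways * parts * line * sets;
    return c;
}

/* Walk the subleaves until type 0; does nothing on non-x86 builds. */
void print_caches(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned sub;
    for (sub = 0; sub < 16; sub++) {
        unsigned eax, ebx, ecx, edx;
        cache_info c;
        if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx))
            break;                               /* leaf 4 not supported */
        c = decode_leaf4(eax, ebx, ecx);
        if (c.type == 0)
            break;                               /* no more caches */
        printf("L%u type=%u size=%llu bytes, shared by up to %u threads\n",
               c.level, c.type, (unsigned long long)c.bytes, c.sharing);
    }
#endif
}
```

Calling print_caches() from main() lists whatever the processor reports; 
on machines that do not implement leaf 4 it prints nothing.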