Re: [atlas-devel] Compiling Atlas with hyperthreading
From: R. Clint Whaley <rcw...@ls...> - 2017-06-29 23:10:56
Yeah, if it can't get that performance without hyperthreading, it's not fully tuned. Back in the day when I investigated HT, the problem was really cache stomping, as two threads compete for the same cache. This makes the effects unpredictable: if the cache wasn't being fully utilized, maybe no effect; if you get lucky on the replacement, maybe a tiny gain; and if you get unlucky, a truly bad dropoff.

You might try running an actual application, where you get a mix of kernels. This tends to stress the cache more, and can sometimes expose the downside of HT. I remember finding a slight speedup in some cases, leading me to think HT was helpful, but then I had performance collapses in other places, which led me to recommend turning HT off (or using affinity to avoid it, like MKL is doing, if you can't turn it off) to maximize performance.

So, for instance, take LAPACK's or ATLAS's LU or QR (or your own version) and hook them up to the two BLAS. Does the non-MKL, HT-liking kernel get anywhere close to MKL performance despite its GEMM looking as good with HT, or does its performance collapse while MKL's holds up? My guess is the MKL group got the same "HT not reliable, non-HT is" results, and that's why it behaves this way.

Thanks for the results!

Clint
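A minimal driver for the kind of experiment Clint suggests might look like the sketch below: time LAPACK's dgetrf and report GFLOPS, then build the same file once per BLAS under test. This is an illustration, not code from the thread; it assumes a LAPACKE installation, and the problem size, seed, and link lines are placeholders.

    /* lu_time.c -- time LAPACK LU (dgetrf) against whichever BLAS it is
     * linked with.  Illustrative link lines (adjust for your install):
     *   gcc -O2 lu_time.c -o lu_mkl  -lmkl_rt
     *   gcc -O2 lu_time.c -o lu_alt  -llapacke -lblis
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <lapacke.h>

    int main(int argc, char **argv)
    {
        lapack_int n = (argc > 1) ? atoi(argv[1]) : 4000;
        double *a = malloc((size_t)n * n * sizeof *a);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        srand(7);
        for (size_t i = 0; i < (size_t)n * n; i++)
            a[i] = (double)rand() / RAND_MAX - 0.5;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* LU costs roughly (2/3)n^3 flops */
        printf("n=%d  info=%d  %.3f s  %.1f GFLOPS\n",
               (int)n, (int)info, s, (2.0 * n * n * n / 3.0) / s / 1e9);
        free(a); free(ipiv);
        return 0;
    }

Because LU mixes GEMM with TRSM and panel factorization, comparing its GFLOPS with HT on and off would show whether the GEMM-only advantage survives a kernel mix, which is exactly the question raised above.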
On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
> Good catch. strace shows only 35 calls to clone in both cases with MKL. I didn't know that MKL was doing these tricks.
>
> However, I tested another DGEMM implementation that supports AVX2; it uses all of the hyperthreads, and it performs on par with MKL, but only when HT is used.
>
> Jeff
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 71
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 35
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   204.027   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   650.820   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   816.355   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   835.650   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   832.179   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   863.123   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   844.502   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   860.262   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   851.694   5.80e-18   PASS
> blis_dgemm_nn_rrr  3840  3840  3840   856.526   6.79e-18   PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   161.331   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   437.967   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   545.498   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   616.338   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   606.650   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   611.153   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   603.314   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   631.292   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   625.833   5.80e-18   PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   159.789   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   443.810   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   536.077   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   596.069   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   595.763   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   616.531   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   591.823   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   615.153   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   621.714   5.80e-18   PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   189.615   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   423.504   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   445.424   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   444.830   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   442.893   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   445.979   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   445.694   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   451.026   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   454.909   5.80e-18   PASS
>
> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley <rcw...@ls...> wrote:
> Jeff,
>
> Have you run a thread monitor to see if MKL is simply not using the hyperthreads, regardless of whether HT is on or off in the BIOS?
>
> You also may want to try something like LU.
>
> Cheers,
> Clint
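One way to answer the thread-monitor question without a full profiler is the sketch below, which prints the logical CPU each OpenMP thread lands on, so you can see whether any two threads share a core's hyperthread siblings. This is a hedged illustration, not from the thread: it only reflects the BLAS's placement if the library uses the same OpenMP runtime and affinity settings as the caller (MKL driven by KMP_AFFINITY does); `top -H` or the strace counts above are the alternatives.

    /* whereami.c -- print the logical CPU each OpenMP thread runs on.
     * Run with the same OMP_NUM_THREADS / KMP_AFFINITY as the benchmark.
     * Build: gcc -O2 -fopenmp whereami.c -o whereami   (Linux/glibc)
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* sched_getcpu() reports the logical CPU of the calling thread */
            printf("OpenMP thread %2d -> logical CPU %3d\n",
                   omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }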
> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not using HT, at least with MKL DGEMM on an E5-2699v3 (Haswell). The 0.1-0.5% gain here is irrelevant and may be due to thermal effects (this box is in my cubicle, not an air-conditioned machine room).
>
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME              dim1    dim2    dim3   seconds     Gflop/s
> Intel MKL (parallel)   15360   15360   1536   0.8582699   844.4612765
> Intel MKL (parallel)   15360   15360   1536   0.8627163   840.1089930
>
> HT on:
>
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME              dim1    dim2    dim3   seconds     Gflop/s
> Intel MKL (parallel)   15360   15360   1536   0.8636520   839.1988073
> Intel MKL (parallel)   15360   15360   1536   0.8644268   838.4465853
>
> I would be interested to see folks post data to support the argument against HT.
>
> Jeff
>
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <mat...@li...> wrote:
> Thank you very much for the quick response. Just to check that my understanding is correct:
>
> 1. By turning off hyperthreading in the BIOS, I only need to use -t N to build ATLAS, right?
>
> 2. The N in -t N is the total number of threads on the machine, not per CPU, right?
>
> 3. One more question: how do I set the correct -t N for an MPI-based application? Say, on the 2-CPU machine with 4 cores per CPU, should I use -t 4 or -t 8 if I run my application with 2 MPI processes:
> mpirun -n 2 myprogram
>
> Many thanks!
>
> Sent from Yahoo Mail on Android
>
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley <wh...@my...> wrote:
> Hyperthreading is an optimization aimed at poorly optimized code. The idea is that most codes cannot drive the backend hardware (ALU/FPU, etc.) at the maximal rate, so if you duplicate the registers you can, amongst several threads, find enough work to keep the backend busy.
>
> ATLAS (or any optimized linear algebra library) already runs the FPU at the maximal rate supported by the cache architecture, thanks to cache blocking. If you can already drive the backend at >90% of peak, then hyperthreading can actually *lose* you performance, as the threads bring conflicting data into the cache.
>
> It's usually not a night-and-day difference, but I haven't measured it in the huge-blocking era of the recent developer releases (it may be worse there).
>
> My general recommendation: turn off hyperthreading for highly optimized codes, and turn it on for relatively unoptimized codes.
>
> As to which core IDs correspond to the physical cores, that varies by machine. On x86 you can use CPUID to determine it, if you are super-knowledgeable. I usually just turn hyperthreading off in the BIOS, because I don't like having something running that may thrash my cache, even if it might occasionally help :)
>
> Cheers,
> Clint
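On Linux you do not need raw CPUID to recover the mapping mentioned above; sysfs exposes it. The sketch below (Linux-specific, illustrative, not from the thread) prints which physical core each logical CPU belongs to, so you can pick one logical CPU per core for an affinity list like the -tl one discussed in the quoted message that follows.

    /* coremap.c -- map logical CPUs to physical cores via Linux sysfs.
     * Logical CPUs reporting the same core_id (within one package) are
     * hyperthread siblings; list only one of each to keep threads off
     * the sibling hyperthreads.  Build: gcc -O2 coremap.c -o coremap
     */
    #include <stdio.h>

    int main(void)
    {
        char path[128];
        for (int cpu = 0; ; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            FILE *f = fopen(path, "r");
            if (!f)            /* no such CPU => done */
                break;
            int core;
            if (fscanf(f, "%d", &core) == 1)
                printf("logical CPU %2d -> physical core %2d\n", cpu, core);
            fclose(f);
        }
        return 0;
    }

On a 2x18-core box like the one benchmarked above, logical CPUs 0-35 typically map one per core, with 36-71 as their siblings, which is why scatter affinity with 36 threads avoids HT entirely.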
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello. I would like to check whether my understanding is correct for compiling ATLAS on a machine that has multiple CPUs and hyperthreading.
>
> I have two types of machine:
> - 2 CPUs, each with 4 cores, hyperthreaded, 2 threads per core
> - 2 CPUs, each with 8 cores, hyperthreaded, 2 threads per core
>
> So when I compile ATLAS, is it correct that I should use -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,...,15 (assuming the affinity IDs run from 0-7 and 0-15)? That means the number 8 or 16 is the total number of cores on the machine, not the number of cores per CPU. Am I correct?
>
> I also read somewhere that ATLAS supports hyperthreading. What does this mean? Does it mean:
> 1. I do not need to disable hyperthreading in the BIOS (no performance difference whether it is enabled or disabled, as long as the number of threads and affinity IDs are set correctly when compiling ATLAS), or
> 2. I can make use of the hyperthreads, that is, -tl 16 and -tl 32?
>
> Thank you very much,
> lixin

--
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************

_______________________________________________
Math-atlas-devel mailing list
Mat...@li...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel