Re: [atlas-devel] Compiling Atlas with hyperthreading
From: R. Clint Whaley <rcw...@ls...> - 2017-06-29 23:10:56
Yeah, if it can't get that performance without hyperthreading, it's not fully tuned. Back in the day when I investigated HT, the problem was really cache stomping, as two threads compete for the same cache. This makes the effects unpredictable: if the cache wasn't being fully utilized, maybe no effect; if you get lucky on the replacement, maybe a tiny gain; and if you get unlucky, a truly bad dropoff.

You might try running an actual application, where you get a mix of kernels. This tends to stress the cache more, and can sometimes expose the downside of HT. I remember finding a slight speedup in some cases, leading me to think HT was helpful, but then I had performance collapses in other places, which led me to recommend turning HT off (or using affinity to avoid it, like MKL is doing, if you can't turn it off) to maximize performance.

So, for instance, take LAPACK's or ATLAS's LU or QR (or your own version) and hook them up to the two BLAS. Does the non-MKL, HT-liking kernel get anywhere close to MKL performance despite its GEMM looking as good with HT, or does its performance collapse while MKL's holds up? My guess is the MKL group got the same "HT not reliable, non-HT is" results, and that's why it behaves this way.

Thanks for the results!

Clint
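A minimal driver for the kind of experiment Clint suggests might look like the sketch below: time LAPACK's dgetrf and report GFLOPS, then build the same file once per BLAS under test. This is an illustration, not code from the thread; it assumes a LAPACKE installation, and the problem size, seed, and link lines are placeholders.

    /* lu_time.c -- time LAPACK LU (dgetrf) against whichever BLAS it is
     * linked with.  Illustrative link lines (adjust for your install):
     *   gcc -O2 lu_time.c -o lu_mkl  -lmkl_rt
     *   gcc -O2 lu_time.c -o lu_alt  -llapacke -lblis
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <lapacke.h>

    int main(int argc, char **argv)
    {
        lapack_int n = (argc > 1) ? atoi(argv[1]) : 4000;
        double *a = malloc((size_t)n * n * sizeof *a);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        srand(7);
        for (size_t i = 0; i < (size_t)n * n; i++)
            a[i] = (double)rand() / RAND_MAX - 0.5;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* LU costs roughly (2/3)n^3 flops */
        printf("n=%d  info=%d  %.3f s  %.1f GFLOPS\n",
               (int)n, (int)info, s, (2.0 * n * n * n / 3.0) / s / 1e9);
        free(a); free(ipiv);
        return 0;
    }

Because LU mixes GEMM with TRSM and panel factorization, comparing its GFLOPS with HT on and off would show whether the GEMM-only advantage survives a kernel mix, which is exactly the question raised above.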
On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
> Good catch. strace shows only 35 calls to clone in both cases with MKL. I didn't know that MKL was doing these tricks.
>
> However, I tested another DGEMM implementation that supports AVX2; it uses all of the hyperthreads, and it performs on par with MKL, but only when HT is used.
>
> Jeff
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 71
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 35
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   204.027   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   650.820   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   816.355   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   835.650   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   832.179   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   863.123   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   844.502   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   860.262   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   851.694   5.80e-18   PASS
> blis_dgemm_nn_rrr  3840  3840  3840   856.526   6.79e-18   PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   161.331   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   437.967   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   545.498   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   616.338   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   606.650   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   611.153   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   603.314   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   631.292   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   625.833   5.80e-18   PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   159.789   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   443.810   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   536.077   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   596.069   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   595.763   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   616.531   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   591.823   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   615.153   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   621.714   5.80e-18   PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384   189.615   8.27e-18   PASS
> blis_dgemm_nn_rrr   768   768   768   423.504   5.36e-18   PASS
> blis_dgemm_nn_rrr  1152  1152  1152   445.424   4.40e-18   PASS
> blis_dgemm_nn_rrr  1536  1536  1536   444.830   7.02e-18   PASS
> blis_dgemm_nn_rrr  1920  1920  1920   442.893   9.96e-18   PASS
> blis_dgemm_nn_rrr  2304  2304  2304   445.979   6.28e-18   PASS
> blis_dgemm_nn_rrr  2688  2688  2688   445.694   8.28e-18   PASS
> blis_dgemm_nn_rrr  3072  3072  3072   451.026   9.92e-18   PASS
> blis_dgemm_nn_rrr  3456  3456  3456   454.909   5.80e-18   PASS
>
> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley <rcw...@ls...> wrote:
> Jeff,
>
> Have you run a thread monitor to see if MKL is simply not using the hyperthreads, regardless of whether HT is on or off in the BIOS?
>
> You also may want to try something like LU.
>
> Cheers,
> Clint
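One way to answer the thread-monitor question without a full profiler is the sketch below, which prints the logical CPU each OpenMP thread lands on, so you can see whether any two threads share a core's hyperthread siblings. This is a hedged illustration, not from the thread: it only reflects the BLAS's placement if the library uses the same OpenMP runtime and affinity settings as the caller (MKL driven by KMP_AFFINITY does); `top -H` or the strace counts above are the alternatives.

    /* whereami.c -- print the logical CPU each OpenMP thread runs on.
     * Run with the same OMP_NUM_THREADS / KMP_AFFINITY as the benchmark.
     * Build: gcc -O2 -fopenmp whereami.c -o whereami   (Linux/glibc)
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* sched_getcpu() reports the logical CPU of the calling thread */
            printf("OpenMP thread %2d -> logical CPU %3d\n",
                   omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }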
> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not using HT, at least with MKL DGEMM on an E5-2699v3 (Haswell). The 0.1-0.5% gain here is irrelevant and may be due to thermal effects (this box is in my cubicle, not an air-conditioned machine room).
>
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME              dim1    dim2    dim3   seconds     Gflop/s
> Intel MKL (parallel)   15360   15360   1536   0.8582699   844.4612765
> Intel MKL (parallel)   15360   15360   1536   0.8627163   840.1089930
>
> HT on:
>
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME              dim1    dim2    dim3   seconds     Gflop/s
> Intel MKL (parallel)   15360   15360   1536   0.8636520   839.1988073
> Intel MKL (parallel)   15360   15360   1536   0.8644268   838.4465853
>
> I would be interested to see folks post data to support the argument against HT.
>
> Jeff
>
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <mat...@li...> wrote:
> Thank you very much for the quick response. Just to check that my understanding is correct:
>
> 1. By turning off hyperthreading in the BIOS, I only need to use -t N to build ATLAS, right?
>
> 2. The N in -t N is the total number of threads on the machine, not per CPU, right?
>
> 3. One more question: how do I set the correct -t N for an MPI-based application? Say, on the 2-CPU machine with 4 cores per CPU, should I use -t 4 or -t 8 if I run my application with 2 MPI processes:
> mpirun -n 2 myprogram
>
> Many thanks!
>
> Sent from Yahoo Mail on Android
>
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley <wh...@my...> wrote:
> Hyperthreading is an optimization aimed at poorly optimized code. The idea is that most codes cannot drive the backend hardware (ALU/FPU, etc.) at the maximal rate, so if you duplicate the registers you can, amongst several threads, find enough work to keep the backend busy.
>
> ATLAS (or any optimized linear algebra library) already runs the FPU at the maximal rate supported by the cache architecture, thanks to cache blocking. If you can already drive the backend at >90% of peak, then hyperthreading can actually *lose* you performance, as the threads bring conflicting data into the cache.
>
> It's usually not a night-and-day difference, but I haven't measured it in the huge-blocking era of the recent developer releases (it may be worse there).
>
> My general recommendation: turn off hyperthreading for highly optimized codes, and turn it on for relatively unoptimized codes.
>
> As to which core IDs correspond to the physical cores, that varies by machine. On x86 you can use CPUID to determine it, if you are super-knowledgeable. I usually just turn hyperthreading off in the BIOS, because I don't like having something running that may thrash my cache, even if it might occasionally help :)
>
> Cheers,
> Clint
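On Linux you do not need raw CPUID to recover the mapping mentioned above; sysfs exposes it. The sketch below (Linux-specific, illustrative, not from the thread) prints which physical core each logical CPU belongs to, so you can pick one logical CPU per core for an affinity list like the -tl one discussed in the quoted message that follows.

    /* coremap.c -- map logical CPUs to physical cores via Linux sysfs.
     * Logical CPUs reporting the same core_id (within one package) are
     * hyperthread siblings; list only one of each to keep threads off
     * the sibling hyperthreads.  Build: gcc -O2 coremap.c -o coremap
     */
    #include <stdio.h>

    int main(void)
    {
        char path[128];
        for (int cpu = 0; ; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            FILE *f = fopen(path, "r");
            if (!f)            /* no such CPU => done */
                break;
            int core;
            if (fscanf(f, "%d", &core) == 1)
                printf("logical CPU %2d -> physical core %2d\n", cpu, core);
            fclose(f);
        }
        return 0;
    }

On a 2x18-core box like the one benchmarked above, logical CPUs 0-35 typically map one per core, with 36-71 as their siblings, which is why scatter affinity with 36 threads avoids HT entirely.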
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello. I would like to check whether my understanding is correct for compiling ATLAS on a machine that has multiple CPUs and hyperthreading.
>
> I have two types of machine:
> - 2 CPUs, each with 4 cores, hyperthreaded, 2 threads per core
> - 2 CPUs, each with 8 cores, hyperthreaded, 2 threads per core
>
> So when I compile ATLAS, is it correct that I should use -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,...,15 (assuming the affinity IDs run from 0-7 and 0-15)? That means the number 8 or 16 is the total number of cores on the machine, not the number of cores per CPU. Am I correct?
>
> I also read somewhere that ATLAS supports hyperthreading. What does this mean? Does it mean:
> 1. I do not need to disable hyperthreading in the BIOS (no performance difference whether it is enabled or disabled, as long as the number of threads and affinity IDs are set correctly when compiling ATLAS), or
> 2. I can make use of the hyperthreads, that is, -tl 16 and -tl 32?
>
> Thank you very much,
> lixin

--
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
**********************************************************************

_______________________________________________
Math-atlas-devel mailing list
Mat...@li...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel