Re: [atlas-devel] Compiling Atlas with hyperthreading
From: R. C. W. <wh...@my...> - 2017-06-30 00:14:15
> The implementation of HT has improved over the years, so please don't
> assume results obtained on older processors are applicable to the
> current ones. I used to be a HT skeptic, but almost everything runs
> faster with them on Haswell and later, particularly the client parts
> (i.e., Core series as opposed to Xeon).

Unless they have changed the definition of what HT does, I do not see a
theoretical way to avoid the cache problem.

>> You might try running an actual application, where you get a mix of
>> kernels. This tends to stress the cache more, and can sometimes
>> expose the downside of HT.
>
> On the other hand, idle HTs help with OS interrupts and other stuff
> that happens quite a bit in an HPC environment once one starts using
> MPI etc. This is one of the reasons I encourage everyone to enable HT
> in the BIOS even if their applications don't use them.

If the OS interrupts, it's interrupting all threads, so I don't think
I'm following this line of thought. Maybe you mean that if you have a
huge stack of threads to be run, with HT you have 2 or 4 slots to
round-robin into once interrupted?

>> I remember finding slight speedups in some cases, leading me to
>> think HT was helpful, but then I had performance collapses in other
>> places, which led me to recommend turning it off (or using affinity
>> to avoid it, like MKL is doing, if you can't turn it off) to
>> maximize performance.
>
> If nothing else, HT doubles the number of threads, which hurts any
> part of a code that scales poorly, and it makes it harder to manage
> affinity. I had to spend quite a bit of time helping users with SMT
> (2-4 HW threads per core) on Blue Gene/Q in my old job.
>
>> So, for instance, take LAPACK or ATLAS LU or QR (or your own
>> version) and hook them up to the two BLAS. Does the non-MKL,
>> HT-liking kernel get anywhere close to MKL performance despite its
>> gemm looking as good with HT, or does its performance collapse
>> while MKL's holds up?
>
> I don't have test drivers for those already, so I'm afraid I'm going
> to punt on those experiments. However, if somebody else posts the
> code, I'll certainly run it and post results for generally available
> hardware.

ATLAS comes with timers for any or all of these. They are built to time
other libraries too. For instance, set BLASlib to MKL, set FLAPACKlib
to your Fortran77 LAPACK, and "make xdtlatime_fl_sb" will time using
MKL + LAPACK. Now switch BLASlib to BLIS, remake, and voila.
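For concreteness, a minimal sketch of that recipe. The BLASlib and
FLAPACKlib variables and the xdtlatime_fl_sb target are the ones named
above; the library paths and link flags are illustrative placeholders
that must be adapted to your own install:

    # --- Make.inc fragment (in the ATLAS build directory) ---
    # Point the timers at external libraries (placeholder paths/flags):
    BLASlib = -L/opt/intel/mkl/lib/intel64 -lmkl_rt -lpthread -lm
    FLAPACKlib = /path/to/liblapack.a

    # --- then, from the build tree (exact subdirectory depends on
    # --- your setup):
    make xdtlatime_fl_sb    # times the BLAS+LAPACK combo configured above
    # Edit BLASlib to point at libblis.a and remake to time BLIS instead.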
>> My guess is the MKL group got the same "HT not-reliable, non-HT is"
>> results, and that's why it's behaving in this way.
>
> Maybe. In any case, it simplifies the design space to not have to
> think about >1 threads sharing an L1.

L1 is not the problem on modern machines. As you scale up, as with the
Xeon E series, you need to use every scrap of cache, including the
shared levels. If you use the full scale of something like 12 cores per
shared cache, I believe you will see substantial slowdowns from HT.

Cheers,
Clint

> Jeff
>
> Thanks for results!
> Clint
>
> On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
> Good catch. strace shows only 35 calls to clone in both cases with
> MKL. I didn't know that MKL was doing these tricks.
>
> However, I tested another DGEMM implementation that supports AVX2,
> and it uses all of the HTs and performs on par with MKL, but only
> when HT is used.
>
> Jeff
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 71
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 35
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  204.027  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  650.820  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  816.355  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  835.650  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  832.179  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  863.123  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  844.502  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  860.262  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  851.694  5.80e-18  PASS
> blis_dgemm_nn_rrr  3840  3840  3840  856.526  6.79e-18  PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  161.331  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  437.967  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  545.498  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  616.338  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  606.650  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  611.153  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  603.314  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  631.292  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  625.833  5.80e-18  PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  159.789  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  443.810  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  536.077  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  596.069  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  595.763  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  616.531  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  591.823  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  615.153  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  621.714  5.80e-18  PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  189.615  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  423.504  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  445.424  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  444.830  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  442.893  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  445.979  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  445.694  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  451.026  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  454.909  5.80e-18  PASS
>
> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley <rcw...@ls...> wrote:
> Jeff,
>
> Have you run a thread monitor to see if MKL is simply not using the
> hyperthreading regardless of whether it is on or off in BIOS?
>
> You also may want to try something like LU.
>
> Cheers,
> Clint
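One quick way to do that thread-monitor check with stock Linux tools;
the PID below is a placeholder for the running benchmark's process ID:

    # Count how many threads the process actually has
    # (NLWP = number of lightweight processes, i.e., threads):
    ps -o nlwp= -p <pid>
    # Or watch per-thread CPU load live; if MKL is avoiding the
    # hyperthreads, only one HW thread per core will show heavy use:
    top -H -p <pid>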
> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not using
> HT, at least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5%
> gain here is irrelevant and may be due to thermal effects (this box
> is in my cubicle, not an air-conditioned machine room).
>
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME             dim1   dim2   dim3  seconds    Gflop/s
> Intel MKL (parallel)  15360  15360  1536  0.8582699  844.4612765
> Intel MKL (parallel)  15360  15360  1536  0.8627163  840.1089930
>
> HT on
>
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME             dim1   dim2   dim3  seconds    Gflop/s
> Intel MKL (parallel)  15360  15360  1536  0.8636520  839.1988073
> Intel MKL (parallel)  15360  15360  1536  0.8644268  838.4465853
>
> I would be interested to see folks post data to support the argument
> against HT.
>
> Jeff
>
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <mat...@li...> wrote:
> Thank you very much for the quick response. Just to check if my
> understanding is correct:
>
> 1. By turning off hyperthreading in the BIOS, I only need to use
> -t N to build ATLAS, right?
>
> 2. The N in -t N is the total number of threads on the machine, not
> per CPU, right?
>
> 3. One more question I have is how to set the correct -t N for an
> MPI-based application. Let's say on the 2-CPU machine with 4 cores
> per CPU, should I use -t 4 or -t 8 if I run my application with 2
> MPI processes: mpirun -n 2 myprogram
>
> Many thanks!
>
> Sent from Yahoo Mail on Android
>
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley <wh...@my...> wrote:
> Hyperthreading is an optimization aimed at addressing poorly
> optimized code. The idea is that most codes cannot drive the backend
> hardware (ALU/FPU, etc.) at the maximal rate, so if you duplicate
> registers you can, amongst several threads, find enough work to keep
> the backend busy.
>
> ATLAS (or any optimized linear algebra library) already runs the FPU
> at the maximal rate supported by the cache architecture after cache
> blocking.
>
> If you can already drive the backend at >90% of peak, then
> hyperthreading can actually *lose* you performance, as the threads
> bring conflicting data into the cache.
>
> It's usually not a night-and-day difference, but I haven't measured
> it in the huge-blocking era used by recent developer releases (it
> may be worse there).
>
> My general recommendation: turn off hyperthreading for highly
> optimized codes, and turn it on for relatively unoptimized codes.
>
> As to which core IDs correspond to the physical cores, that varies
> by machine. On x86, you can use CPUID to determine that if you are
> super-knowledgeable. I usually just turn it off in the BIOS, because
> I don't like something that may thrash my cache running, even if it
> might occasionally help :)
>
> Cheers,
> Clint
>
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello,
>
> I would like to check if my understanding is correct for compiling
> ATLAS on a machine that has multiple CPUs and hyperthreading. I have
> two types of machine:
>
> - 2 CPUs, each with 4 cores, hyperthreaded, 2 threads per core
> - 2 CPUs, each with 8 cores, hyperthreaded, 2 threads per core
>
> So when I compile ATLAS, is it correct that I should use
> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,...,15 (assuming the affinity
> IDs are 0-7 and 0-15)? That means the number 8 or 16 is the total
> number of cores on the machine, not the number of cores per CPU. Am
> I correct?
>
> I also read somewhere that ATLAS supports hyperthreading. What does
> this mean? Does it mean:
>
> 1. I do not need to disable hyperthreading in the BIOS (no
> performance difference whether it is enabled or disabled, as long as
> the number of threads and affinity IDs are set correctly when
> compiling ATLAS), or
> 2. I can make use of the hyperthreads, that is, -tl 16 and -tl 32?
>
> Thank you very much,
> lixin
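For reference, a hedged sketch of the two configure variants under
discussion, for the 2-CPU, 8-cores-per-CPU machine. The -t/-tl flag
syntax follows the question above; the out-of-tree build layout is
the usual ATLAS convention, and the paths are illustrative:

    mkdir build && cd build
    # Let ATLAS use 16 threads (one per physical core, HT off in BIOS):
    /path/to/ATLAS/configure -t 16
    # Or pin threads to an explicit affinity-ID list, one ID per
    # physical core:
    /path/to/ATLAS/configure -tl 16 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15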
> --
> Jeff Hammond
> jef...@gm...
> http://jeffhammond.github.io/
>
> --
> **********************************************************************
> ** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
> **********************************************************************
------------------------------------------------------------------------------
_______________________________________________
Math-atlas-devel mailing list
Mat...@li...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel