Re: [atlas-devel] Compiling Atlas with hyperthreading
From: R. C. W. <wh...@my...> - 2017-06-30 00:14:15
> The implementation of HT has improved over the years, so please don't
> assume results obtained on older processors are applicable to the
> current ones. I used to be a HT skeptic, but almost everything runs
> faster with them on Haswell and later, particularly the client parts
> (i.e., Core series as opposed to Xeon).

Unless they have changed the definition of what HT does, I do not see a
theoretical way to avoid the cache problem.

>> You might try running an actual application, where you get a mix of
>> kernels. This tends to stress the cache more, and can sometimes
>> expose the downside of HT.
>
> On the other hand, idle HTs help with OS interrupts and other stuff
> that happens quite a bit in an HPC environment once one starts using
> MPI etc. This is one of the reasons I encourage everyone to enable HT
> in the BIOS even if their applications don't use them.

If the OS interrupts, it's interrupting all threads, so I don't think
I'm following this line of thought. Maybe you mean that if you have a
huge stack of threads to be run, with HT you have 2 or 4 slots to
round-robin into once interrupted?

>> I remember finding slight speedups in some cases, leading me to
>> think HT was helpful, but then I had performance collapses in other
>> places, which led me to recommend turning it off (or using affinity
>> to avoid it, like MKL is doing, if you can't turn it off) to
>> maximize performance.
>
> If nothing else, HT doubles the number of threads, which hurts any
> part of a code that scales poorly, and it makes it harder to manage
> affinity. I had to spend quite a bit of time helping users with SMT
> (2-4 HW threads per core) on Blue Gene/Q in my old job.
>
>> So, for instance, take LAPACK or ATLAS LU or QR (or your own
>> version) and hook them up to the two BLAS. Does the non-MKL,
>> HT-liking kernel get anywhere close to MKL performance despite its
>> gemm looking as good with HT, or does its performance collapse
>> while MKL's holds up?
>
> I don't have test drivers for those already, so I'm afraid I'm going
> to punt on those experiments. However, if somebody else posts the
> code, I'll certainly run it and post results for generally available
> hardware.

ATLAS comes with timers for any or all of these. They are built to time
other libraries too. For instance, set BLASlib to MKL, set FLAPACKlib
to your Fortran77 LAPACK, and "make xdtlatime_fl_sb" will time using
MKL + LAPACK. Now switch BLASlib to BLIS, remake, and voila.
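For concreteness, a minimal sketch of that recipe. The BLASlib and
FLAPACKlib variables and the xdtlatime_fl_sb target are the ones named
above; the library paths and link flags are illustrative placeholders
that must be adapted to your own install:

    # --- Make.inc fragment (in the ATLAS build directory) ---
    # Point the timers at external libraries (placeholder paths/flags):
    BLASlib = -L/opt/intel/mkl/lib/intel64 -lmkl_rt -lpthread -lm
    FLAPACKlib = /path/to/liblapack.a

    # --- then, from the build tree (exact subdirectory depends on
    # --- your setup):
    make xdtlatime_fl_sb    # times the BLAS+LAPACK combo configured above
    # Edit BLASlib to point at libblis.a and remake to time BLIS instead.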
>> My guess is the MKL group got the same "HT not-reliable, non-HT is"
>> results, and that's why it's behaving in this way.
>
> Maybe. In any case, it simplifies the design space to not have to
> think about >1 threads sharing an L1.

L1 is not the problem on modern machines. As you scale up, as with the
Xeon E series, you need to use every scrap of cache, including the
shared levels. If you use the full scale of something like 12 cores per
shared cache, I believe you will see substantial slowdowns from HT.

Cheers,
Clint

> Jeff
>
> Thanks for results!
> Clint
>
> On 06/29/2017 05:56 PM, Hammond, Jeff R wrote:
> Good catch. strace shows only 35 calls to clone in both cases with
> MKL. I didn't know that MKL was doing these tricks.
>
> However, I tested another DGEMM implementation that supports AVX2,
> and it uses all of the HTs and performs on par with MKL, but only
> when HT is used.
>
> Jeff
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 71
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine strace ../test_libblis.x 2>&1 | head -n5000 | grep -c clone
> 35
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  204.027  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  650.820  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  816.355  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  835.650  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  832.179  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  863.123  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  844.502  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  860.262  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  851.694  5.80e-18  PASS
> blis_dgemm_nn_rrr  3840  3840  3840  856.526  6.79e-18  PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  161.331  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  437.967  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  545.498  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  616.338  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  606.650  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  611.153  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  603.314  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  631.292  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  625.833  5.80e-18  PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  159.789  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  443.810  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  536.077  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  596.069  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  595.763  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  616.531  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  591.823  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  615.153  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  621.714  5.80e-18  PASS
>
> [jrhammon@esgmonster testsuite]$ OMP_NUM_THREADS=36 KMP_AFFINITY=compact,granularity=fine ../test_libblis.x | grep -v "%"
> blis_dgemm_nn_rrr   384   384   384  189.615  8.27e-18  PASS
> blis_dgemm_nn_rrr   768   768   768  423.504  5.36e-18  PASS
> blis_dgemm_nn_rrr  1152  1152  1152  445.424  4.40e-18  PASS
> blis_dgemm_nn_rrr  1536  1536  1536  444.830  7.02e-18  PASS
> blis_dgemm_nn_rrr  1920  1920  1920  442.893  9.96e-18  PASS
> blis_dgemm_nn_rrr  2304  2304  2304  445.979  6.28e-18  PASS
> blis_dgemm_nn_rrr  2688  2688  2688  445.694  8.28e-18  PASS
> blis_dgemm_nn_rrr  3072  3072  3072  451.026  9.92e-18  PASS
> blis_dgemm_nn_rrr  3456  3456  3456  454.909  5.80e-18  PASS
>
> On Thu, Jun 29, 2017 at 3:22 PM, R. Clint Whaley <rcw...@ls...> wrote:
> Jeff,
>
> Have you run a thread monitor to see if MKL is simply not using the
> hyperthreading regardless of whether it is on or off in BIOS?
>
> You also may want to try something like LU.
>
> Cheers,
> Clint
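One quick way to do that thread-monitor check with stock Linux tools;
the PID below is a placeholder for the running benchmark's process ID:

    # Count how many threads the process actually has
    # (NLWP = number of lightweight processes, i.e., threads):
    ps -o nlwp= -p <pid>
    # Or watch per-thread CPU load live; if MKL is avoiding the
    # hyperthreads, only one HW thread per core will show heavy use:
    top -H -p <pid>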
> On 06/29/2017 05:15 PM, Jeff Hammond wrote:
> I don't see any negative impact from using HT relative to not using
> HT, at least with MKL DGEMM on E5-2699v3 (Haswell). The 0.1-0.5%
> gain here is irrelevant and may be due to thermal effects (this box
> is in my cubicle, not an air-conditioned machine room).
>
> $ OMP_NUM_THREADS=36 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME             dim1   dim2   dim3  seconds    Gflop/s
> Intel MKL (parallel)  15360  15360  1536  0.8582699  844.4612765
> Intel MKL (parallel)  15360  15360  1536  0.8627163  840.1089930
>
> HT on
>
> $ OMP_NUM_THREADS=72 KMP_AFFINITY=scatter,granularity=fine ./dgemm_perf_PMKL.x $((384*40)) $((384*40)) $((384*4))
> BLAS_NAME             dim1   dim2   dim3  seconds    Gflop/s
> Intel MKL (parallel)  15360  15360  1536  0.8636520  839.1988073
> Intel MKL (parallel)  15360  15360  1536  0.8644268  838.4465853
>
> I would be interested to see folks post data to support the argument
> against HT.
>
> Jeff
>
> On Thu, Jun 29, 2017 at 7:57 AM, lixin chu via Math-atlas-devel <mat...@li...> wrote:
> Thank you very much for the quick response. Just to check if my
> understanding is correct:
>
> 1. By turning off hyperthreading in the BIOS, I only need to use
> -t N to build ATLAS, right?
>
> 2. The N in -t N is the total number of threads on the machine, not
> per CPU, right?
>
> 3. One more question I have is how to set the correct -t N for an
> MPI-based application. Let's say on the 2-CPU machine with 4 cores
> per CPU, should I use -t 4 or -t 8 if I run my application with 2
> MPI processes: mpirun -n 2 myprogram
>
> Many thanks!
>
> Sent from Yahoo Mail on Android
>
> On Thu, Jun 29, 2017 at 22:20, R. Clint Whaley <wh...@my...> wrote:
> Hyperthreading is an optimization aimed at addressing poorly
> optimized code. The idea is that most codes cannot drive the backend
> hardware (ALU/FPU, etc.) at the maximal rate, so if you duplicate
> registers you can, amongst several threads, find enough work to keep
> the backend busy.
>
> ATLAS (or any optimized linear algebra library) already runs the FPU
> at the maximal rate supported by the cache architecture after cache
> blocking.
>
> If you can already drive the backend at >90% of peak, then
> hyperthreading can actually *lose* you performance, as the threads
> bring conflicting data into the cache.
>
> It's usually not a night-and-day difference, but I haven't measured
> it in the huge-blocking era used by recent developer releases (it
> may be worse there).
>
> My general recommendation: turn off hyperthreading for highly
> optimized codes, and turn it on for relatively unoptimized codes.
>
> As to which core IDs correspond to the physical cores, that varies
> by machine. On x86, you can use CPUID to determine that if you are
> super-knowledgeable. I usually just turn it off in the BIOS, because
> I don't like something that may thrash my cache running, even if it
> might occasionally help :)
>
> Cheers,
> Clint
>
> On 06/28/2017 10:32 PM, lixin chu via Math-atlas-devel wrote:
> Hello,
>
> I would like to check if my understanding is correct for compiling
> ATLAS on a machine that has multiple CPUs and hyperthreading. I have
> two types of machine:
>
> - 2 CPUs, each with 4 cores, hyperthreaded, 2 threads per core
> - 2 CPUs, each with 8 cores, hyperthreaded, 2 threads per core
>
> So when I compile ATLAS, is it correct that I should use
> -tl 8 0,1,2,3,4,5,6,7 and -tl 16 0,1,...,15 (assuming the affinity
> IDs are 0-7 and 0-15)? That means the number 8 or 16 is the total
> number of cores on the machine, not the number of cores per CPU. Am
> I correct?
>
> I also read somewhere that ATLAS supports hyperthreading. What does
> this mean? Does it mean:
>
> 1. I do not need to disable hyperthreading in the BIOS (no
> performance difference whether it is enabled or disabled, as long as
> the number of threads and affinity IDs are set correctly when
> compiling ATLAS), or
> 2. I can make use of the hyperthreads, that is, -tl 16 and -tl 32?
>
> Thank you very much,
> lixin
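For reference, a hedged sketch of the two configure variants under
discussion, for the 2-CPU, 8-cores-per-CPU machine. The -t/-tl flag
syntax follows the question above; the out-of-tree build layout is
the usual ATLAS convention, and the paths are illustrative:

    mkdir build && cd build
    # Let ATLAS use 16 threads (one per physical core, HT off in BIOS):
    /path/to/ATLAS/configure -t 16
    # Or pin threads to an explicit affinity-ID list, one ID per
    # physical core:
    /path/to/ATLAS/configure -tl 16 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15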
> --
> Jeff Hammond
> jef...@gm...
> http://jeffhammond.github.io/
>
> --
> **********************************************************************
> ** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
> **********************************************************************
------------------------------------------------------------------------------
_______________________________________________
Math-atlas-devel mailing list
Mat...@li...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel