#78 honor omp_get_max_threads for openmp


The OpenMP version of ATLAS (as of 3.9.68) currently ignores the number of threads set by OpenMP (which is usually set by the user via the OMP_NUM_THREADS environment variable). The maximum number of threads is instead compiled in (as the ATL_NTHREADS constant), modulo some dynamic tuning. It would be really nice if ATLAS would call omp_get_num_threads to get the number of threads to use (or at least a maximum number of threads to use) dynamically.

Not only does the current behavior mean that the user cannot change the number of threads without recompiling ATLAS (a common request as noted elsewhere on this list), but it also defeats a lot of the benefit of using OpenMP. By honoring omp_get_num_threads, ATLAS would share a common thread pool with the calling program, which allows the user to e.g. call ATLAS in a parallel loop and have each thread automatically call the serial ATLAS. It is currently quite painful to mix parallel code with a parallel ATLAS because the user code may be fighting with ATLAS over the processors.
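To illustrate the co-existence point: when OpenMP is limited to one active level of parallelism, a parallel region opened from inside an already-parallel caller runs on a team of one thread, which is the "serial ATLAS per thread" behavior described above. The following is a minimal sketch, not ATLAS code. `demo_inner_team_size` is a hypothetical name, and the inner region merely stands in for a threaded library call. The `#else` stubs exist only so the file still builds when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without -fopenmp. */
static int  omp_get_num_threads(void) { return 1; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Hypothetical demo: returns the largest team size seen by a parallel
 * region that is opened from inside another parallel region. */
static int demo_inner_team_size(void)
{
    int largest_inner = 0;

    /* Allow only the outermost parallel region to fork threads. */
    omp_set_max_active_levels(1);

    #pragma omp parallel
    {
        /* Stand-in for a library routine that opens its own region. */
        #pragma omp parallel
        {
            int n = omp_get_num_threads();
            #pragma omp critical
            {
                if (n > largest_inner) largest_inner = n;
            }
        }
    }
    assert(largest_inner >= 1);
    return largest_inner;
}
```

With this setting each outer thread sees an inner team of size 1, so the user's loop and the library do not compete for processors.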

I'd rather have the option of letting ATLAS rely on OpenMP more for the number of threads (and possibly even things like affinity, which should really be done within the OpenMP implementation as needed), even if the performance is somewhat worse, if the benefit is better co-existence with user parallelization and support for user control over the number of threads.

Looking through the code, unfortunately, there are quite a few places where ATL_NTHREADS (and ATL_NTHRPOW2) are used, so it's not trivial to hack in (and I'm not sure what other side effects replacing these with omp_get_num_threads would have). (Also, ATL_goparallel would need to be modified to call omp_set_num_threads only when P < omp_get_num_threads, and to restore the previous value of omp_get_num_threads when it is done.)
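The save/lower/restore idea in the parenthetical above can be sketched as follows, using omp_get_max_threads as the follow-up comment corrects. This is an assumption-laden illustration rather than the real ATL_goparallel: `demo_goparallel`, `demo_work_fn`, and `demo_count` are hypothetical names, and the `#else` stubs exist only so the file builds without OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without -fopenmp. */
static int  omp_get_max_threads(void) { return 1; }
static int  omp_get_num_threads(void) { return 1; }
static int  omp_get_thread_num(void)  { return 0; }
static void omp_set_num_threads(int n) { (void)n; }
#endif

/* Hypothetical callback type: one call per thread in the team. */
typedef void (*demo_work_fn)(int rank, int nthreads, void *arg);

/* Run `work` on at most P threads without exceeding the user's limit,
 * then put the OpenMP thread setting back as it was found.
 * Returns the team size that was actually used. */
static int demo_goparallel(int P, demo_work_fn work, void *arg)
{
    const int saved = omp_get_max_threads();
    int team = 1;

    assert(work != NULL);
    if (P < 1) P = 1;
    if (P < saved) omp_set_num_threads(P);   /* lower only, never raise */

    #pragma omp parallel
    {
        #pragma omp single
        team = omp_get_num_threads();
        work(omp_get_thread_num(), omp_get_num_threads(), arg);
    }

    if (P < saved) omp_set_num_threads(saved);   /* restore caller's value */
    return team;
}

/* Example work function: counts how many threads ran. */
static void demo_count(int rank, int nthreads, void *arg)
{
    int *hits = arg;
    (void)rank; (void)nthreads;
    #pragma omp atomic
    *hits += 1;
}
```

Calling `demo_goparallel(2, demo_count, &hits)` leaves `omp_get_max_threads()` unchanged afterwards, and `hits` ends up equal to the returned team size.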


  • Steven G. Johnson

    Whoops, that should be omp_get_max_threads, not omp_get_num_threads.

    (One could try to be fancier and handle nesting too. e.g. if get_max_threads is 16, but ATLAS is being called from one of 2 threads, it could try for 8 threads.)
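The nesting heuristic in the comment above amounts to dividing the thread limit by the size of the calling team. A minimal sketch, assuming a simple integer division with a floor of one thread; `demo_share` and `demo_thread_budget` are hypothetical names, not ATLAS functions, and the `#else` stubs exist only so the file builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Split a thread limit evenly among `callers` concurrent callers,
 * never returning fewer than one thread. */
static int demo_share(int max_threads, int callers)
{
    int share;
    if (max_threads < 1) max_threads = 1;
    if (callers < 1) callers = 1;
    share = max_threads / callers;
    return share > 0 ? share : 1;
}

/* Thread budget for the current call site: the full limit when called
 * from serial code, a fraction of it when called from inside a team. */
static int demo_thread_budget(void)
{
    return demo_share(omp_get_max_threads(), omp_get_num_threads());
}
```

For the example in the comment, `demo_share(16, 2)` yields 8. An inner region can only use such a budget if nested parallelism is enabled in the OpenMP runtime.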

  • Steven G. Johnson

    • summary: honor omp_get_num_threads for openmp --> honor omp_get_max_threads for openmp

  • Steven G. Johnson

    Update: I looked into this a bit, and it actually wasn't too hard to hack into the source code. I can now set the number of threads dynamically with the OMP_NUM_THREADS environment variable, and it seems to work (it gives near-linear speedup with the number of threads for a sufficiently large matrix). I also disabled ATL_setmyaffinity (it would be nice if there were an "official" way to do this at configure time, but I didn't see one in the options), since GNU OpenMP seems to do a reasonable job by itself.

    The patch follows. Note that I #defined a new var ATL_NTHREADS_MAX to the #CPUs to use for array sizes in header files, since there is no quick way to make these dynamically resized without hacking the code elsewhere, and I was lazy.

    --- orig/ATLAS/src//threads/ATL_goparallel.c 2012-02-23 10:57:00.000000000 -0500
    +++ ATLAS/src//threads/ATL_goparallel.c 2012-02-27 01:00:54.838005414 -0500
    @@ -4,6 +4,15 @@
    #include "atlas_misc.h"
    #include "atlas_threads.h"

    +#ifdef ATL_OMP_THREADS
    +int ATL_omp_get_nthrpow2(void)
    +{
    +   int i, P = omp_get_max_threads();
    +   for (i=0; (1<<i) < P; i++);
    +   return i;
    +}
    +#endif
    +
    #if !defined(ATL_NOAFFINITY) && defined(ATL_PAFF_SELF) && defined(ATL_USEOPENMP)
    static int ATL_setmyaffinity()
    @@ -131,14 +140,14 @@
    tp[i].P = P;
    ls.rank2thr = tp;
    - omp_set_num_threads(P);
    - #pragma omp parallel
    + ATL_assert(P <= ATL_NTHREADS_MAX);
    + #pragma omp parallel num_threads(P)
    * Make sure we got the requested nodes, and set affinity if supported
    ATL_assert(omp_get_num_threads() == P);
    - #ifdef ATL_PAFF_SELF
    + #ifdef ATL_PAFF_SELFxxxx /* SGJ: disable */
    i = omp_get_thread_num();
    @@ -150,6 +159,7 @@
    if (DoComb)
    for (i=1; i < P; i++)
    ls.DoComb(ls.opstruct, 0, i);
    ls.acounts = &lc;
    --- orig/ATLAS/include/atlas_tlvl3.h 2012-02-23 10:56:54.000000000 -0500
    +++ ATLAS/include//atlas_tlvl3.h 2012-02-26 23:55:03.245859726 -0500
    @@ -69,7 +69,7 @@
    typedef struct ATL_TMMNode ATL_TMMNODE_t;
    struct ATL_TMMNode
    void (*gemmK)(ATL_CINT, ATL_CINT, ATL_CINT, const void*, const void *,
    ATL_CINT, const void*, ATL_CINT, const void*, void*, ATL_CINT);
    const void *A, *B;
    @@ -122,7 +122,7 @@
    typedef struct ATL_SyrkK ATL_TSYRK_K_t;
    struct ATL_SyrkK
    void (*gemmT)(const enum ATLAS_TRANS, const enum ATLAS_TRANS,
    ATL_CINT, ATL_CINT, ATL_CINT, const void *,
    const void *, ATL_CINT, const void *, ATL_CINT,

  • Steven G. Johnson

    PS. Turning back on processor affinity (remove the "xxxx" in the patch) does give a significant improvement, although it's not so terrible without it.
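Two ideas from the patch earlier in this thread can be shown in isolation: requesting the team size with a `num_threads(P)` clause instead of a global `omp_set_num_threads(P)` call, and keeping statically sized per-thread arrays safe by checking the runtime P against a compile-time bound. This is a sketch under stated assumptions, not the patched ATLAS source: `DEMO_NTHREADS_MAX` is a hypothetical stand-in for ATL_NTHREADS_MAX with an arbitrary value, and `demo_ceil_log2` and `demo_fill_ranks` are hypothetical names. The `#else` stubs exist only so the file builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without -fopenmp. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Hypothetical compile-time bound used to size per-thread arrays. */
#define DEMO_NTHREADS_MAX 64

/* Smallest i with 2^i >= P, mirroring the loop in the patch. */
static int demo_ceil_log2(int P)
{
    int i;
    for (i = 0; (1 << i) < P; i++)
        ;
    return i;
}

/* Each thread writes its rank into a fixed-size array. The clause
 * affects only this region, so no global setting must be restored.
 * Returns the team size the runtime actually provided. */
static int demo_fill_ranks(int P, int ranks[DEMO_NTHREADS_MAX])
{
    int team = 1;

    assert(P >= 1 && P <= DEMO_NTHREADS_MAX);

    #pragma omp parallel num_threads(P)
    {
        ranks[omp_get_thread_num()] = omp_get_thread_num();
        #pragma omp single
        team = omp_get_num_threads();
    }
    return team;
}
```

The runtime may provide fewer than P threads, so this sketch reports the actual team size instead of asserting that it equals P.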

  • R. Clint Whaley

    R. Clint Whaley - 2012-02-29

    There's going to have to be a threading rewrite someday after the stable release to better support thread pools and low-latency parallel ops. At that time, it may become possible to move the # of procs from compile to runtime decision, but I'll have to see how that works.

    Note that you should only use the OpenMP version of ATLAS if you are on OS X (where pthreads sucks horribly and OpenMP sucks just a tiny bit less), or if you are mixing ATLAS with another OpenMP-dependent library. Using OpenMP on any platform other than OS X is likely to cost you a factor of 2 in your parallel performance.


