
#60 Patches for OSX threading w/ affinity

Milestone: stable
Status: open
Priority: 5
Updated: 2013-01-09
Created: 2012-09-25
Creator: Alex Leach
Private: No

As mentioned on a separate thread (https://sourceforge.net/tracker/index.php?func=detail&aid=3294997&group_id=23725&atid=379483), I was having some problems building xtune_aff on OS X Mountain Lion. Looking through the code, I saw there was an untested and deactivated Mac OS X implementation for setting processor affinity in ATL_thread_start.c.

So I thought I'd try to get that working. I attach a patch covering every source file I edited. `make check` and `make ptcheck` pass without error, but `xtune_aff` shows no speed improvement when using the processor affinity code. Maybe there's no speedup because the Mac-specific code is only implemented in ATL_thread_start, and not in the other ATL_thread_... files.

Changes in the patch:
1) #include <mach/thread_policy.h>
I put this in atlas_threads.h, but it would probably be better off in ATL_thread_start.c, the only file I saw using that API.

2) Debugged a couple of lines in the "unchecked special OSX code" section of ATL_thread_start.c (see the sketch after this list).
Point of reference (as well as the link in the source code): https://developer.apple.com/library/mac/#documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html

3) In Make.lib, added a rule for building a threaded LAPACK dynamic library (libptlapack.dylib).

4) In CONFIG/src/probe_aff.c, I commented out the line that outputs "#define ATL_NOAFFINITY 1" to atlas_taffinity.h. Perhaps a Mac-specific preprocessor clause around this would be safer...
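For context, here is a minimal sketch of the Mach affinity API that change 2 exercises. The set_affinity_tag helper is a hypothetical name of mine (not something in ATLAS), and it assumes you already hold a pthread handle:

{{{
/* Sketch of Apple's thread-affinity API (not ATLAS code). Threads
 * given the same tag hint the scheduler to co-locate them on cores
 * that share a cache. */
#include <pthread.h>
#include <mach/mach.h>
#include <mach/thread_policy.h>

static int set_affinity_tag(pthread_t pt, integer_t tag)
{
    thread_affinity_policy_data_t policy = { tag };
    /* pthread_mach_thread_np() maps a pthread to its Mach thread port */
    kern_return_t kr = thread_policy_set(pthread_mach_thread_np(pt),
                                         THREAD_AFFINITY_POLICY,
                                         (thread_policy_t)&policy,
                                         THREAD_AFFINITY_POLICY_COUNT);
    return (kr == KERN_SUCCESS) ? 0 : -1;  /* KERN_SUCCESS == 0 */
}
}}}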

That's all. Hope it might be of use...

Cheers,
Alex

Discussion

  • Alex Leach - 2012-09-25

    diff -u /path/to/edited/ATLAS /path/to/original/ATLAS

     
  • R. Clint Whaley - 2012-10-09

I think by default I use OpenMP on OS X. Parallel performance is still terrible, but better than with pthreads. Did this not work for you when you did an install with 3.10.0?

    Thanks,
    Clint

     
  • Alex Leach - 2012-10-10

Oh, really? I thought I read somewhere that OpenMP support was in beta or something, and might be removed. I know Clang doesn't support OpenMP at all, which is one of the reasons I stopped building with it. I was initially using OpenMP with gcc and it seemed to work fine, but I didn't find performance was much better, and I saw very few OMP pragmas in ATLAS, so I later switched to pthreads only. I also started building with Clang, even though I saw in the ATLAS Install Manual that Clang has previously failed to build accurate libraries; I've found it passes the basic tests okay, though.

I found performance with Apple's Clang was far better than with Apple's gcc 4.2.1 (I'd put speed comparisons below, but I'm pretty sure there's a better place for them). However, I haven't got _any_ build to pass the 'make lapack_test_pt' tests. The first test, 'xlintsts', hangs in LAPACK's slartg function (infinite loop around lines 162-7), and if I run the tests in parallel (e.g. make .. -j8), it'll reach other tests, which either segfault or hang. I suspect this is due to linking the test binaries (e.g. xlintsts) against static archives instead of dylibs, but I haven't been able to confirm this.

With regards to the affinity handling, I made some more source code edits and added a probe_aff_MAC makefile test, which can figure out the number of affinity groups available on OS X (there are 127 on my Mac). I'm not sure whether this affects performance for better or worse, but I'll upload the extra changes in a bit, in case you're interested. These were made against the 3.10 release.

    Cheers,
    Alex

     
  • Bloðøx - 2012-10-10

Interesting. Clint, do you plan to release a patched version for OS X, or should I pick the patch up and somehow integrate it into the MacPorts framework?

     
  • Alex Leach - 2012-10-10

    patches for affinity probing on OS X (also need CONFIG/src/backend/probe_aff_MAC.c)

     
  • Alex Leach - 2012-10-10

    source file for OSX affinity probing (goes in ATLAS/CONFIG/src/backend/)

     
  • Alex Leach - 2012-10-10

diff of builddir/lib/Makefile against makes/Make.lib. It's made me some pretty good dylibs for Numpy / Scipy...

     
  • R. Clint Whaley - 2012-10-10

    Alex,

What I use is gcc 4.7 from MacPorts or Fink, and ATLAS and LAPACK pass all tests with that. However, it does mean you cannot use AVX, since nobody seems to be able to make a gcc install avoid the broken version of `as` that OS X provides. This cuts your peak performance in half on a Sandy Bridge chip, but it didn't seem to have much of an effect on real performance on my laptop: apparently AVX does not fully work there for some reason (not sure if it is a power issue, something unique to my laptop, or what; the laptop is the only Apple machine I have with AVX).

    Anyway, gcc 4.7 got much better performance than clang when I scoped them out, in addition to passing all the tests, if you want to try that.

Yes, according to my measurements, OpenMP got better performance than pthreads. I believe it comes down to CPU affinity. OpenMP seems to use some statistical methods to achieve something like affinity using a thread pool; this does not work well, but it works a lot better than no affinity at all. Mac OS X on its own does a terrible job with all parallel operations, AFAICT; it is far worse than Windows or Linux.

If you get gcc 4.7 installed, I'd be interested to see how the performance of your pthread+affinity code compares to the OpenMP+gcc default.

Do you actually understand what the affinity code is doing in the patch you sent? What does it mean to have 127 affinity groups?

    Thanks!
    Clint

     
  • R. Clint Whaley - 2012-10-10

    Bloðøx,

I'm not going to put the patch into the developer release right now. I'll need to either find the time to work on it myself, or see some posted timings showing how it improves things over the default of gcc4.7+OpenMP. OS X is so awful to work with that I'm unlikely to do anything on the platform for a while; right now I'm mostly working on the new GEMM redesign.

    Cheers,
    Clint

     
  • Alex Leach - 2012-10-10

    Clint,
Thanks for the info. I haven't used the MacPorts compilers before, but I'm installing gcc47 from MacPorts now. It might be a while before I time a build with it, but I'm sure 4.7 will give much better performance than at least Apple's GCC 4.2. I've been using gfortran 4.2.4, by the way, which I patched and compiled from Apple's source code for 4.2.1. Again, I imagine gfortran has improved significantly since 4.2, but if I run comparative tests I'm inclined to use the same Fortran compiler in both builds.

    I can't pretend I fully understand the processor affinity stuff, but I know a lot more about it than before I started digging around the code!
Re: the "affinity groups": I think I meant "affinity tags", and there are 128, sorry (0-127). These are probed by the xprobe_affs program, and I used them to populate the macro ATL_OSX_AFF_SETS, which I saw only in ATL_thread_start.c. I initially thought this should just equal the number of processor cores, but that macro is already defined in the compile command.
The way I figure it, if ATL_OSX_AFF_SETS is 127 (128?), then every 128th thread (or every 8th, if the macro equalled the core count) gets the same affinity tag, which I think has something to do with shared CPU cache: threads carrying the same tag should share affinity-based CPU cache, as long as they're spawned from the same process. Is that the gist of it?

    These are the lines in ATL_thread_start.c (and in the atlas.diffs patch) which I think implement this:

{{{
struct thread_affinity_policy ap;
/* map the thread's rank onto one of the available affinity tags */
ap.affinity_tag = proc % ATL_OSX_AFF_SETS;
/* thread_policy_set() returns KERN_SUCCESS (0) on success */
ATL_assert(thread_policy_set(thr->thrH, THREAD_AFFINITY_POLICY,
                             (thread_policy_t)&ap,
                             THREAD_AFFINITY_POLICY_COUNT) == KERN_SUCCESS);
}}}

    Cheers,
    Alex

     
  • Alex Leach - 2012-10-10

Btw, I don't have AVX instructions on my Mac's CPU (Xeon W3530). I just ran 'sysctl -a', and the highest it has is SSE4.2. Sounds like that's a shame!

     
  • R. Clint Whaley - 2012-10-10

    Alex,

    I think for non-AVX machines, gcc4.7 should get you the same serial performance as you can get under Linux, as well as hopefully passing the tests!

    I'll be eager to see some performance comparisons of both serial and parallel code once you have both installs working. If the serial code is faster with gcc4.7, then you can apply your patch and compare your code to the standard OpenMP install both using gcc4.7.

    Once you have things installed, let me know if you need directions on how to do some timings.

    Thanks!
    Clint

     
  • Alex Leach - 2012-10-10

    Hi Clint,
To clarify, can I compare serial and parallel code from a single build? I.e. use libtatlas as the parallel test case, and libsatlas for serial. Can 'make time' be used to time both?
With the OpenMP GCC4.7 build, will a comparison against a Clang build using pthreads be okay? I'm going to configure both to use only 4 processor cores, instead of the 8 detected hyper-threaded cores.
I'm building my hacked code now with GCC4.7 and pthreads, but using the same gfortran 4.2.4 I used for the Clang build.
    I'll build with OMP enabled in a bit.
    Cheers,
    Alex

     
  • R. Clint Whaley - 2012-10-10

    Alex,

There are a lot of things you can build to time various operations, giving the performance of all the various BLAS. You can get an idea of overall serial kernel performance with "make time". You can even compare two different installs directly using these directions:
    http://math-atlas.sourceforge.net/atlas_install/node42.html

To get an idea of overall performance, running the factorizations is a pretty good measure that can be done in one run.
To do this, go to BLDdir/bin and issue:
make xdslvtst xdslvtst_pt

Then, for instance,
./xdslvtst -N 200 2000 200
runs LU on all problems between 200 & 2000, stride 200. Do the same with ./xdslvtst_pt. For threaded problems you may want to run larger problems. Threaded numbers also jump around more: you can use -# 3 to make slvtst run each timing 3 times (so you can see what the variance is). OS X should have a lot of variance in threaded timings due to affinity issues.
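For example, to combine the two suggestions above (larger problems, 3 repetitions per timing), you might run something like:
./xdslvtst_pt -N 1000 5000 1000 -# 3
(the -N range here is just illustrative, not a prescribed one).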

One way to see that your affinity is working like it does on Linux is to check whether your pthreads+affinity version gets very reliable timings, while the OpenMP version jitters around a lot more . . .

    Let me know,
    Clint

     
  • Alex Leach - 2012-10-11

I've generated timings for 5 separate builds now. I thought the best way to compare them would be in graph form, so I put the timings in a spreadsheet and drew some graphs; I'll attach it now. There are 2 worksheets: one with serial timings and another with parallel. The data just contains timings and standard deviations for the factorisation test you suggested.
I should probably make a note of the different build settings I used in the Make.inc files, but I don't have time for that just now. Not sure how reliable the results are, but they should be interesting at least; I probably should have run more repetitions of the tests...

     
  • R. Clint Whaley - 2012-10-12

Interesting. As for affinity vs. none, none of the standard deviations look like they are achieving real affinity to me, but 3 repeats probably isn't enough to get a feel for this anyway.

It does look like (from limited data, anyway) that your affinity is an improvement over straight OpenMP, which almost certainly means it's an even stronger improvement over gcc+pthreads without affinity.

Can you do the make time comparison of your best clang vs. best gcc47? For large problems the compiler doesn't matter, since 99% of the time is spent in my assembly code on x86. But make time includes some kernels whose performance comes from the compiler, and it is there that we usually see clang taking it on the chin.

    Thanks!
    Clint

     
  • Alex Leach - 2012-10-12

Is it possible to re-generate the timings printed by 'make time'? I've been running the builds in parallel, and since building the clang builds I've disabled Sophos Anti-Virus's InterCheck, as it caused CPU usage to spike regularly, which could have interfered with some timings generated during earlier builds.

    I've also written a script to run xdslvtst and generate the stats for multiple repetitions. I'll redo the spreadsheet with timings repeated 20 times in a bit.
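(For illustration, here's a minimal sketch of the stats step, assuming timings are fed one per line on stdin; this is a hypothetical stand-in, not the actual script:)

{{{
/* Hypothetical stand-in for the stats script: read one timing per
 * line from stdin, print count, mean, and sample standard deviation. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x, sum = 0.0, sumsq = 0.0;
    int n = 0;
    while (scanf("%lf", &x) == 1)
    {
        sum += x;
        sumsq += x * x;
        n++;
    }
    if (n > 1)
    {
        double mean = sum / n;
        double var = (sumsq - n * mean * mean) / (n - 1);
        printf("n=%d  mean=%.4f  stddev=%.4f\n", n, mean, sqrt(var));
    }
    return 0;
}
}}}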

As for the make time comparisons, it looks from the spreadsheet like the best parallel times were GCC4.7 w/ affinity vs. Clang w/ affinity... will this do? I'd like to regenerate the clang timings though, because I had InterCheck on when I did that one. From the output below, the gcc build is faster for almost all tests.

{{{
./buildgcc47/xatlbench -dp ./buildgcc47/bin/INSTALL_LOG/ -dc ./buildclang64/bin/INSTALL_LOG/

The times labeled Reference are for ATLAS as installed by the authors.
NAMING ABBREVIATIONS:
   kSelMM : selected matmul kernel (may be hand-tuned)
   kGenMM : generated matmul kernel
   kMM_NT : worst no-copy kernel
   kMM_TN : best no-copy kernel
   BIG_MM : large GEMM timing (usually N=1600); estimate of asymptotic peak
   kMV_N  : NoTranspose matvec kernel
   kMV_T  : Transpose matvec kernel
   kGER   : GER (rank-1 update) kernel
Kernel routines are not called by the user directly, and their
performance is often somewhat different than the total
algorithm (eg, dGER perf may differ from dkGER)

Reference clock rate=2800Mhz, new rate=2800Mhz
Refrenc : % of clock rate achieved by reference install
Present : % of clock rate achieved by present ATLAS install

                   single precision                double precision
            ********************************  ********************************
                real            complex           real            complex
            ---------------  ---------------  ---------------  ---------------
Benchmark   Refrenc Present  Refrenc Present  Refrenc Present  Refrenc Present
=========   ======= =======  ======= =======  ======= =======  ======= =======
   kSelMM     753.6   698.4    737.0   624.5    362.3   369.7    366.4   355.7
   kGenMM     191.8   173.2    200.1   164.5    185.8   175.8    189.8   164.8
   kMM_NT     165.3   142.2    171.8   175.5    160.5   122.1    167.8   131.7
   kMM_TN     196.5   139.4    196.9   160.0    184.8    82.2    172.9   157.4
   BIG_MM     701.8   644.5    716.6   656.6    337.0   348.2    351.6   351.2
    kMV_N     186.8   201.6    396.9   379.8    102.4    96.2    193.0   182.1
    kMV_T     198.4   172.1    414.6   388.6    105.4    98.1    199.0   192.8
     kGER     135.7   136.9    272.1   266.5     68.4    61.2    136.5   126.2
}}}

     
  • Alex Leach - 2012-10-12

    ATLAS factorisation timings; updated w/ 20 reps

     
