Automatically Tuned Linear Algebra Soft. / Patches / #60 Patches for OSX threading w/ affinity

Alex Leach - 2012-09-25

diff -u /path/to/edited/ATLAS /path/to/original/ATLAS

patches.udiff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2012-10-09

I think by default I use openMP on OS X. Parallel performance is still terrible, but better than using pthreads. Did this not work for you when you did an install with the 3.10.0?

Thanks,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

Oh, really? I thought I read somewhere that OpenMP support was in beta or something, and might be removed. I know Clang doesn't support OpenMP at all, which is one of the reasons I stopped building with it. I was initially using openmp with gcc and it seemed to work fine, but I didn't find performance was much better, and saw very few OMP pragmas in ATLAS, so I later started using only pthreads.. I also started building with Clang, even though I saw in the ATLAS Install Manual that Clang has previously failed to build accurate libraries. I've found it passes the basic tests okay though.

I found performance with Apple's Clang was far better than with Apple's gcc 4.2.1 (I'd put speed comparisons below, but I'm pretty sure there's a better place for them..) However, I haven't got _any_ build to pass the 'make lapack_test_pt_pt' tests. The first test, 'xlintsts' hangs in Lapack's slartg function (infinite loop around lines 162-7), and if I run the test in parallel (e.g. make .. -j8), it'll reach other tests, which either segfault or hang. I suspect this is due to linking the test binaries (e.g. xlintsts) against static archives, instead of dylibs, but I haven't been able to confirm this.

With regards to the affinity handling, I made some more source code edits and added a probe_aff_MAC makefile test, which can figure out the number of affinity groups available on OSX (there's 127 on my Mac). I'm not sure if this affects performance for better or worse, but I'll upload the extra changes I made in a bit, in case you're interested. These were made against the 3.10 release.

Cheers,
Alex

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Bloðøx - 2012-10-10

Interesting. Clint, do you plan to release a patched version for OS X, or should I pick the patch up and some how integrate it into the MacPorts framework?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

patches for affinity probing on OS X (also need CONFIG/src/backend/probe_aff_MAC.c)

atlas.diffs

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

source file for OSX affinity probing (goes in ATLAS/CONFIG/src/backend/)

probe_aff_MAC.c

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

diff of builddir/lib/Makefile against makes/Make.lib. Has made me some pretty good dylibs for Numpy / Scipy...

atlas.Make.lib.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2012-10-10

Alex,

What I use is gcc 4.7 from macports or fink, and ATLAS and LAPACK passes all tests with that. However, it does mean you cannot use AVX, since nobody can seem to make gcc install not to use the broken version of as that OS X provides. This cuts your peak performance in half on a sandy bridge chip, but this didn't seem to have much of an effect on real performance on my laptop: apparantly AVX does not fully work there for some reason (not sure if it is a power issue, something unique to my laptop, or what; the laptop is the only apple machine I have with AVX).

Anyway, gcc 4.7 got much better performance than clang when I scoped them out, in addition to passing all the tests, if you want to try that.

Yes, according to my measurements, OpenMP got better performance than pthreads. I believe it comes down to cpu affinity. OpenMP seems to use some statistical methods to achieve something like affinity using a thread pool; this does not work well, but it works a lot better than no affinity at all. The Mac OS X on its own does a terrible job for all parallel operations, AFAIKT; it is far worse than Windows or Linux.

If you get gcc 4.7 installed, I'd be interested to see what the performance of some of your pthread+affinity code is as compared to the OpenMP+gcc default.

Do actually understand what the affniity is doing in the code you sent? What does it mean to have 127 affinity groups?

Thanks!
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2012-10-10

Bloðøx,

I'm not going to put the patch in the developer right now. I'll need to either find the time to work on it myself, or see some posted timings on how it improves things over the default of gcc4.7+OpenMP. OS X is so awful to work with that I'm unlikely to do anything on the platform for a while. Right now, I'm currently mostly working on the new GEMM redesign.

Cheers,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

Clint,
Thanks for the info. I haven't used the MacPorts compilers before, but I'm installing gcc47 from MacPorts now. Might be a while before I time a build with it, but I'm sure 4.7 will give much better performance than at least Apple's GCC4.2. I've been using gfortran 4.2.4 btw, which I patched and compiled from Apple's source code of 4.2.1. Again, I imagine gfortran has improved significantly since 4.2, but if I run comparative tests, I'm inclined to use the same fortran compiler in both builds..

I can't pretend I fully understand the processor affinity stuff, but I know a lot more about it than before I started digging around the code!
Re: The "affinity groups". I think I meant "affinity tags", and there's 128, sorry (0-127) . These are probed by the xprobe_affs program, and I used them to populate the macro ATL_OSX_AFF_SETS, which I saw only in ATL_thread_start.c. I initially thought this should just equal the number of processor cores, but that macro is already defined in the compile command.
So the way I figure it, is that if ATL_OSX_AFF_SETS is 127 (128?), then on average each 127th (8th?) process will spawn threads with the same affinity tag, which I think has something to do with shared CPU cache. So, any thread from the same process will share the same affinity-based CPU cache with other threads, as long as they're spawned from the same process. Is that the gist of it?

These are the lines in ATL_thread_start.c (and in the atlas.diffs patch) which I think implement this:

{{{
struct thread_affinity_policy ap;
ap.affinity_tag = proc % ATL_OSX_AFF_SETS;
ATL_assert(thread_policy_set(thr->thrH, THREAD_AFFINITY_POLICY,
(thread_policy_t*)&ap,
THREAD_AFFINITY_POLICY_COUNT) != KERN_SUCCESS);
}}}

Cheers,
Alex

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

Btw, I don't have AVX instructions on my Mac's CPU (Xeon W3530). Just ran 'sysctl -a', and the highest it has is SSE4.2. Sounds like that's a shame!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2012-10-10

Alex,

I think for non-AVX machines, gcc4.7 should get you the same serial performance as you can get under Linux, as well as hopefully passing the tests!

I'll be eager to see some performance comparisons of both serial and parallel code once you have both installs working. If the serial code is faster with gcc4.7, then you can apply your patch and compare your code to the standard OpenMP install both using gcc4.7.

Once you have things installed, let me know if you need directions on how to do some timings.

Thanks!
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-10

Hi Clint,
To clarify, can I compare serial and parallel code from a single build? i.e. use libtatlas as parallel test case, and libsatlas for serial. Can 'make time' be used to time both?
With the OpenMP GCC4.7 build, will a comparison against a Clang build using pthreads be okay? I'm going to configure both to use only 4 processor cores, instead of the detected 8, hyper-threaded cores.
Building my hacked code now with GCC4.7 and pthreads, but using the same gfortran 4.2.4 I used for the Clang build.
I'll build with OMP enabled in a bit.
Cheers,
Alex

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2012-10-10

Alex,

There are a lot of things you can build time various things, giving performance of all the various blas. You can get an idea of overall serial kernel performance with "make time". You can even compare to different installs directly using these directions:
http://math-atlas.sourceforge.net/atlas_install/node42.html

To get an idea of overall performance, running the factorizations is a pretty good measure that can be done in one.
To do this, go to BLDdir/bin, and issue:
make xdslvtst xdslvtst_pt

Then, for instance
./xdslvtst -N 200 2000 200
runs LU on all problems between 200 & 2000 stride 200. Do the same with ./xdslvtst_pt. For threaded problems, you may want to run larger problems. Threaded numbers also jump around more: you can use -# 3 to make slvtst run each timing 3 times (so you can see what the variance is). OS X should have a lot of variance in threaded timings due to affinity issues.

One way to see that your affinity is working like it does on Linux is to see if your pthreads+affinity version gets very reliable timings, while the OpenMP version jumps jitters around a lot more . . .

Let me know,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-11

I've generated timings for 5 separate builds now. I thought the best way to compare them would be in graph format, so I slotted the timings in a spreadsheet, and drew some graphs. I will attach that now. There's 2 worksheets: one with serial timings and another with parallel. The data just contains timings and standard deviations for the factorisation test you suggested.
I should probably make note of the different build settings I used in the Make.inc files, but don't have time for that just now. Not sure how reliable the results are, but they should be interesting at least; probably should have ran more repetitions of the tests...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2012-10-12

Interesting. As for affinity/non, none of the standard dev look like they are achieving real affinity to me, but 3 repeats isn't large enough to get a feel fro this probably anyway.

It does look like (from limited data anyway), that your affinity is an improvement over straight OpenMP, which almost certainly means its an even stronger improvement over gcc+pthreads w/o affinity.

Can you do the make time comparison of your best clang vs. best gcc47? For large problems, compiler doesn't matter, since 99% of time is in my assembly code on x86. But make time includes some kernel's whose performance comes from the compiler, and it is there that we usually see clang taking it on the chin.

Thanks!
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-12

Is it possible to re-generate the timings printed by 'make time'? I've been running the builds in parallel, and since I built the clang builds, I disabled Sophos Anti-virus's InterCheck, as it caused CPU usage to spike regularly, which could have interfered with some timings generated during earlier builds.

I've also written a script to run xdslvtst and generate the stats for multiple repetitions. I'll redo the spreadsheet with timings repeated 20 times in a bit.

As for make time comparisons, looks from the spreadsheet like the best parallel times were for GCC4.7 w/ Affinity vs Clang w/ affinity... Will this do? I'd like to regenerate the clang timings though, because I had InterCheck on when I did that one. Looks here like the gcc build is faster for almost all tests.

./buildgcc47/xatlbench -dp ./buildgcc47/bin/INSTALL_LOG/ -dc ./buildclang64/bin/INSTALL_LOG/

The times labeled Reference are for ATLAS as installed by the authors.
NAMING ABBREVIATIONS:
kSelMM : selected matmul kernel (may be hand-tuned)
kGenMM : generated matmul kernel
kMM_NT : worst no-copy kernel
kMM_TN : best no-copy kernel
BIG_MM : large GEMM timing (usually N=1600); estimate of asymptotic peak
kMV_N : NoTranspose matvec kernel
kMV_T : Transpose matvec kernel
kGER : GER (rank-1 update) kernel
Kernel routines are not called by the user directly, and their
performance is often somewhat different than the total
algorithm (eg, dGER perf may differ from dkGER)

Reference clock rate=2800Mhz, new rate=2800Mhz
Refrenc : % of clock rate achieved by reference install
Present : % of clock rate achieved by present ATLAS install

single precision double precision
******************************** *******************************
real complex real complex
--------------- --------------- --------------- ---------------
Benchmark Refrenc Present Refrenc Present Refrenc Present Refrenc Present
========= ======= ======= ======= ======= ======= ======= ======= =======
kSelMM 753.6 698.4 737.0 624.5 362.3 369.7 366.4 355.7
kGenMM 191.8 173.2 200.1 164.5 185.8 175.8 189.8 164.8
kMM_NT 165.3 142.2 171.8 175.5 160.5 122.1 167.8 131.7
kMM_TN 196.5 139.4 196.9 160.0 184.8 82.2 172.9 157.4
BIG_MM 701.8 644.5 716.6 656.6 337.0 348.2 351.6 351.2
kMV_N 186.8 201.6 396.9 379.8 102.4 96.2 193.0 182.1
kMV_T 198.4 172.1 414.6 388.6 105.4 98.1 199.0 192.8
kGER 135.7 136.9 272.1 266.5 68.4 61.2 136.5 126.2

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alex Leach - 2012-10-12

ATLAS factorisation timings; updated w/ 20 reps

ATLAS_factorisation_timings.ods

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Patches for OSX threading w/ affinity

Group

Searches

Help

#60 Patches for OSX threading w/ affinity

Discussion