#896 Parallel GEMM Performance on IvyBridge


With 3.11.8 on a 3770K (3.50 GHz) system with GCC 4.7.2, I obtain the following GFLOP/s for parallel GEMM:

                           ATLAS         OpenBLAS (SB)
N      M       K       sgemm   dgemm     sgemm   dgemm
1024   1024    1024    160.0    93.4     194.4   103.0
96     64      25024    74.0    58.1     110.1    56.9
96     192     25024    82.7    72.6     132.2    74.8
150    125     25024    96.5    54.7     111.5    61.2
150    375     25024   153.8    84.4     151.4    75.3
60     35      25024    75.6    37.2      36.1    19.5
60     105     25024   102.2    53.8     105.0    50.7

where ATLAS has been built to use only four cores (ignoring the IDs of the HT virtual cores). The single-precision performance is somewhat worse than OpenBLAS in around half of the cases.
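For readers who want to reproduce figures like those in the table, here is a minimal sketch of how a GFLOP/s number is usually derived from a wall-clock timing. It assumes the conventional count of 2·M·N·K floating-point operations per GEMM call; the post does not say how its timings were taken, and the function names are purely illustrative.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for C = A*B with A (m x k) and B (k x n).

    Assumes the conventional count: one multiply and one add per inner-product term.
    """
    return 2 * m * n * k


def gemm_gflops(m, n, k, seconds):
    """Convert a measured wall-clock time for one GEMM call into GFLOP/s."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: a 1024 x 1024 x 1024 multiply measured at 13.4 ms.
    print(gemm_gflops(1024, 1024, 1024, 0.0134))
```

In practice the timing would be averaged over several repetitions, since a single small call is dominated by thread start-up and cache effects.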


  • First, thanks for posting the comparison timings!

    Can you try a very large problem like M=N=K=5000? I should get around 92% of peak there, so there should not be much room to beat me. If you don't see that, then I'll wonder if the new framework install went bad or something.

    As for the really non-square cases, I think the new framework I'm working on should eventually catch us up there, but it is in the early days right now, and I'm mostly working on the square and rank-K cases.

    When I get that done, I will then turn to better handling all the weird shapes. If the shapes you choose come from actual usage, let me know, so I can add them to my list of things to try to eventually support well!

    Many thanks,

  • For M=N=K=5000 I get 202 GFLOP/s for SGEMM and 107 GFLOP/s for DGEMM.

    While I do not have the full results on me at the moment, I've seen the git OpenBLAS pull 235 GFLOP/s SGEMM on the same system for M=N=K=1024. This puts ATLAS at least 15% off peak (assuming, unrealistically, that OpenBLAS is getting peak). Should I consider recompiling, and if so, are there any special flags/configure options I should investigate?
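    The percentages discussed here can be checked with a short calculation. The sketch below is only indicative and rests on stated assumptions: 4 physical cores, the nominal 3.5 GHz clock quoted at the top of the thread (turbo ignored), and 16 single-precision or 8 double-precision flops per cycle per core for AVX without FMA. The helper names are illustrative, not part of ATLAS or OpenBLAS.

```python
def theoretical_peak_gflops(cores, ghz, flops_per_cycle):
    """Nominal peak in GFLOP/s = cores * clock (GHz) * flops per cycle per core."""
    return cores * ghz * flops_per_cycle


def fraction_of(measured, reference):
    """Ratio of a measured rate to a reference rate (both in GFLOP/s)."""
    return measured / reference


# Assumed hardware parameters (not stated in the thread beyond the 3.5 GHz clock).
SP_PEAK = theoretical_peak_gflops(cores=4, ghz=3.5, flops_per_cycle=16)
DP_PEAK = theoretical_peak_gflops(cores=4, ghz=3.5, flops_per_cycle=8)

if __name__ == "__main__":
    # ATLAS results for M=N=K=5000 against the assumed nominal peak.
    print(f"SGEMM: {fraction_of(202, SP_PEAK):.1%} of nominal peak")
    print(f"DGEMM: {fraction_of(107, DP_PEAK):.1%} of nominal peak")
    # ATLAS SGEMM against the 235 GFLOP/s seen with git OpenBLAS.
    print(f"SGEMM vs OpenBLAS: {fraction_of(202, 235):.1%}")
```

    Under these assumptions the 235 GFLOP/s OpenBLAS figure is higher than the nominal single-precision estimate, which suggests the processor was running above 3.5 GHz, so the peak percentages should be read as rough guides only.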

    The shapes are from my real-world fluid solver, which spends ~80% of its time performing such multiplications. Key M, N values are:

     M    N
     96   64
     96  192
    150  125
    150  375
     40   20
     40   60
     60   35
     60  105

    with K going from 4000 to 40,000.
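    As a rough sketch, the shapes listed above could be turned into a benchmark case list as follows. The (M, N) pairs are those quoted in this post; the choice of K samples (the two endpoints of the stated range plus the 25024 dimension from the table at the top of the thread, taken here to be K) is an assumption made for illustration, as are all names.

```python
# (M, N) pairs quoted in the post.
MN_SHAPES = [(96, 64), (96, 192), (150, 125), (150, 375),
             (40, 20), (40, 60), (60, 35), (60, 105)]

# Hypothetical sample of K values within the quoted 4000 to 40,000 range.
K_VALUES = [4000, 25024, 40000]


def benchmark_cases(mn_shapes=MN_SHAPES, k_values=K_VALUES):
    """Return (m, n, k, flops) for every shape/K combination.

    flops uses the conventional 2*M*N*K count for one GEMM call.
    """
    return [(m, n, k, 2 * m * n * k)
            for (m, n) in mn_shapes
            for k in k_values]


if __name__ == "__main__":
    for m, n, k, flops in benchmark_cases():
        print(f"M={m:4d} N={n:4d} K={k:6d}  {flops / 1e9:8.3f} GFLOP per call")
```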

    Last edit: Freddie Witherden 2013-03-26
  • Just as a slight, unfortunate, correction. The table headings should be K, M, N as opposed to N, M, K. Sorry about this.

    • status: open --> open-out-of-date
    • assigned_to: R. Clint Whaley
  • Can you try this again with the newest developer release? It has a partial (still ongoing) rewrite of the threaded stuff for better performance.

  • No response, closing.

    • status: open-out-of-date --> closed