[Math-atlas-results] ATLAS 3.3.4 on IA64
From: R C. W. <rw...@cs...> - 2001-09-13 21:21:51
<PRE>
Date: Tue, 11 Sep 2001 22:24:15
Vers: ATLAS 3.3.2 and 3.3.4

Guys,

OK, trying to set a record for number of releases, I've just posted
3.3.5. This gets rid of trtri out of lapack, improves IA64 complex
performance, and fixes a bug in the complex Cholesky tester.

I have figured out why I got no speedup with my new kernel on the IA64.
If you recall, 3.3.3 (which started all this quick release madness) was
supposed to be an IA64-improving release, due to IA64 prefetch, but when
I timed it on machines I wasn't NDAd on, I got no performance
improvement. Even though it used the same compiler as my NDAd machine,
I got strange compiler problems as well. Turns out the problem is that
the TestDrive machine has two different compilers, and my 3.3.3 build
was using a mixture of RedHat's baaaad gcc and the much better gcc 3.0.

So, this is the first performance hint for IA64: make sure you use
gcc 3.0 everywhere in your ATLAS install. Change all C compilers defined
in your Make.<arch> to explicitly reference it, and change all gcc refs
in ATLAS/tune/blas/gemm/CASES/?cases.flg as well.

Once this was done, I got the performance shown below. What we see is
that prefetch does not make a big performance improvement (3.3.2 and
3.3.4 are almost the same speed asymptotically), but that the improved
cleanup code I wrote definitely helps small problems.

Prefetch definitely helps the Level 1 and 2 BLAS performance; the bad
news is that even the new performance is signally poor. This is because
we have no IA64-specific kernels for Level 1/2; the improvement is
simply using the best general kernel with prefetching enabled . . .

The timings on an 800MHz IA64 are included below, all for double
precision. I do not have access to non-NDAd MKL; if anyone does, I'd
love to see some comparisons . . .

Cheers,
Clint

Timings for double precision, comparing ATLAS 3.3.2 vs. 3.3.4, all on
an 800MHz IA64.
The performance of 3.3.4 is the same as 3.3.5 for double precision
(3.3.5 is faster for complex; complex timings are not shown).

                 100    200    300    400    500    600    700    800    900   1000
              ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM     1024.0 1512.4 1783.7 1846.1 1896.3 2076.8 1973.2 2084.6 2102.8 2104.8
3.3.4 dMM     1061.1 1524.1 1803.1 1927.5 1969.2 2029.2 2081.6 2072.3 2126.8 2135.6

                1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
              ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 dMM     2112.8 2129.5 2192.0 2222.1 2180.5 2136.3 2189.1 2159.2 2236.1 2218.9
3.3.4 dMM     2155.3 2144.9 2171.5 2206.1 2205.7 2194.5 2220.9 2218.9 2223.6 2229.9

                GEMM   SYMM   SYRK  SYR2K   TRMM   TRSM
              ====== ====== ====== ====== ====== ======
3.3.2 d100     967.9  962.4  627.4  862.9  677.2  490.4
3.3.4 d100    1019.9 1153.2  710.1  891.6  732.5  636.8
3.3.2 d500    1889.3 1723.9 1452.0 1777.8 1514.8 1245.7
3.3.4 d500    1939.4 1729.7 1590.0 1718.1 1501.5 1402.7
3.3.2 d1000   2117.9 1917.6 1653.3 1935.7 1790.2 1526.1
3.3.4 d1000   2155.8 1823.7 1677.6 1932.1 1701.0 1528.4

                GEMV   SYMV   TRMV   TRSV    GER    SYR   SYR2
              ====== ====== ====== ====== ====== ====== ======
3.3.2 d500     122.4  225.6  113.4  109.4   39.2   47.3   61.0
3.3.4 d500     130.1  245.2  170.1  151.5  160.1  107.1  156.9
3.3.2 d1000    166.0  231.3  101.0   97.3   37.3   37.4   52.1
3.3.4 d1000    214.1  208.7  194.3  180.0  172.2  115.3  165.7

                ROTM   SWAP   SCAL   COPY   AXPY    DOT   NRM2   ASUM   AMAX
              ====== ====== ====== ====== ====== ====== ====== ====== ======
3.3.2 d500000   72.5   28.6   18.6   51.3   29.5   35.1   18.3   47.5   49.3
3.3.4 d500000   77.8   33.6   39.4   50.8  133.0   82.3   96.1  180.9  120.8
</PRE>
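The post's claim that prefetch helps the Level 1 BLAS can be checked by taking per-routine ratios of the two rows in the last table. The following is only a minimal sketch, not part of ATLAS: the function name `speedups` and the list names are hypothetical, the numbers are copied from the post, and it assumes the figures are rates (larger is faster), since the post does not state the unit.

```python
# Level 1 BLAS timings for N = 500000, copied from the last table in the post.
# Assumption: the values are rates (larger is faster); the post gives no unit.
ROUTINES = ["ROTM", "SWAP", "SCAL", "COPY", "AXPY", "DOT", "NRM2", "ASUM", "AMAX"]
V332 = [72.5, 28.6, 18.6, 51.3, 29.5, 35.1, 18.3, 47.5, 49.3]
V334 = [77.8, 33.6, 39.4, 50.8, 133.0, 82.3, 96.1, 180.9, 120.8]


def speedups(names, old, new):
    """Return {routine: new / old} for two timing rows of equal length."""
    if not (len(names) == len(old) == len(new)):
        raise ValueError("rows must have the same length")
    return {name: n / o for name, o, n in zip(names, old, new)}


if __name__ == "__main__":
    ratios = speedups(ROUTINES, V332, V334)
    # Print routines from largest to smallest improvement.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:5s} {ratio:5.2f}x")
```

A ratio above 1 means 3.3.4 was faster than 3.3.2 for that routine; the same function works on any other pair of rows from the tables, as long as both rows list the same routines in the same order.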