I'd like to compare ATLAS to eigen but I'm having a hard time understanding what in atlas should be modified so that I can substitute or add eigen versions of some functions to be timed by atlas. I should probably start with one program from chapters 2, 4, 5 or 6 if I understood TestTime.txt correctly. Would that be for example l3blastst.c or some lower level function which I then substitute somewhere?
The eigen code should be easy to write, for example matrix and vector arithmetic is trivial in eigen: http://eigen.tuxfamily.org/dox/TutorialMatrixArithmetic.html#TutorialArithmeticMatrixMul
There even is an example showing how to use eigen from C code:
https://bitbucket.org/eigen/eigen/src/43de1660cb26/demos/mix_eigen_and_c/binary_library.h
https://bitbucket.org/eigen/eigen/src/43de1660cb26/demos/mix_eigen_and_c/binary_library.cpp
https://bitbucket.org/eigen/eigen/src/43de1660cb26/demos/mix_eigen_and_c/example.c
So if I want to reproduce the first benchmark given in http://eigen.tuxfamily.org/index.php?title=Benchmark (Y += alpha X) for eigen and atlas using atlas what would be the place to start in atlas code? I'm assuming atlas uses only dynamic vectors and matrices so writing one eigen function should suffice.
Thanks.
Yes, the ATLAS timers can time any BLAS or lapack implementation. However, they expect a library, not a header file (or C++).
What does the eigen package provide? Does it provide the C API to BLAS, for instance? If so, is it available as a library, or only a header file?
Cheers,
Clint
It turns out that eigen provides an experimental version of blas (http://eigen.tuxfamily.org/index.php?title=Todo#BLAS_implementation_on_top_of_Eigen , https://bitbucket.org/eigen/eigen/src/43de1660cb26/blas) but there seem to be symbols missing when running make xsl1blastst:
cd /home/ilja/ATLAS-3.9.83/build/src/testing ; make slib
make[1]: Entering directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
make -j 6 slib.grd
make[2]: Entering directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
make[2]: "slib.grd" is up to date.
make[2]: Leaving directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
make[1]: Leaving directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
/usr/bin/x86_64-unknown-linux-gnu-g++ -fomit-frame-pointer -mfpmath=sse -O2 -msse3 -m64 -o xsl1blastst sl1blastst.o \ /home/ilja/ATLAS-3.9.83/build/lib/libtstatlas.a /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a /home/ilja/ATLAS-3.9.83/build/lib/libatlas.a -lpthread -lm
/home/ilja/ATLAS-3.9.83/build/lib/libtstatlas.a(ATL_sf77wrap.o): In function `dswrapdot_':
ATL_sf77wrap.f:(.text+0x67): undefined reference to `dsdot_'
/home/ilja/ATLAS-3.9.83/build/lib/libtstatlas.a(ATL_sf77wrap.o): In function `sdswrapdot_':
ATL_sf77wrap.f:(.text+0x87): undefined reference to `sdsdot_'
collect2: ld returned 1 exit status
make: *** [xsl1blastst] Error 1
Before trying that I had to change
F77 = /usr/bin/x86_64-unknown-linux-gnu-gfortran
to
F77 = /usr/bin/x86_64-unknown-linux-gnu-g++
otherwise I got this:
/home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `ssyr2_':
single.cpp:(.text+0x1555): undefined reference to `operator delete[](void*)'
single.cpp:(.text+0x1578): undefined reference to `operator delete[](void*)'
/home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `stbsv_':
single.cpp:(.text+0x194c): undefined reference to `operator delete[](void*)'
...
/home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `global constructors keyed to saxpy_':
single.cpp:(.text+0x6a6a): undefined reference to `std::ios_base::Init::Init()'
...
/home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `Eigen::internal::throw_std_bad_alloc()':
single.cpp:(.text._ZN5Eigen8internal19throw_std_bad_allocEv[Eigen::internal::throw_std_bad_alloc()]+0xa): undefined reference to `__cxa_allocate_exception'
...
/home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(xerbla.cpp.o): In function `xerbla_':
xerbla.cpp:(.text+0xa): undefined reference to `std::cerr'
xerbla.cpp:(.text+0x1b): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'
xerbla.cpp:(.text+0x20): undefined reference to `std::cerr'
...
But the level 2 and 3 programs seemed to compile, here's an example output from ./xsl3blastst:
--------------------------------- GEMM ----------------------------------
TST# A B M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.00 0.0 1.00 -----
0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.00 0.0 0.00 PASS
1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.00 4000.0 1.00 -----
1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.00 0.0 0.00 PASS
2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.00 13500.0 1.00 -----
2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.00 13500.0 1.00 PASS
3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.01 15998.0 1.00 -----
3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.00 32000.0 2.00 PASS
4 N N 500 500 500 1.0 1000 1000 1.0 1000 0.02 15624.0 1.00 -----
4 N N 500 500 500 1.0 1000 1000 1.0 1000 0.01 20833.3 1.33 PASS
5 N N 600 600 600 1.0 1000 1000 1.0 1000 0.03 13499.2 1.00 -----
5 N N 600 600 600 1.0 1000 1000 1.0 1000 0.02 21597.8 1.60 PASS
6 N N 700 700 700 1.0 1000 1000 1.0 1000 0.04 15589.8 1.00 -----
6 N N 700 700 700 1.0 1000 1000 1.0 1000 0.03 21436.2 1.38 PASS
7 N N 800 800 800 1.0 1000 1000 1.0 1000 0.07 15057.9 1.00 -----
7 N N 800 800 800 1.0 1000 1000 1.0 1000 0.04 23271.1 1.55 PASS
8 N N 900 900 900 1.0 1000 1000 1.0 1000 0.09 16567.2 1.00 -----
8 N N 900 900 900 1.0 1000 1000 1.0 1000 0.06 22779.8 1.37 PASS
9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 0.13 15624.0 1.00 -----
9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 0.09 22725.7 1.45 PASS
10 tests run, 10 passed
So eigen is slower with matrices > 300x300?
OK, looks like they don't provide the full L1BLAS. You can probably add the ATLAS interface routine after the eigen one so that it satisfies any missing symbols; then you can build the l1timer.
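For reference, the two symbols the linker reports as missing (dsdot_ and sdsdot_) are the mixed-precision dot products of the reference BLAS. A minimal sketch of what they compute, assuming positive increments only and using Python's struct module to emulate single-precision rounding; the function names mirror the BLAS routines, but this is an illustration and not ATLAS or Eigen code:

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dsdot(n, sx, incx, sy, incy):
    """Dot product of two single-precision vectors, accumulated and returned in double."""
    acc = 0.0
    for i in range(n):
        acc += to_f32(sx[i * incx]) * to_f32(sy[i * incy])
    return acc

def sdsdot(n, sb, sx, incx, sy, incy):
    """sb + dot(sx, sy), accumulated in double, with the result rounded to single."""
    return to_f32(to_f32(sb) + dsdot(n, sx, incx, sy, incy))
```

Any library that supplies these two routines and is placed after the eigen library on the link line should resolve the undefined references shown above.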
Yes, from what I can see below, it looks like ATLAS is winning for all the sizes for which the timer is producing reasonable results. Did you install ATLAS without using -DPentiumCPS or -DWALL, or something? These timings look very crude, like when using the default CPU timer, which has very low resolution.
Anyway, to successfully time smaller problems, throw the -F flag to force the timing to be done multiple times, and then you can get more reasonable small-case timings, e.g. -F 200 or something similar. Just keep cranking up the number until the small cases are more repeatable.
If your ATLAS install was compiled with CPU time, you may want to reinstall using a WALL (as described in the install guide) to get better libraries and much better timers.
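The 0.00 times and 0.0 MFLOP rows in the output above are what a coarse timer produces when a single call finishes within roughly one timer tick. A rough sketch of the idea behind repeating the call, assuming the conventional 2*M*N*K flop count for GEMM (the exact count ATLAS uses may differ slightly) and a hypothetical helper time_repeated that is not part of ATLAS:

```python
import time

def gemm_mflops(m, n, k, seconds):
    """MFLOP rate for one GEMM call, assuming 2*M*N*K floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e6

def time_repeated(fn, reps):
    """Average wall-clock seconds per call of fn over reps back-to-back calls."""
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

# Hypothetical usage: average 200 calls of a small stand-in kernel
avg = time_repeated(lambda: sum(i * i for i in range(1000)), 200)
```

With this count, a 1000x1000x1000 problem that takes about 0.128 s works out to 15625 MFLOP, which is in line with the roughly 15624 MFLOP reported for that size in the table.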
Note that you can also time the parallel BLAS; does Eigen provide parallel BLAS? For instance, try xdl3blastst_pt to get the parallel L3BLAS timer.
Let me know,
Clint
Thanks I forgot about -DPentiumCPS. Eigen doesn't seem to have much parallelization yet so I won't try to test that.
I'll start attaching instructions and results, here's my cpuinfo:
processor : 5
vendor_id : AuthenticAMD
cpu family : 16
model : 10
model name : AMD Phenom(tm) II X6 1075T Processor
stepping : 0
microcode : 0x10000bf
cpu MHz : 3000.000
cache size : 512 KB
physical id : 0
siblings : 6
core id : 5
cpu cores : 6
apicid : 5
initial apicid : 5
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter
bogomips : 6000.31
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb
Installation instructions
Uploaded the results, it seems most of the time atlas is a bit faster, sometimes a lot faster and sometimes eigen is noticeably faster:
./xcl2blastst -F 10 -R all
--------------------------------- HEMV ---------------------------------
TST# UP N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ==== ==== ==== ==== ==== ==== ==== ====== ===== ===== =====
40 L 100 1.0 0.0 1000 1 1.0 0.0 1 0.00 3467.7 1.00 -----
40 L 100 1.0 0.0 1000 1 1.0 0.0 1 0.00 2541.5 0.73 PASS
41 L 200 1.0 0.0 1000 1 1.0 0.0 1 0.00 4607.5 1.00 -----
41 L 200 1.0 0.0 1000 1 1.0 0.0 1 0.00 3133.2 0.68 PASS
42 L 300 1.0 0.0 1000 1 1.0 0.0 1 0.00 5261.8 1.00 -----
42 L 300 1.0 0.0 1000 1 1.0 0.0 1 0.00 3795.6 0.72 PASS
43 L 400 1.0 0.0 1000 1 1.0 0.0 1 0.00 5632.7 1.00 -----
43 L 400 1.0 0.0 1000 1 1.0 0.0 1 0.00 4040.3 0.72 PASS
44 L 500 1.0 0.0 1000 1 1.0 0.0 1 0.00 6000.7 1.00 -----
44 L 500 1.0 0.0 1000 1 1.0 0.0 1 0.00 4403.2 0.73 PASS
45 L 600 1.0 0.0 1000 1 1.0 0.0 1 0.00 6137.4 1.00 -----
45 L 600 1.0 0.0 1000 1 1.0 0.0 1 0.00 4706.3 0.77 PASS
46 L 700 1.0 0.0 1000 1 1.0 0.0 1 0.00 6142.8 1.00 -----
46 L 700 1.0 0.0 1000 1 1.0 0.0 1 0.00 4772.4 0.78 PASS
47 L 800 1.0 0.0 1000 1 1.0 0.0 1 0.00 6389.9 1.00 -----
47 L 800 1.0 0.0 1000 1 1.0 0.0 1 0.00 4921.3 0.77 PASS
48 L 900 1.0 0.0 1000 1 1.0 0.0 1 0.00 6426.6 1.00 -----
48 L 900 1.0 0.0 1000 1 1.0 0.0 1 0.00 5016.0 0.78 PASS
49 L 1000 1.0 0.0 1000 1 1.0 0.0 1 0.00 6614.7 1.00 -----
49 L 1000 1.0 0.0 1000 1 1.0 0.0 1 0.00 5088.6 0.77 PASS
----------------------------- HPR2 ----------------------------
TST# UPLO N ALPHA INCX INCY TIME MFLOP SpUp TEST
==== ==== ===== ===== ===== ==== ==== ====== ====== ===== =====
150 L 100 1.0 0.0 1 1 0.00 5588.7 1.00 -----
150 L 100 1.0 0.0 1 1 0.00 3881.2 0.69 PASS
151 L 200 1.0 0.0 1 1 0.00 5993.6 1.00 -----
151 L 200 1.0 0.0 1 1 0.00 4002.9 0.67 PASS
152 L 300 1.0 0.0 1 1 0.00 6170.5 1.00 -----
152 L 300 1.0 0.0 1 1 0.00 4082.4 0.66 PASS
153 L 400 1.0 0.0 1 1 0.00 6135.5 1.00 -----
153 L 400 1.0 0.0 1 1 0.00 4012.7 0.65 PASS
154 L 500 1.0 0.0 1 1 0.00 6228.4 1.00 -----
154 L 500 1.0 0.0 1 1 0.00 4028.5 0.65 PASS
155 L 600 1.0 0.0 1 1 0.00 6199.2 1.00 -----
155 L 600 1.0 0.0 1 1 0.00 4065.9 0.66 PASS
156 L 700 1.0 0.0 1 1 0.00 6188.8 1.00 -----
156 L 700 1.0 0.0 1 1 0.00 4017.0 0.65 PASS
157 L 800 1.0 0.0 1 1 0.00 6346.8 1.00 -----
157 L 800 1.0 0.0 1 1 0.00 4014.2 0.63 PASS
158 L 900 1.0 0.0 1 1 0.00 6272.5 1.00 -----
158 L 900 1.0 0.0 1 1 0.00 4065.4 0.65 PASS
159 L 1000 1.0 0.0 1 1 0.00 6404.4 1.00 -----
159 L 1000 1.0 0.0 1 1 0.00 4060.1 0.63 PASS
./xdl2blastst -F 10 -R all
----------------------------- SYMV -----------------------------
TST# UP N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
40 L 100 1.0 1000 1 1.0 1 0.00 2218.3 1.00 -----
40 L 100 1.0 1000 1 1.0 1 0.00 1804.1 0.81 PASS
41 L 200 1.0 1000 1 1.0 1 0.00 2795.6 1.00 -----
41 L 200 1.0 1000 1 1.0 1 0.00 2248.5 0.80 PASS
42 L 300 1.0 1000 1 1.0 1 0.00 3051.5 1.00 -----
42 L 300 1.0 1000 1 1.0 1 0.00 2281.5 0.75 PASS
43 L 400 1.0 1000 1 1.0 1 0.00 3135.4 1.00 -----
43 L 400 1.0 1000 1 1.0 1 0.00 1937.8 0.62 PASS
44 L 500 1.0 1000 1 1.0 1 0.00 3065.5 1.00 -----
44 L 500 1.0 1000 1 1.0 1 0.00 1968.3 0.64 PASS
45 L 600 1.0 1000 1 1.0 1 0.00 3071.2 1.00 -----
45 L 600 1.0 1000 1 1.0 1 0.00 1976.5 0.64 PASS
46 L 700 1.0 1000 1 1.0 1 0.00 2586.4 1.00 -----
46 L 700 1.0 1000 1 1.0 1 0.00 1806.9 0.70 PASS
47 L 800 1.0 1000 1 1.0 1 0.00 2964.4 1.00 -----
47 L 800 1.0 1000 1 1.0 1 0.00 1940.6 0.65 PASS
48 L 900 1.0 1000 1 1.0 1 0.00 2541.9 1.00 -----
48 L 900 1.0 1000 1 1.0 1 0.00 1782.4 0.70 PASS
49 L 1000 1.0 1000 1 1.0 1 0.00 2456.2 1.00 -----
49 L 1000 1.0 1000 1 1.0 1 0.00 1603.7 0.65 PASS
./xsl2blastst -F 10 -R all
----------------------------- SYMV -----------------------------
TST# UP N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
40 L 100 1.0 1000 1 1.0 1 0.00 2709.0 1.00 -----
40 L 100 1.0 1000 1 1.0 1 0.00 1731.7 0.64 PASS
41 L 200 1.0 1000 1 1.0 1 0.00 3858.7 1.00 -----
41 L 200 1.0 1000 1 1.0 1 0.00 2181.8 0.57 PASS
42 L 300 1.0 1000 1 1.0 1 0.00 4739.5 1.00 -----
42 L 300 1.0 1000 1 1.0 1 0.00 2809.3 0.59 PASS
43 L 400 1.0 1000 1 1.0 1 0.00 5297.4 1.00 -----
43 L 400 1.0 1000 1 1.0 1 0.00 3272.8 0.62 PASS
44 L 500 1.0 1000 1 1.0 1 0.00 5459.3 1.00 -----
44 L 500 1.0 1000 1 1.0 1 0.00 3510.8 0.64 PASS
45 L 600 1.0 1000 1 1.0 1 0.00 5698.9 1.00 -----
45 L 600 1.0 1000 1 1.0 1 0.00 3161.3 0.55 PASS
46 L 700 1.0 1000 1 1.0 1 0.00 5164.9 1.00 -----
46 L 700 1.0 1000 1 1.0 1 0.00 3134.7 0.61 PASS
47 L 800 1.0 1000 1 1.0 1 0.00 5652.5 1.00 -----
47 L 800 1.0 1000 1 1.0 1 0.00 3210.7 0.57 PASS
48 L 900 1.0 1000 1 1.0 1 0.00 5104.5 1.00 -----
48 L 900 1.0 1000 1 1.0 1 0.00 3160.4 0.62 PASS
49 L 1000 1.0 1000 1 1.0 1 0.00 4205.0 1.00 -----
49 L 1000 1.0 1000 1 1.0 1 0.00 2797.4 0.67 PASS
------------------------ TPSV -------------------------
TST# UPLO TRAN DIAG N INCX TIME MFLOP SpUp TEST
==== ==== ==== ==== ==== ==== ====== ====== ===== =====
90 L N N 100 1 0.00 2408.1 1.00 -----
90 L N N 100 1 0.00 1743.5 0.72 PASS
91 L N N 200 1 0.00 2929.3 1.00 -----
91 L N N 200 1 0.00 1910.7 0.65 PASS
92 L N N 300 1 0.00 3200.4 1.00 -----
92 L N N 300 1 0.00 1996.7 0.62 PASS
93 L N N 400 1 0.00 3285.7 1.00 -----
93 L N N 400 1 0.00 1996.2 0.61 PASS
94 L N N 500 1 0.00 3287.1 1.00 -----
94 L N N 500 1 0.00 2010.1 0.61 PASS
95 L N N 600 1 0.00 3393.1 1.00 -----
95 L N N 600 1 0.00 2011.4 0.59 PASS
96 L N N 700 1 0.00 2945.2 1.00 -----
96 L N N 700 1 0.00 1914.5 0.65 PASS
97 L N N 800 1 0.00 2816.6 1.00 -----
97 L N N 800 1 0.00 1854.9 0.66 PASS
98 L N N 900 1 0.00 2785.2 1.00 -----
98 L N N 900 1 0.00 1854.6 0.67 PASS
99 L N N 1000 1 0.00 2410.4 1.00 -----
99 L N N 1000 1 0.00 1768.9 0.73 PASS
----------------------- SPR ------------------------
TST# UPLO N ALPHA INCX TIME MFLOP SpUp TEST
==== ==== ===== ===== ==== ====== ====== ===== =====
120 L 100 1.0 1 0.00 2366.5 1.00 -----
120 L 100 1.0 1 0.00 1690.2 0.71 PASS
121 L 200 1.0 1 0.00 2916.1 1.00 -----
121 L 200 1.0 1 0.00 1780.1 0.61 PASS
122 L 300 1.0 1 0.00 3197.0 1.00 -----
122 L 300 1.0 1 0.00 1805.5 0.56 PASS
123 L 400 1.0 1 0.00 3140.1 1.00 -----
123 L 400 1.0 1 0.00 1799.3 0.57 PASS
124 L 500 1.0 1 0.00 3134.2 1.00 -----
124 L 500 1.0 1 0.00 1812.4 0.58 PASS
125 L 600 1.0 1 0.00 3233.9 1.00 -----
125 L 600 1.0 1 0.00 1667.8 0.52 PASS
126 L 700 1.0 1 0.00 2859.8 1.00 -----
126 L 700 1.0 1 0.00 1733.1 0.61 PASS
127 L 800 1.0 1 0.00 2667.7 1.00 -----
127 L 800 1.0 1 0.00 1692.4 0.63 PASS
128 L 900 1.0 1 0.00 2619.4 1.00 -----
128 L 900 1.0 1 0.00 1688.7 0.64 PASS
129 L 1000 1.0 1 0.00 2223.0 1.00 -----
129 L 1000 1.0 1 0.00 1598.4 0.72 PASS
-------------------------- SPR2 -------------------------
TST# UPLO N ALPHA INCX INCY TIME MFLOP SpUp TEST
==== ==== ===== ===== ==== ==== ====== ====== ===== =====
140 L 100 1.0 1 1 0.00 3551.3 1.00 -----
140 L 100 1.0 1 1 0.00 2579.4 0.73 PASS
141 L 200 1.0 1 1 0.00 4278.7 1.00 -----
141 L 200 1.0 1 1 0.00 2747.1 0.64 PASS
142 L 300 1.0 1 1 0.00 4801.9 1.00 -----
142 L 300 1.0 1 1 0.00 2812.3 0.59 PASS
143 L 400 1.0 1 1 0.00 5291.5 1.00 -----
143 L 400 1.0 1 1 0.00 2775.1 0.52 PASS
144 L 500 1.0 1 1 0.00 5046.1 1.00 -----
144 L 500 1.0 1 1 0.00 2817.1 0.56 PASS
145 L 600 1.0 1 1 0.00 5051.2 1.00 -----
145 L 600 1.0 1 1 0.00 2842.6 0.56 PASS
146 L 700 1.0 1 1 0.00 4883.7 1.00 -----
146 L 700 1.0 1 1 0.00 2720.8 0.56 PASS
147 L 800 1.0 1 1 0.00 4672.9 1.00 -----
147 L 800 1.0 1 1 0.00 2633.3 0.56 PASS
148 L 900 1.0 1 1 0.00 4674.2 1.00 -----
148 L 900 1.0 1 1 0.00 2725.9 0.58 PASS
149 L 1000 1.0 1 1 0.00 4057.8 1.00 -----
149 L 1000 1.0 1 1 0.00 2601.6 0.64 PASS
./xzl1blastst -F 10 -R all
---------------- ASUM -----------------
TST# N INCX TIME MFLOP SpUp TEST
==== ==== ==== ====== ===== ===== =====
61 100 1 0.00 2141.7 1.00 -----
61 100 1 0.00 1878.6 0.88 PASS
62 200 1 0.00 2194.8 1.00 -----
62 200 1 0.00 1891.5 0.86 PASS
63 300 1 0.00 2212.3 1.00 -----
63 300 1 0.00 1888.8 0.85 PASS
64 400 1 0.00 2219.1 1.00 -----
64 400 1 0.00 1885.2 0.85 PASS
65 500 1 0.00 2225.9 1.00 -----
65 500 1 0.00 1882.9 0.85 PASS
66 600 1 0.00 2217.9 1.00 -----
66 600 1 0.00 1889.2 0.85 PASS
67 700 1 0.00 2215.9 1.00 -----
67 700 1 0.00 1884.6 0.85 PASS
68 800 1 0.00 2203.0 1.00 -----
68 800 1 0.00 1882.8 0.85 PASS
69 900 1 0.00 2226.6 1.00 -----
69 900 1 0.00 1885.0 0.85 PASS
70 1000 1 0.00 2230.7 1.00 -----
70 1000 1 0.00 1885.6 0.85 PASS
Does atlas time complex expressions like D = (a*A + b*B) * (c * C) ... where eigen should be very fast?
BTW, what ATLAS version are you using? I just hugely increased HEMV, SYMV, TRSV and TRMV performance in the most recent releases. I think CGEMV is like 3x faster now on my system.
If you get 3.9.84, then you can install with -DPentiumCPS= in order to get much more accurate timings as well as the new stuff that will be in the new stable release (3.10.0), which I hope to release next week.
As for operations, ATLAS provides the BLAS and the 3 factorizations and related routines from LAPACK. It does not optimize single expressions, of course, since it is a library . . .
Cheers,
Clint
I used both ATLAS-3.9.84 and -DPentiumCPS=3000. The installation instructions file has all the commands that I used (from wget ...atlas3.9.84.tar.bz2... to ./xzl3blastst -F 100 -R all | tee xzl3blastst.out).
It would be nice to compare atlas using multiple blas calls [e.g. x = (a*A + b*B + c*C - d*D) dot e * E] with eigen transforming everything into just one loop...
Glad to see you are using 84; I hadn't checked your attached files, just scoped the posted one, sorry. Just gave the posted files a quick scope:
Makes sense ATLAS would lose to well-optimized HEMV: ATLAS's present approach to HEMV and SYMV is fundamentally flawed, because it takes two passes through memory instead of one, and memory is the main bottleneck. I haven't bothered to fix it, because I have yet to find an application where the performance of these routines is important. More broadly, the Level 1 and 2 BLAS don't tend to matter for many applications that I know about, though GEMV and GER can occasionally be important, as in the Hessenberg reduction (eigenvalues).
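To illustrate the "one pass through memory" point: a symmetric matrix-vector product can use each stored element for two updates, so the stored triangle is read only once. A minimal sketch, assuming the lower triangle is stored as a list of rows and alpha = beta = 1; it shows the general idea only and is not ATLAS's or Eigen's actual kernel:

```python
def symv_lower_one_pass(a, x, y):
    """y += A*x for symmetric A, reading each lower-triangle entry a[i][j] (j <= i) once."""
    n = len(x)
    for i in range(n):
        acc = 0.0
        for j in range(i):
            aij = a[i][j]
            acc += aij * x[j]   # contribution of A[i][j] to y[i]
            y[j] += aij * x[i]  # contribution of A[j][i] == A[i][j] to y[j]
        y[i] += acc + a[i][i] * x[i]
    return y
```

Since the routine is memory bound, halving the traffic over the matrix is where most of the gain would come from.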
You can also time other problem sizes. For instance, add -N 1400 3000 400 to time problems between 1400 and 3000. My guess is that the timings will continue to get worse and worse for eigen. Looks like they are not cache blocking, which is the kiss of death for large linear algebra. If this is true, you will see their performance drop as problem sizes are increased, while ATLAS will continue to improve.
You can also time the factorizations if eigen provides lapack interfaces. W/o cache blocking, these guys will also die for large problems.
I can't follow your notation to understand what you are asking. If the equation you give is all scalars, then of course ATLAS can't help. If they are matrix-matrix operations, ATLAS will probably do well without fusion. If they are matrix-vector operations, then the only critical thing is to make only one pass through memory.
Cheers,
Clint
Actually it would seem that atlas is consistently about 46 % faster than eigen up to n=m=k=9000. Also atlas is only 5 to 8 % faster than eigen if n=m=k= odd number (xsl3blastst):
------------------------ GEMM ----------------------------------
M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
1111 1111 1111 1.0 1111 1111 1.0 1111 0.18 15273.2 1.00 -----
1111 1111 1111 1.0 1111 1111 1.0 1111 0.17 16535.9 1.08 PASS
1112 1112 1112 1.0 1112 1112 1.0 1112 0.18 15690.5 1.00 -----
1112 1112 1112 1.0 1112 1112 1.0 1112 0.13 21819.7 1.39 PASS
1113 1113 1113 1.0 1113 1113 1.0 1113 0.18 15529.9 1.00 -----
1113 1113 1113 1.0 1113 1113 1.0 1113 0.16 16925.3 1.09 PASS
1114 1114 1114 1.0 1114 1114 1.0 1114 0.18 15416.4 1.00 -----
1114 1114 1114 1.0 1114 1114 1.0 1114 0.15 18921.6 1.23 PASS
1115 1115 1115 1.0 1115 1115 1.0 1115 0.18 15407.0 1.00 -----
1115 1115 1115 1.0 1115 1115 1.0 1115 0.17 16626.9 1.08 PASS
1116 1116 1116 1.0 1116 1116 1.0 1116 0.18 15645.1 1.00 -----
1116 1116 1116 1.0 1116 1116 1.0 1116 0.13 21574.5 1.38 PASS
1117 1117 1117 1.0 1117 1117 1.0 1117 0.18 15643.9 1.00 -----
1117 1117 1117 1.0 1117 1117 1.0 1117 0.17 16471.1 1.05 PASS
1118 1118 1118 1.0 1118 1118 1.0 1118 0.18 15601.0 1.00 -----
1118 1118 1118 1.0 1118 1118 1.0 1118 0.15 18806.1 1.21 PASS
1119 1119 1119 1.0 1119 1119 1.0 1119 0.18 15484.5 1.00 -----
1119 1119 1119 1.0 1119 1119 1.0 1119 0.17 16207.1 1.05 PASS
For large (even) matrices:
------------------------ GEMM ----------------------------------
M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
1000 1000 1000 1.0 1000 1000 1.0 1000 0.12 16370.2 1.00 -----
1000 1000 1000 1.0 1000 1000 1.0 1000 0.09 22272.8 1.36 PASS
2000 2000 2000 1.0 2000 2000 1.0 2000 1.01 15918.3 1.00 -----
2000 2000 2000 1.0 2000 2000 1.0 2000 0.69 23089.3 1.45 PASS
3000 3000 3000 1.0 3000 3000 1.0 3000 3.42 15788.6 1.00 -----
3000 3000 3000 1.0 3000 3000 1.0 3000 2.33 23128.0 1.46 PASS
4000 4000 4000 1.0 4000 4000 1.0 4000 8.07 15861.7 1.00 -----
4000 4000 4000 1.0 4000 4000 1.0 4000 5.50 23266.5 1.47 PASS
5000 5000 5000 1.0 5000 5000 1.0 5000 15.70 15928.1 1.00 -----
5000 5000 5000 1.0 5000 5000 1.0 5000 10.74 23269.0 1.46 PASS
6000 6000 6000 1.0 6000 6000 1.0 6000 27.06 15963.7 1.00 -----
6000 6000 6000 1.0 6000 6000 1.0 6000 18.51 23336.2 1.46 PASS
7000 7000 7000 1.0 7000 7000 1.0 7000 43.06 15930.2 1.00 -----
7000 7000 7000 1.0 7000 7000 1.0 7000 29.31 23401.6 1.47 PASS
8000 8000 8000 1.0 8000 8000 1.0 8000 63.97 16007.7 1.00 -----
8000 8000 8000 1.0 8000 8000 1.0 8000 43.79 23381.7 1.46 PASS
9000 9000 9000 1.0 9000 9000 1.0 9000 91.77 15887.7 1.00 -----
9000 9000 9000 1.0 9000 9000 1.0 9000 62.44 23351.7 1.47 PASS
The even/odd thing was brought up on the eigen mailing list; the thread starts at http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00052.html but the archives haven't caught up yet. Apparently on an Intel CPU eigen and atlas have the same speed but eigen doesn't slow down if n=m=k=1115 instead of 1116.
As for my notation, I was thinking that small letters would be scalars and large ones vectors. Changing the equation a bit would also allow for matrices * vectors. Eigen should be able to loop only once through each variable while the C++ source code would have x = (a*A + b*B +...
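A small sketch of the difference being asked about, with scalars a, b, c and vectors A, B, C as in the notation above. The first function mimics a sequence of Level 1 BLAS-style axpy calls, one full vector pass per term; the second evaluates the whole expression in a single loop, which is what an expression-template library aims for. Both functions are illustrative only; neither is ATLAS or Eigen code:

```python
def axpy(alpha, x, y):
    """y += alpha * x, one full pass over x and y (BLAS axpy semantics)."""
    for i in range(len(x)):
        y[i] += alpha * x[i]

def blas_style(a, A, b, B, c, C):
    """x = a*A + b*B + c*C as three separate vector passes."""
    x = [0.0] * len(A)
    axpy(a, A, x)
    axpy(b, B, x)
    axpy(c, C, x)
    return x

def fused(a, A, b, B, c, C):
    """Same result in one loop: each input is read once and x is written once."""
    return [a * A[i] + b * B[i] + c * C[i] for i in range(len(A))]
```

In the BLAS-style version x is read and written three times; for vectors larger than cache, that extra memory traffic is the cost that fusion avoids.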
Thanks for the info on the odd problem size. Indeed if you separate out M,N,K, you get a small drop for each one that is odd, down to the worst-case you show when all dims are odd. I think this is down to the way ATLAS handles cleanup. As soon as I release the stable, I'm going to rewrite GEMM for increased performance, but the rewrite I have in mind is likely to make this problem even worse, since cleanup will be even more granular in the new design!
However, if I'm successful, the raw MFLOP should increase even for the odd cases, since the peak should be increased significantly.
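As a rough picture of what "cleanup" means here: an unrolled kernel handles the part of a dimension that is a multiple of its unroll factor, and whatever is left over goes to slower cleanup code. The sketch below assumes a hypothetical unroll factor of 4, chosen only for illustration; the factors ATLAS actually uses are not stated in this thread:

```python
def split_for_cleanup(dim, unroll):
    """Split dim into the part an unrolled kernel covers and the leftover cleanup part."""
    main = (dim // unroll) * unroll
    return main, dim - main

# Sizes taken from the timings above; the unroll factor 4 is an assumption
for n in (1112, 1114, 1115, 1116):
    print(n, split_for_cleanup(n, 4))
```

Under that assumption, 1112 and 1116 need no cleanup at all, while 1114 and 1115 leave 2 and 3 elements per dimension for the cleanup path.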
Thanks for posting the further timings. Indeed, it appears ATLAS has some blocking advantage over Eigen, but they are both doing blocking. At a guess, does Eigen not copy A and B? If so, you may have TLB problems on some machines for larger problems.
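A back-of-the-envelope sketch of why not copying can cost TLB entries: in a large column-major matrix, each column of a block sits on a different page, while a copied block is contiguous. The block size of 64 and the page-aligned array are assumptions made for illustration; the 4K page size matches the cpuinfo posted earlier, and the elements are taken to be doubles:

```python
def pages_touched(rows, cols, lda, elem_size=8, page_size=4096):
    """Count distinct pages covered by a rows x cols block at the start of a
    column-major array with leading dimension lda (array assumed page-aligned)."""
    pages = set()
    for j in range(cols):
        first = j * lda * elem_size           # first byte of column j in the block
        last = first + rows * elem_size - 1   # last byte of column j in the block
        pages.update(range(first // page_size, last // page_size + 1))
    return len(pages)

in_place = pages_touched(64, 64, lda=9000)  # block read directly from a 9000-row matrix
copied = pages_touched(64, 64, lda=64)      # same block after copying to contiguous storage
```

The copied block fits in 8 pages, whereas the in-place block touches at least one page per column, so 64 or more.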
What happens if you do :
./xdl3blastst -n 2000 -A 2 t n -B 2 t n
?
As for your notation, no I don't see how the BLAS can be used to do that equation efficiently. Like I said, the only place where operations like that matter for application performance that I know about are iterative methods, which tend to use sparse storage anyway. Most dense algorithms are dominated by L3BLAS . . .
Cheers,
Clint