#839 Testing a C++ header only library with ATLAS

Developer_(v3.11.x)
closed-out-of-date
Other (122)
5
2014-09-18
2012-07-06
ilja
No

I'd like to compare ATLAS with Eigen, but I'm having a hard time working out what in ATLAS should be modified so that I can substitute or add Eigen versions of some functions to be timed by ATLAS. If I understood TestTime.txt correctly, I should probably start with one program from chapters 2, 4, 5 or 6. Would that be, for example, l3blastst.c, or some lower-level function that I then substitute somewhere?

The Eigen code should be easy to write; for example, matrix and vector arithmetic is trivial in Eigen: http://eigen.tuxfamily.org/dox/TutorialMatrixArithmetic.html#TutorialArithmeticMatrixMul
There is even an example showing how to use Eigen from C code:
https://bitbucket.org/eigen/eigen/src/43de1660cb26/demos/mix_eigen_and_c/binary_library.h
https://bitbucket.org/eigen/eigen/src/43de1660cb26/demos/mix_eigen_and_c/binary_library.cpp
https://bitbucket.org/eigen/eigen/src/43de1660cb26/demos/mix_eigen_and_c/example.c

So if I want to reproduce the first benchmark given in http://eigen.tuxfamily.org/index.php?title=Benchmark (Y += alpha X) for both Eigen and ATLAS using the ATLAS timers, where in the ATLAS code should I start? I'm assuming ATLAS uses only dynamically sized vectors and matrices, so writing one Eigen function should suffice.
Thanks.
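
For reference, the benchmark mentioned above is the BLAS Level 1 AXPY operation. A minimal pure-Python sketch of what it computes, following the usual BLAS convention of a length n plus strides incx and incy. The function name `axpy` and the restriction to positive strides are assumptions made for illustration; this is neither ATLAS nor Eigen code.

```python
def axpy(n, alpha, x, incx, y, incy):
    """Compute y := alpha * x + y over n elements, in place.

    Simplified illustration: only positive strides are handled.
    """
    if incx <= 0 or incy <= 0:
        raise ValueError("this sketch only handles positive strides")
    for i in range(n):
        y[i * incy] += alpha * x[i * incx]
    return y


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
axpy(3, 2.0, x, 1, y, 1)
print(y)  # [12.0, 24.0, 36.0]
```

Each element costs one multiply and one add, so the operation performs 2n flops while reading 2n values and writing n, which is why it is limited by memory bandwidth rather than arithmetic.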

Discussion

(Page 1 of 2)
  • R. Clint Whaley

    R. Clint Whaley - 2012-07-06

    Yes, the ATLAS timers can time any BLAS or lapack implementation. However, they expect a library, not a header file (or C++).

    What does the eigen package provide? Does it provide the C API to BLAS, for instance? If so, is it available as a library, or only a header file?

    Cheers,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-07-06
    • assigned_to: nobody --> rwhaley
     
  • ilja

    ilja - 2012-07-07

    It turns out that Eigen provides an experimental BLAS implementation (http://eigen.tuxfamily.org/index.php?title=Todo#BLAS_implementation_on_top_of_Eigen , https://bitbucket.org/eigen/eigen/src/43de1660cb26/blas), but some symbols seem to be missing when running make xsl1blastst:

    cd /home/ilja/ATLAS-3.9.83/build/src/testing ; make slib
    make[1]: Entering directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
    make -j 6 slib.grd
    make[2]: Entering directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
    make[2]: "slib.grd" is up to date.
    make[2]: Leaving directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
    make[1]: Leaving directory "/home/ilja/ATLAS-3.9.83/build/src/testing"
    /usr/bin/x86_64-unknown-linux-gnu-g++ -fomit-frame-pointer -mfpmath=sse -O2 -msse3 -m64 -o xsl1blastst sl1blastst.o \
       /home/ilja/ATLAS-3.9.83/build/lib/libtstatlas.a /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a /home/ilja/ATLAS-3.9.83/build/lib/libatlas.a -lpthread -lm
    /home/ilja/ATLAS-3.9.83/build/lib/libtstatlas.a(ATL_sf77wrap.o): In function `dswrapdot_':
    ATL_sf77wrap.f:(.text+0x67): undefined reference to `dsdot_'
    /home/ilja/ATLAS-3.9.83/build/lib/libtstatlas.a(ATL_sf77wrap.o): In function `sdswrapdot_':
    ATL_sf77wrap.f:(.text+0x87): undefined reference to `sdsdot_'
    collect2: ld returned 1 exit status
    make: *** [xsl1blastst] Error 1

    Before trying that I had to change
    F77 = /usr/bin/x86_64-unknown-linux-gnu-gfortran
    to
    F77 = /usr/bin/x86_64-unknown-linux-gnu-g++
    otherwise I got this:

    /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `ssyr2_':
    single.cpp:(.text+0x1555): undefined reference to `operator delete[](void*)'
    single.cpp:(.text+0x1578): undefined reference to `operator delete[](void*)'
    /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `stbsv_':
    single.cpp:(.text+0x194c): undefined reference to `operator delete[](void*)'
    ...
    /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `global constructors keyed to saxpy_':
    single.cpp:(.text+0x6a6a): undefined reference to `std::ios_base::Init::Init()'
    ...
    /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(single.cpp.o): In function `Eigen::internal::throw_std_bad_alloc()':
    single.cpp:(.text._ZN5Eigen8internal19throw_std_bad_allocEv[Eigen::internal::throw_std_bad_alloc()]+0xa): undefined reference to `__cxa_allocate_exception'
    ...
    /home/ilja/eigen-3.1.0/build/blas/libeigen_blas_static.a(xerbla.cpp.o): In function `xerbla_':
    xerbla.cpp:(.text+0xa): undefined reference to `std::cerr'
    xerbla.cpp:(.text+0x1b): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)'
    xerbla.cpp:(.text+0x20): undefined reference to `std::cerr'
    ...

    But the level 2 and 3 programs seemed to compile, here's an example output from ./xsl3blastst:

    --------------------------------- GEMM ----------------------------------
    TST# A B M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
    ==== = = ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
    0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.00 0.0 1.00 -----
    0 N N 100 100 100 1.0 1000 1000 1.0 1000 0.00 0.0 0.00 PASS
    1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.00 4000.0 1.00 -----
    1 N N 200 200 200 1.0 1000 1000 1.0 1000 0.00 0.0 0.00 PASS
    2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.00 13500.0 1.00 -----
    2 N N 300 300 300 1.0 1000 1000 1.0 1000 0.00 13500.0 1.00 PASS
    3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.01 15998.0 1.00 -----
    3 N N 400 400 400 1.0 1000 1000 1.0 1000 0.00 32000.0 2.00 PASS
    4 N N 500 500 500 1.0 1000 1000 1.0 1000 0.02 15624.0 1.00 -----
    4 N N 500 500 500 1.0 1000 1000 1.0 1000 0.01 20833.3 1.33 PASS
    5 N N 600 600 600 1.0 1000 1000 1.0 1000 0.03 13499.2 1.00 -----
    5 N N 600 600 600 1.0 1000 1000 1.0 1000 0.02 21597.8 1.60 PASS
    6 N N 700 700 700 1.0 1000 1000 1.0 1000 0.04 15589.8 1.00 -----
    6 N N 700 700 700 1.0 1000 1000 1.0 1000 0.03 21436.2 1.38 PASS
    7 N N 800 800 800 1.0 1000 1000 1.0 1000 0.07 15057.9 1.00 -----
    7 N N 800 800 800 1.0 1000 1000 1.0 1000 0.04 23271.1 1.55 PASS
    8 N N 900 900 900 1.0 1000 1000 1.0 1000 0.09 16567.2 1.00 -----
    8 N N 900 900 900 1.0 1000 1000 1.0 1000 0.06 22779.8 1.37 PASS
    9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 0.13 15624.0 1.00 -----
    9 N N 1000 1000 1000 1.0 1000 1000 1.0 1000 0.09 22725.7 1.45 PASS

    10 tests run, 10 passed

    So eigen is slower with matrices > 300x300?
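
A rough sketch of how the MFLOP and SpUp columns above relate to each other. It assumes the conventional 2*M*N*K flop count for GEMM (the tester's own count may differ slightly) and that the row marked ----- is the reference library while the PASS row is the routine under test. The function names are hypothetical.

```python
def gemm_mflops(m, n, k, seconds):
    """MFLOP rate for C := alpha*A*B + beta*C, counting 2*m*n*k flops."""
    if seconds <= 0.0:
        return 0.0  # timer resolution too coarse to measure this run
    return 2.0 * m * n * k / seconds / 1.0e6


def speedup(test_mflops, ref_mflops):
    """Ratio of the tested rate to the reference rate (the SpUp column)."""
    if ref_mflops <= 0.0:
        return 0.0
    return test_mflops / ref_mflops


# Last pair of rows in the table above (M = N = K = 1000).
print(round(speedup(22725.7, 15624.0), 2))  # 1.45
# A 0.13 s run of that size is roughly 15.4 GFLOP/s; the table's 15624.0
# was presumably computed from the unrounded time.
print(round(gemm_mflops(1000, 1000, 1000, 0.13)))  # 15385
```

With a timer that only resolves about 0.01 s, the smallest cases finish in less than one tick, which would explain the 0.0 MFLOP entries and the erratic SpUp values for the first few rows.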

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-07-07

    OK, looks like they don't provide the full L1BLAS. You can probably add the ATLAS interface routine after the eigen one so that it satisfies any missing symbols, which should let you build the l1timer.

    Yes, from what I can see below, it looks like ATLAS is winning for all the sizes for which the timer is producing reasonable results. Did you install ATLAS without using -DPentiumCPS or -DWALL, or something? These timings look very crude, like when using the default CPU timer, which has very low resolution.

    Anyway, to successfully time smaller problems, throw the -F flag to force the timing to be done multiple times, and then you can get more reasonable small-case timings. E.g., -F 200 or something similar. Just keep cranking up the number until the small cases are more repeatable.
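
The principle behind repeating a measurement can be sketched as follows. This only illustrates the idea of averaging over many calls so that the total run is long compared with the timer resolution; it does not reproduce how the ATLAS timers interpret the number given to -F, and `time_per_call` and `small_kernel` are hypothetical names.

```python
import time


def time_per_call(kernel, reps):
    """Average wall-clock seconds per call of kernel() over reps calls."""
    start = time.perf_counter()
    for _ in range(reps):
        kernel()
    elapsed = time.perf_counter() - start
    return elapsed / reps


def small_kernel():
    # Stand-in workload; a real run would call the BLAS routine under test.
    return sum(i * i for i in range(1000))


# Increase reps until successive measurements agree to the desired precision.
for reps in (1, 10, 200):
    print(reps, time_per_call(small_kernel, reps))
```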

    If your ATLAS install was compiled with CPU time, you may want to reinstall using a wall-clock timer (as described in the install guide) to get better libraries and much better timers.

    Note that you can also time the parallel BLAS; does Eigen provide parallel BLAS? For instance, try xdl3blastst_pt to get the parallel L3BLAS timer.

    Let me know,
    Clint

     
  • ilja

    ilja - 2012-07-07

    Thanks, I forgot about -DPentiumCPS. Eigen doesn't seem to have much parallelization yet, so I won't try to test that.
    I'll start attaching instructions and results; here's my cpuinfo:
    processor : 5
    vendor_id : AuthenticAMD
    cpu family : 16
    model : 10
    model name : AMD Phenom(tm) II X6 1075T Processor
    stepping : 0
    microcode : 0x10000bf
    cpu MHz : 3000.000
    cache size : 512 KB
    physical id : 0
    siblings : 6
    core id : 5
    cpu cores : 6
    apicid : 5
    initial apicid : 5
    fpu : yes
    fpu_exception : yes
    cpuid level : 6
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter
    bogomips : 6000.31
    TLB size : 1024 4K pages
    clflush size : 64
    cache_alignment : 64
    address sizes : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate cpb

     
  • ilja

    ilja - 2012-07-07

    Installation instructions

     
  • ilja

    ilja - 2012-07-07

    I uploaded the results. Most of the time atlas is a bit faster and sometimes a lot faster, but in some cases eigen is noticeably faster:

    ./xcl2blastst -F 10 -R all

    --------------------------------- HEMV ---------------------------------
    TST# UP N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
    ==== == ==== ==== ==== ==== ==== ==== ==== ==== ====== ===== ===== =====
    40 L 100 1.0 0.0 1000 1 1.0 0.0 1 0.00 3467.7 1.00 -----
    40 L 100 1.0 0.0 1000 1 1.0 0.0 1 0.00 2541.5 0.73 PASS
    41 L 200 1.0 0.0 1000 1 1.0 0.0 1 0.00 4607.5 1.00 -----
    41 L 200 1.0 0.0 1000 1 1.0 0.0 1 0.00 3133.2 0.68 PASS
    42 L 300 1.0 0.0 1000 1 1.0 0.0 1 0.00 5261.8 1.00 -----
    42 L 300 1.0 0.0 1000 1 1.0 0.0 1 0.00 3795.6 0.72 PASS
    43 L 400 1.0 0.0 1000 1 1.0 0.0 1 0.00 5632.7 1.00 -----
    43 L 400 1.0 0.0 1000 1 1.0 0.0 1 0.00 4040.3 0.72 PASS
    44 L 500 1.0 0.0 1000 1 1.0 0.0 1 0.00 6000.7 1.00 -----
    44 L 500 1.0 0.0 1000 1 1.0 0.0 1 0.00 4403.2 0.73 PASS
    45 L 600 1.0 0.0 1000 1 1.0 0.0 1 0.00 6137.4 1.00 -----
    45 L 600 1.0 0.0 1000 1 1.0 0.0 1 0.00 4706.3 0.77 PASS
    46 L 700 1.0 0.0 1000 1 1.0 0.0 1 0.00 6142.8 1.00 -----
    46 L 700 1.0 0.0 1000 1 1.0 0.0 1 0.00 4772.4 0.78 PASS
    47 L 800 1.0 0.0 1000 1 1.0 0.0 1 0.00 6389.9 1.00 -----
    47 L 800 1.0 0.0 1000 1 1.0 0.0 1 0.00 4921.3 0.77 PASS
    48 L 900 1.0 0.0 1000 1 1.0 0.0 1 0.00 6426.6 1.00 -----
    48 L 900 1.0 0.0 1000 1 1.0 0.0 1 0.00 5016.0 0.78 PASS
    49 L 1000 1.0 0.0 1000 1 1.0 0.0 1 0.00 6614.7 1.00 -----
    49 L 1000 1.0 0.0 1000 1 1.0 0.0 1 0.00 5088.6 0.77 PASS

    ----------------------------- HPR2 ----------------------------
    TST# UPLO N ALPHA INCX INCY TIME MFLOP SpUp TEST
    ==== ==== ===== ===== ===== ==== ==== ====== ====== ===== =====
    150 L 100 1.0 0.0 1 1 0.00 5588.7 1.00 -----
    150 L 100 1.0 0.0 1 1 0.00 3881.2 0.69 PASS
    151 L 200 1.0 0.0 1 1 0.00 5993.6 1.00 -----
    151 L 200 1.0 0.0 1 1 0.00 4002.9 0.67 PASS
    152 L 300 1.0 0.0 1 1 0.00 6170.5 1.00 -----
    152 L 300 1.0 0.0 1 1 0.00 4082.4 0.66 PASS
    153 L 400 1.0 0.0 1 1 0.00 6135.5 1.00 -----
    153 L 400 1.0 0.0 1 1 0.00 4012.7 0.65 PASS
    154 L 500 1.0 0.0 1 1 0.00 6228.4 1.00 -----
    154 L 500 1.0 0.0 1 1 0.00 4028.5 0.65 PASS
    155 L 600 1.0 0.0 1 1 0.00 6199.2 1.00 -----
    155 L 600 1.0 0.0 1 1 0.00 4065.9 0.66 PASS
    156 L 700 1.0 0.0 1 1 0.00 6188.8 1.00 -----
    156 L 700 1.0 0.0 1 1 0.00 4017.0 0.65 PASS
    157 L 800 1.0 0.0 1 1 0.00 6346.8 1.00 -----
    157 L 800 1.0 0.0 1 1 0.00 4014.2 0.63 PASS
    158 L 900 1.0 0.0 1 1 0.00 6272.5 1.00 -----
    158 L 900 1.0 0.0 1 1 0.00 4065.4 0.65 PASS
    159 L 1000 1.0 0.0 1 1 0.00 6404.4 1.00 -----
    159 L 1000 1.0 0.0 1 1 0.00 4060.1 0.63 PASS

    ./xdl2blastst -F 10 -R all

    ----------------------------- SYMV -----------------------------
    TST# UP N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
    ==== == ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
    40 L 100 1.0 1000 1 1.0 1 0.00 2218.3 1.00 -----
    40 L 100 1.0 1000 1 1.0 1 0.00 1804.1 0.81 PASS
    41 L 200 1.0 1000 1 1.0 1 0.00 2795.6 1.00 -----
    41 L 200 1.0 1000 1 1.0 1 0.00 2248.5 0.80 PASS
    42 L 300 1.0 1000 1 1.0 1 0.00 3051.5 1.00 -----
    42 L 300 1.0 1000 1 1.0 1 0.00 2281.5 0.75 PASS
    43 L 400 1.0 1000 1 1.0 1 0.00 3135.4 1.00 -----
    43 L 400 1.0 1000 1 1.0 1 0.00 1937.8 0.62 PASS
    44 L 500 1.0 1000 1 1.0 1 0.00 3065.5 1.00 -----
    44 L 500 1.0 1000 1 1.0 1 0.00 1968.3 0.64 PASS
    45 L 600 1.0 1000 1 1.0 1 0.00 3071.2 1.00 -----
    45 L 600 1.0 1000 1 1.0 1 0.00 1976.5 0.64 PASS
    46 L 700 1.0 1000 1 1.0 1 0.00 2586.4 1.00 -----
    46 L 700 1.0 1000 1 1.0 1 0.00 1806.9 0.70 PASS
    47 L 800 1.0 1000 1 1.0 1 0.00 2964.4 1.00 -----
    47 L 800 1.0 1000 1 1.0 1 0.00 1940.6 0.65 PASS
    48 L 900 1.0 1000 1 1.0 1 0.00 2541.9 1.00 -----
    48 L 900 1.0 1000 1 1.0 1 0.00 1782.4 0.70 PASS
    49 L 1000 1.0 1000 1 1.0 1 0.00 2456.2 1.00 -----
    49 L 1000 1.0 1000 1 1.0 1 0.00 1603.7 0.65 PASS

    ./xsl2blastst -F 10 -R all

    ----------------------------- SYMV -----------------------------
    TST# UP N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
    ==== == ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
    40 L 100 1.0 1000 1 1.0 1 0.00 2709.0 1.00 -----
    40 L 100 1.0 1000 1 1.0 1 0.00 1731.7 0.64 PASS
    41 L 200 1.0 1000 1 1.0 1 0.00 3858.7 1.00 -----
    41 L 200 1.0 1000 1 1.0 1 0.00 2181.8 0.57 PASS
    42 L 300 1.0 1000 1 1.0 1 0.00 4739.5 1.00 -----
    42 L 300 1.0 1000 1 1.0 1 0.00 2809.3 0.59 PASS
    43 L 400 1.0 1000 1 1.0 1 0.00 5297.4 1.00 -----
    43 L 400 1.0 1000 1 1.0 1 0.00 3272.8 0.62 PASS
    44 L 500 1.0 1000 1 1.0 1 0.00 5459.3 1.00 -----
    44 L 500 1.0 1000 1 1.0 1 0.00 3510.8 0.64 PASS
    45 L 600 1.0 1000 1 1.0 1 0.00 5698.9 1.00 -----
    45 L 600 1.0 1000 1 1.0 1 0.00 3161.3 0.55 PASS
    46 L 700 1.0 1000 1 1.0 1 0.00 5164.9 1.00 -----
    46 L 700 1.0 1000 1 1.0 1 0.00 3134.7 0.61 PASS
    47 L 800 1.0 1000 1 1.0 1 0.00 5652.5 1.00 -----
    47 L 800 1.0 1000 1 1.0 1 0.00 3210.7 0.57 PASS
    48 L 900 1.0 1000 1 1.0 1 0.00 5104.5 1.00 -----
    48 L 900 1.0 1000 1 1.0 1 0.00 3160.4 0.62 PASS
    49 L 1000 1.0 1000 1 1.0 1 0.00 4205.0 1.00 -----
    49 L 1000 1.0 1000 1 1.0 1 0.00 2797.4 0.67 PASS

    ------------------------ TPSV -------------------------
    TST# UPLO TRAN DIAG N INCX TIME MFLOP SpUp TEST
    ==== ==== ==== ==== ==== ==== ====== ====== ===== =====
    90 L N N 100 1 0.00 2408.1 1.00 -----
    90 L N N 100 1 0.00 1743.5 0.72 PASS
    91 L N N 200 1 0.00 2929.3 1.00 -----
    91 L N N 200 1 0.00 1910.7 0.65 PASS
    92 L N N 300 1 0.00 3200.4 1.00 -----
    92 L N N 300 1 0.00 1996.7 0.62 PASS
    93 L N N 400 1 0.00 3285.7 1.00 -----
    93 L N N 400 1 0.00 1996.2 0.61 PASS
    94 L N N 500 1 0.00 3287.1 1.00 -----
    94 L N N 500 1 0.00 2010.1 0.61 PASS
    95 L N N 600 1 0.00 3393.1 1.00 -----
    95 L N N 600 1 0.00 2011.4 0.59 PASS
    96 L N N 700 1 0.00 2945.2 1.00 -----
    96 L N N 700 1 0.00 1914.5 0.65 PASS
    97 L N N 800 1 0.00 2816.6 1.00 -----
    97 L N N 800 1 0.00 1854.9 0.66 PASS
    98 L N N 900 1 0.00 2785.2 1.00 -----
    98 L N N 900 1 0.00 1854.6 0.67 PASS
    99 L N N 1000 1 0.00 2410.4 1.00 -----
    99 L N N 1000 1 0.00 1768.9 0.73 PASS

    ----------------------- SPR ------------------------
    TST# UPLO N ALPHA INCX TIME MFLOP SpUp TEST
    ==== ==== ===== ===== ==== ====== ====== ===== =====
    120 L 100 1.0 1 0.00 2366.5 1.00 -----
    120 L 100 1.0 1 0.00 1690.2 0.71 PASS
    121 L 200 1.0 1 0.00 2916.1 1.00 -----
    121 L 200 1.0 1 0.00 1780.1 0.61 PASS
    122 L 300 1.0 1 0.00 3197.0 1.00 -----
    122 L 300 1.0 1 0.00 1805.5 0.56 PASS
    123 L 400 1.0 1 0.00 3140.1 1.00 -----
    123 L 400 1.0 1 0.00 1799.3 0.57 PASS
    124 L 500 1.0 1 0.00 3134.2 1.00 -----
    124 L 500 1.0 1 0.00 1812.4 0.58 PASS
    125 L 600 1.0 1 0.00 3233.9 1.00 -----
    125 L 600 1.0 1 0.00 1667.8 0.52 PASS
    126 L 700 1.0 1 0.00 2859.8 1.00 -----
    126 L 700 1.0 1 0.00 1733.1 0.61 PASS
    127 L 800 1.0 1 0.00 2667.7 1.00 -----
    127 L 800 1.0 1 0.00 1692.4 0.63 PASS
    128 L 900 1.0 1 0.00 2619.4 1.00 -----
    128 L 900 1.0 1 0.00 1688.7 0.64 PASS
    129 L 1000 1.0 1 0.00 2223.0 1.00 -----
    129 L 1000 1.0 1 0.00 1598.4 0.72 PASS

    -------------------------- SPR2 -------------------------
    TST# UPLO N ALPHA INCX INCY TIME MFLOP SpUp TEST
    ==== ==== ===== ===== ==== ==== ====== ====== ===== =====
    140 L 100 1.0 1 1 0.00 3551.3 1.00 -----
    140 L 100 1.0 1 1 0.00 2579.4 0.73 PASS
    141 L 200 1.0 1 1 0.00 4278.7 1.00 -----
    141 L 200 1.0 1 1 0.00 2747.1 0.64 PASS
    142 L 300 1.0 1 1 0.00 4801.9 1.00 -----
    142 L 300 1.0 1 1 0.00 2812.3 0.59 PASS
    143 L 400 1.0 1 1 0.00 5291.5 1.00 -----
    143 L 400 1.0 1 1 0.00 2775.1 0.52 PASS
    144 L 500 1.0 1 1 0.00 5046.1 1.00 -----
    144 L 500 1.0 1 1 0.00 2817.1 0.56 PASS
    145 L 600 1.0 1 1 0.00 5051.2 1.00 -----
    145 L 600 1.0 1 1 0.00 2842.6 0.56 PASS
    146 L 700 1.0 1 1 0.00 4883.7 1.00 -----
    146 L 700 1.0 1 1 0.00 2720.8 0.56 PASS
    147 L 800 1.0 1 1 0.00 4672.9 1.00 -----
    147 L 800 1.0 1 1 0.00 2633.3 0.56 PASS
    148 L 900 1.0 1 1 0.00 4674.2 1.00 -----
    148 L 900 1.0 1 1 0.00 2725.9 0.58 PASS
    149 L 1000 1.0 1 1 0.00 4057.8 1.00 -----
    149 L 1000 1.0 1 1 0.00 2601.6 0.64 PASS

    ./xzl1blastst -F 10 -R all

    ---------------- ASUM -----------------
    TST# N INCX TIME MFLOP SpUp TEST
    ==== ==== ==== ====== ===== ===== =====
    61 100 1 0.00 2141.7 1.00 -----
    61 100 1 0.00 1878.6 0.88 PASS
    62 200 1 0.00 2194.8 1.00 -----
    62 200 1 0.00 1891.5 0.86 PASS
    63 300 1 0.00 2212.3 1.00 -----
    63 300 1 0.00 1888.8 0.85 PASS
    64 400 1 0.00 2219.1 1.00 -----
    64 400 1 0.00 1885.2 0.85 PASS
    65 500 1 0.00 2225.9 1.00 -----
    65 500 1 0.00 1882.9 0.85 PASS
    66 600 1 0.00 2217.9 1.00 -----
    66 600 1 0.00 1889.2 0.85 PASS
    67 700 1 0.00 2215.9 1.00 -----
    67 700 1 0.00 1884.6 0.85 PASS
    68 800 1 0.00 2203.0 1.00 -----
    68 800 1 0.00 1882.8 0.85 PASS
    69 900 1 0.00 2226.6 1.00 -----
    69 900 1 0.00 1885.0 0.85 PASS
    70 1000 1 0.00 2230.7 1.00 -----
    70 1000 1 0.00 1885.6 0.85 PASS
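
Listings like the ones above can be summarized with a small script. The sketch below is hypothetical and assumes that each data row starts with an integer test number, ends with the four columns TIME, MFLOP, SpUp and TEST, and that the reference row (TEST shown as -----) comes before the row for the routine being tested.

```python
def parse_rows(text):
    """Return {test_number: (reference_mflops, tested_mflops)}."""
    ref = {}
    pairs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or too short to be a data row
        try:
            tst, mflops = int(fields[0]), float(fields[-3])
        except ValueError:
            continue  # banner, header, separator, or summary line
        if fields[-1] == "-----":
            ref[tst] = mflops
        elif tst in ref:
            pairs[tst] = (ref[tst], mflops)
    return pairs


sample = """\
---------------- ASUM -----------------
TST# N INCX TIME MFLOP SpUp TEST
==== ==== ==== ====== ===== ===== =====
61 100 1 0.00 2141.7 1.00 -----
61 100 1 0.00 1878.6 0.88 PASS
62 200 1 0.00 2194.8 1.00 -----
62 200 1 0.00 1891.5 0.86 PASS

10 tests run, 10 passed
"""

for tst, (ref_rate, test_rate) in sorted(parse_rows(sample).items()):
    print(tst, round(test_rate / ref_rate, 2))
```

Recomputing the ratio from the MFLOP column rather than reading SpUp directly gives a usable figure even where the printed TIME has been rounded to 0.00.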

     
  • ilja

    ilja - 2012-07-07

    Does atlas time compound expressions like D = (a*A + b*B) * (c * C) ... where eigen should be very fast?

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-07-07

    BTW, what ATLAS version are you using? I just hugely increased HEMV, SYMV, TRSV and TRMV performance in the most recent releases. I think CGEMV is like 3x faster now on my system.

    If you get 3.9.84, then you can install with -DPentiumCPS= in order to get much more accurate timings as well as the new stuff that will be in the new stable release (3.10.0), which I hope to release next week.

    As for operations, ATLAS provides the BLAS and the 3 factorizations and related routines from LAPACK. It does not optimize single expressions, of course, since it is a library . . .

    Cheers,
    Clint

     
  • ilja

    ilja - 2012-07-07

    I used both ATLAS-3.9.84 and -DPentiumCPS=3000. The installation instructions file has all the commands that I used (from wget ...atlas3.9.84.tar.bz2... to ./xzl3blastst -F 100 -R all | tee xzl3blastst.out).

    It would be nice to compare atlas using multiple blas calls [e.g. x = (a*A + b*B + c*C - d*D) dot e * E] with eigen transforming everything into just one loop...
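
The difference being asked about can be sketched in a few lines, assuming lowercase names are scalars and uppercase names are vectors (as clarified later in the thread). `multi_pass` mimics a sequence of separate BLAS-style calls with temporaries, and `fused` is a single hand-written loop. Both names are hypothetical, and nothing here claims to show what Eigen actually generates for such an expression.

```python
def multi_pass(a, A, b, B, c, C, d, D, e, E):
    """BLAS-style evaluation: one pass over memory per operation."""
    t = [a * v for v in A]                       # t := a*A      (copy + scal)
    t = [tv + b * v for tv, v in zip(t, B)]      # t := t + b*B  (axpy)
    t = [tv + c * v for tv, v in zip(t, C)]      # t := t + c*C  (axpy)
    t = [tv - d * v for tv, v in zip(t, D)]      # t := t - d*D  (axpy)
    u = [e * v for v in E]                       # u := e*E      (copy + scal)
    return sum(tv * uv for tv, uv in zip(t, u))  # x := t . u    (dot)


def fused(a, A, b, B, c, C, d, D, e, E):
    """Single loop: each input element is read once, no temporaries."""
    x = 0.0
    for Ai, Bi, Ci, Di, Ei in zip(A, B, C, D, E):
        x += (a * Ai + b * Bi + c * Ci - d * Di) * (e * Ei)
    return x


A, B, C, D, E = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [1.0, 1.0]
print(multi_pass(1.0, A, 1.0, B, 1.0, C, 1.0, D, 2.0, E))  # 12.0
print(fused(1.0, A, 1.0, B, 1.0, C, 1.0, D, 2.0, E))       # 12.0
```

The multi-pass version rereads and rewrites the temporary `t` on every step, so for long vectors its cost is dominated by memory traffic that the fused loop avoids.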

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-07-07

    Glad to see you are using 84; I hadn't checked your attached files, just scoped the posted one, sorry. Just gave the posted files a quick scope:

    Makes sense ATLAS would lose to well-optimized HEMV: ATLAS's present approach to HEMV and SYMV is fundamentally flawed, because it takes two passes through memory instead of one, and memory is the main bottleneck. I haven't bothered to fix it, because I have yet to find an application where the performance of these routines is important. More broadly, the Level 1 and 2 BLAS don't tend to matter for many applications that I know about, though GEMV and GER can occasionally be important, as in the Hessenberg reduction (eigenvalues).
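
One way to picture the one-pass versus two-pass distinction for a symmetric matrix-vector product is sketched below. It is a simplified illustration (real case only, lower triangle stored as nested lists, alpha = 1 and beta = 0) and an assumption about what "two passes" means; it is not ATLAS's or Eigen's implementation.

```python
def symv_two_pass(n, a, x):
    """y := A*x for symmetric A stored as its lower triangle a[i][j], j <= i.

    Reads the stored triangle twice: once as L*x, once as strict_L^T * x.
    """
    y = [0.0] * n
    for i in range(n):              # pass 1: lower triangle incl. diagonal
        for j in range(i + 1):
            y[i] += a[i][j] * x[j]
    for i in range(n):              # pass 2: mirrored strictly-upper part
        for j in range(i):
            y[j] += a[i][j] * x[i]
    return y


def symv_one_pass(n, a, x):
    """Same result, but every stored element is loaded exactly once."""
    y = [0.0] * n
    for i in range(n):
        for j in range(i):
            aij = a[i][j]
            y[i] += aij * x[j]      # contribution of A[i][j]
            y[j] += aij * x[i]      # contribution of A[j][i] == A[i][j]
        y[i] += a[i][i] * x[i]
    return y


# Lower triangle of [[2, 1, 0], [1, 3, 4], [0, 4, 5]]
a = [[2.0], [1.0, 3.0], [0.0, 4.0, 5.0]]
x = [1.0, 2.0, 3.0]
print(symv_two_pass(3, a, x))  # [4.0, 19.0, 23.0]
print(symv_one_pass(3, a, x))  # [4.0, 19.0, 23.0]
```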

    You can also time other problem sizes. For instance, add -N 1400 3000 400 to time problem sizes from 1400 to 3000 in steps of 400. My guess is that the timings will continue to get worse and worse for eigen. Looks like they are not cache blocking, which is the kiss of death for large linear algebra. If this is true, you will see their performance drop as problem sizes are increased, while ATLAS will continue to improve.
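
Cache blocking restructures the loops so that small tiles of the operands are reused while they are still in cache. A minimal sketch of the loop structure follows. Pure Python shows only the ordering of the work, not the speed benefit, and `matmul_blocked` and the tile size `nb` are illustrative names rather than anything taken from ATLAS or Eigen.

```python
def matmul_naive(A, B, n):
    """Plain triple loop: C := A * B for n x n matrices."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_blocked(A, B, n, nb):
    """Same product, computed tile by tile with nb x nb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # min() handles partial tiles when nb does not divide n
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C


n = 5
A = [[float(i + k) for k in range(n)] for i in range(n)]
B = [[float(i - k) for k in range(n)] for i in range(n)]
print(matmul_blocked(A, B, n, 2) == matmul_naive(A, B, n))  # True
```

In a compiled implementation the tile size would be chosen so that the working set of the three inner loops fits in cache, which keeps performance from dropping as the matrices grow.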

    You can also time the factorizations if eigen provides lapack interfaces. W/o cache blocking, these guys will also die for large problems.

    I can't follow your notation to understand what you are asking. If the equation you give is all scalars, then of course ATLAS can't help. If they are matrix-matrix operations, ATLAS will probably do well without fusion. If they are matrix-vector operations, then the only critical thing is to make only one pass through memory.

    Cheers,
    Clint

     
  • ilja

    ilja - 2012-07-08

    Actually it would seem that atlas is consistently about 46 % faster than eigen up to n=m=k=9000. However, atlas is only 5 to 9 % faster than eigen if n=m=k is an odd number (xsl3blastst):
    ------------------------ GEMM ----------------------------------
    M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
    ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
    1111 1111 1111 1.0 1111 1111 1.0 1111 0.18 15273.2 1.00 -----
    1111 1111 1111 1.0 1111 1111 1.0 1111 0.17 16535.9 1.08 PASS
    1112 1112 1112 1.0 1112 1112 1.0 1112 0.18 15690.5 1.00 -----
    1112 1112 1112 1.0 1112 1112 1.0 1112 0.13 21819.7 1.39 PASS
    1113 1113 1113 1.0 1113 1113 1.0 1113 0.18 15529.9 1.00 -----
    1113 1113 1113 1.0 1113 1113 1.0 1113 0.16 16925.3 1.09 PASS
    1114 1114 1114 1.0 1114 1114 1.0 1114 0.18 15416.4 1.00 -----
    1114 1114 1114 1.0 1114 1114 1.0 1114 0.15 18921.6 1.23 PASS
    1115 1115 1115 1.0 1115 1115 1.0 1115 0.18 15407.0 1.00 -----
    1115 1115 1115 1.0 1115 1115 1.0 1115 0.17 16626.9 1.08 PASS
    1116 1116 1116 1.0 1116 1116 1.0 1116 0.18 15645.1 1.00 -----
    1116 1116 1116 1.0 1116 1116 1.0 1116 0.13 21574.5 1.38 PASS
    1117 1117 1117 1.0 1117 1117 1.0 1117 0.18 15643.9 1.00 -----
    1117 1117 1117 1.0 1117 1117 1.0 1117 0.17 16471.1 1.05 PASS
    1118 1118 1118 1.0 1118 1118 1.0 1118 0.18 15601.0 1.00 -----
    1118 1118 1118 1.0 1118 1118 1.0 1118 0.15 18806.1 1.21 PASS
    1119 1119 1119 1.0 1119 1119 1.0 1119 0.18 15484.5 1.00 -----
    1119 1119 1119 1.0 1119 1119 1.0 1119 0.17 16207.1 1.05 PASS

    For large (even) matrices:
    ------------------------ GEMM ----------------------------------
    M N K ALPHA LDA LDB BETA LDC TIME MFLOP SpUp TEST
    ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
    1000 1000 1000 1.0 1000 1000 1.0 1000 0.12 16370.2 1.00 -----
    1000 1000 1000 1.0 1000 1000 1.0 1000 0.09 22272.8 1.36 PASS
    2000 2000 2000 1.0 2000 2000 1.0 2000 1.01 15918.3 1.00 -----
    2000 2000 2000 1.0 2000 2000 1.0 2000 0.69 23089.3 1.45 PASS
    3000 3000 3000 1.0 3000 3000 1.0 3000 3.42 15788.6 1.00 -----
    3000 3000 3000 1.0 3000 3000 1.0 3000 2.33 23128.0 1.46 PASS
    4000 4000 4000 1.0 4000 4000 1.0 4000 8.07 15861.7 1.00 -----
    4000 4000 4000 1.0 4000 4000 1.0 4000 5.50 23266.5 1.47 PASS
    5000 5000 5000 1.0 5000 5000 1.0 5000 15.70 15928.1 1.00 -----
    5000 5000 5000 1.0 5000 5000 1.0 5000 10.74 23269.0 1.46 PASS
    6000 6000 6000 1.0 6000 6000 1.0 6000 27.06 15963.7 1.00 -----
    6000 6000 6000 1.0 6000 6000 1.0 6000 18.51 23336.2 1.46 PASS
    7000 7000 7000 1.0 7000 7000 1.0 7000 43.06 15930.2 1.00 -----
    7000 7000 7000 1.0 7000 7000 1.0 7000 29.31 23401.6 1.47 PASS
    8000 8000 8000 1.0 8000 8000 1.0 8000 63.97 16007.7 1.00 -----
    8000 8000 8000 1.0 8000 8000 1.0 8000 43.79 23381.7 1.46 PASS
    9000 9000 9000 1.0 9000 9000 1.0 9000 91.77 15887.7 1.00 -----
    9000 9000 9000 1.0 9000 9000 1.0 9000 62.44 23351.7 1.47 PASS

    The even/odd thing was brought up on the eigen mailing list; the thread starts at http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00052.html but the archives haven't caught up yet. Apparently on an Intel CPU eigen and atlas have the same speed, but eigen doesn't slow down if n=m=k=1115 instead of 1116.

    As for my notation, I was thinking that small letters would be scalars and large ones vectors. Changing the equation a bit would also allow for matrices * vectors. Eigen should be able to loop only once through each variable while the C++ source code would have x = (a*A + b*B +...

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-07-08

    Thanks for the info on the odd problem size. Indeed, if you separate out M,N,K, you get a small drop for each one that is odd, down to the worst-case you show when all dims are odd. I think this is down to the way ATLAS handles cleanup. As soon as I release the stable, I'm going to rewrite GEMM for increased performance, but the rewrite I have in mind is likely to make this problem even worse, since cleanup will be even more granular in the new design!
    However, if I'm successful, the raw MFLOP should increase even for the odd cases, since the peak should be increased significantly.
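
As a generic illustration of what cleanup code is, here is a dot product whose main loop is unrolled by two, with a separate step for the leftover element when the length is odd. The unrolling factor and the function name are assumptions chosen for brevity; the thread does not describe ATLAS's actual kernels or how their cleanup is organized.

```python
def dot_unrolled(x, y):
    """Dot product with the main loop unrolled by 2 plus a cleanup step."""
    n = len(x)
    n2 = n - (n % 2)        # largest even count handled by the unrolled loop
    s0 = s1 = 0.0
    for i in range(0, n2, 2):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
    if n2 < n:              # cleanup: the one leftover element when n is odd
        s0 += x[n2] * y[n2]
    return s0 + s1


print(dot_unrolled([1.0, 2.0, 3.0, 4.0], [1.0] * 4))       # 10.0
print(dot_unrolled([1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5))  # 15.0
```

The leftover elements go through a path that is usually less optimized than the main loop, so sizes that are not a multiple of the unrolling factor pay a small penalty.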

    Thanks for posting the further timings. Indeed, it appears ATLAS has some blocking advantage over Eigen, but they are both doing blocking. At a guess, does Eigen not copy A and B? If so, you may have TLB problems on some machines for larger problems.

    What happens if you do:
    ./xdl3blastst -n 2000 -A 2 t n -B 2 t n
    ?

    As for your notation, no, I don't see how the BLAS can be used to do that equation efficiently. Like I said, the only places I know of where operations like that matter for application performance are iterative methods, which tend to use sparse storage anyway. Most dense algorithms are dominated by L3BLAS . . .

    Cheers,
    Clint

     