
Performance: it++ vs armadillo

Started by Y. S. on 2010-06-13; last updated 2014-01-20
  • Bogdan Cristea
    2010-06-15

    I have compared Armadillo with IT++ on openSUSE 11.2, 64-bit, with an AMD
    Athlon 64. IT++ is using the ACML library.

    Armadillo seems to have a slight advantage when adding matrices, but when
    multiplying them IT++ has by far the better performance. Could you verify
    this?

    armadillo: time taken for addition = 0.0152217

    armadillo: time taken for multiplication = 10.3283

    IT++: time taken for addition = 0.0181877

    IT++: time taken for multiplication = 0.308879

    #include <iostream>
    #include <armadillo>

    int main()
    {
      // size and N are hard-coded here, but could instead be
      // specified by the user on the command line
      int size = 1000;
      int N    = 100;

      // Armadillo
      arma::mat A = arma::randu<arma::mat>(size,size);
      arma::mat B = arma::randu<arma::mat>(size,size);
      arma::mat Z = arma::zeros<arma::mat>(size,size);

      arma::wall_clock timer;

      timer.tic();
      for(int i=0; i<N; ++i)
        Z = A+B;  // or Z = A+B+C ... etc
      std::cout << "armadillo: time taken for addition = "
                << timer.toc() / double(N) << std::endl;

      timer.tic();
      for(int i=0; i<N; ++i)
        Z = A*B;  // or Z = A*B*C ... etc
      std::cout << "armadillo: time taken for multiplication = "
                << timer.toc() / double(N) << std::endl;

      return 0;
    }

    #include <iostream>
    #include <itpp/itbase.h>
    #include <itpp/base/random.h>
    #include <itpp/base/timing.h>

    int main()
    {
      int size = 1000;
      int N    = 100;

      // IT++
      itpp::mat A2 = itpp::randu(size,size);
      itpp::mat B2 = itpp::randu(size,size);
      itpp::mat Z2 = itpp::zeros(size,size);

      itpp::Real_Timer timer;

      timer.tic();
      for(int i=0; i<N; ++i)
        Z2 = A2+B2;
      std::cout << "IT++: time taken for addition = "
                << timer.toc() / double(N) << std::endl;

      timer.tic();
      for(int i=0; i<N; ++i)
        Z2 = A2*B2;  // or Z2 = A2*B2*C2 ... etc
      std::cout << "IT++: time taken for multiplication = "
                << timer.toc() / double(N) << std::endl;

      return 0;
    }

     
  • Stephan Ludwig
    2010-06-15

    Hi cristeab & all,

    My performance results are for an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
    on Kubuntu 10.04 Lucid Lynx, 64-bit, with default fftw3, libblas3gf
    (1.2-2build1), liblapack3gf (3.2.1-2), libarmadillo0 (0.8.0-1), and IT++ from
    SVN:

    time ./itpp_perf

    IT++: time taken for addition = 0.00918638

    IT++: time taken for multiplication = 1.88509

    real 3m9.460s

    user 3m9.010s

    sys 0m0.090s

    time ./armadillo_perf

    armadillo: time taken for addition = 0.00815731

    armadillo: time taken for multiplication = 1.91089

    real 3m11.952s

    user 3m11.490s

    sys 0m0.070s

    After installing Armadillo 0.9.10 from its homepage, I get:

    g++ -O3 armadillo_perf.cpp -o armadillo_perf -larmadillo

    time ./armadillo_perf

    armadillo: time taken for addition = 0.0052694

    armadillo: time taken for multiplication = 1.36371

    real 2m16.941s

    user 2m16.650s

    sys 0m0.070s

    and, as recommended by the documentation (note: less optimization, better
    performance!):

    g++ -O1 armadillo_perf.cpp -o armadillo_perf -larmadillo

    time ./armadillo_perf

    armadillo: time taken for addition = 0.0051866

    armadillo: time taken for multiplication = 1.30982

    real 2m11.544s

    user 2m11.490s

    sys 0m0.020s

    /donludovico

     
  • Stephan Ludwig
    2010-06-15

    As I understand it, the better performance of Armadillo comes from delayed
    (lazy) evaluation. Hence, I do not expect superior performance for plain
    addition and multiplication of matrices this large, but perhaps for more
    complex calculations.

    I am not an expert on this topic (just an electrical engineer), but how about
    using Armadillo as a basis for the IT++ signal processing features (maybe
    switchable to the current implementation)?

    What is your opinion: do you think this would be implementable and yield a
    sensible performance improvement?

    /donludovico

     
  • Y. S.
    2010-06-17

    CPU: Intel Core2 Duo E8200 2.66GHz

    OS: SLES11sp1 (2.6.32.12-0.7-default #1 SMP) x86_64

    ACML v4.4

    IT++ v4.0.7 (svn)

    armadillo v0.9.10

    IT++: time taken for addition = 0.0082791

    IT++: time taken for multiplication = 0.111379

    real 0m11.995s

    user 0m22.265s

    sys 0m0.284s

    armadillo: time taken for addition = 0.00572158

    armadillo: time taken for multiplication = 1.48905

    real 2m29.518s

    user 2m29.361s

    sys 0m0.012s

     
  • Hi everyone,

    There is probably some confusion as to why Armadillo gets such a wide range of
    timings for multiplication. There are several reasons: the choice of the
    underlying BLAS library, the optimisation level used, and Armadillo's
    re-ordering of matrix multiplications.

    If BLAS or ATLAS is not installed, Armadillo will use a "better-than-nothing"
    built-in multiplication routine. The performance will be highly dependent on
    the optimisation level you've used for compiling your code (Armadillo is all
    templates; there's nothing pre-compiled). If you use no optimisation, it will
    be quite slow. If you use -O1 or -O2, for small to medium sized matrices the
    performance will be roughly on par with BLAS.

    If BLAS is installed and Armadillo is configured to use it, the matrix
    multiplication will be done by BLAS. Armadillo's configuration can be done
    manually (editing include/armadillo_bits/config.hpp) or via the included CMake
    based installation.
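
    For instance, the relevant switches in config.hpp look roughly like the
    following (a sketch; the exact set of defines varies between Armadillo
    versions):

    // in include/armadillo_bits/config.hpp (uncomment to enable)
    #define ARMA_USE_LAPACK   // use LAPACK for decompositions (inv, svd, eig, ...)
    #define ARMA_USE_BLAS     // use BLAS for matrix multiplication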

    If ATLAS is installed, Armadillo can also make use of that: either directly or
    indirectly. On many Linux installations ATLAS actually intercepts calls to
    BLAS and uses its own routines instead. Armadillo can also be made to use
    ATLAS explicitly, through ATLAS's CBLAS interface. Using ATLAS gives the
    fastest multiplication performance.

    When multiplying 3 or 4 matrices, Armadillo will try to re-order the
    multiplications in order to create the smallest possible intermediary matrices
    -- this also causes a speedup.
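
    As a rough illustration of why the ordering matters (a hypothetical sketch,
    not Armadillo's internal logic), consider a chain whose right-most operand
    is a vector:

    #include <armadillo>

    int main()
    {
      arma::mat A = arma::randu<arma::mat>(1000,1000);
      arma::mat B = arma::randu<arma::mat>(1000,1000);
      arma::vec v = arma::randu<arma::vec>(1000);

      // (A*B)*v first forms a 1000x1000 intermediary: roughly 2e9 flops
      arma::vec slow = (A*B)*v;

      // A*(B*v) only forms a 1000x1 intermediary: roughly 4e6 flops
      arma::vec fast = A*(B*v);

      // writing A*B*v leaves the choice of ordering to Armadillo
      arma::vec best = A*B*v;

      return 0;
    }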

    In general there are 5 main sources of speedups in Armadillo (when compared to
    IT++). Specifically:

    1. Preventing the C++ compiler from making matrix copies
    2. Using pre-allocated memory for small matrices
    3. Combining multiple operations (via recursive templates)
    4. Re-ordering of matrix multiplications
    5. Translating multiple expressions into one BLAS call

    Method 1 is the one which can significantly speed up a typical user program.
    It works as follows.

    When constructing a matrix out of an expression, there are two options:

    (i) the target matrix object doesn't exist

    (ii) the target matrix object already exists

    For (i), we have user code along the lines of:

    mat C = A+B;

    For (ii), we have user code along the lines of:

    mat C;

    C = A+B;

    In the first case, the output of A+B is a matrix, as generated by operator+()
    within IT++. If the compiler is smart enough (and most are these days), it
    will use a technique known as "Return Value Optimization" (RVO) and avoid
    copying the generated matrix into C. Instead, C will be directly generated by
    operator+().

    In the second case, the compiler cannot use RVO. In IT++ this causes a lot
    of slow-downs: a temporary matrix is first generated by operator+(), and then
    copied into C. Twice as much memory is used, and twice as much time is taken.

    In Armadillo the output of operator+() is not a matrix, but a simple object
    which merely contains references to the two objects being added. Within
    Armadillo's Mat<> class, there is a constructor (as well as an operator=)
    which accepts this simple object. The simple object is then evaluated, which
    causes the Mat<> class to add the two matrices referenced by the object.
    During optimisation, the compiler is smart enough to figure out that it can
    throw out the simple object entirely; the resulting machine code looks as if
    the simple object never existed.

    I've described this in a set of lecture notes:

    http://itee.uq.edu.au/~conrad/misc/sanderson_templates_lecture_uqcomp7305.pdf
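
    To make the mechanism concrete, below is a minimal, self-contained sketch of
    the delayed-evaluation idea (a toy Mat and MatSum proxy; Armadillo's actual
    Mat<> class hierarchy is considerably more general):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct Mat;  // forward declaration

    // lightweight proxy returned by operator+(); holds only references,
    // no element data, and performs no arithmetic by itself
    struct MatSum
    {
      const Mat& a;
      const Mat& b;
    };

    struct Mat
    {
      std::size_t n_rows, n_cols;
      std::vector<double> mem;

      Mat(std::size_t r, std::size_t c) : n_rows(r), n_cols(c), mem(r*c, 0.0) {}

      // case (i): constructing from the proxy evaluates the sum directly
      // into the new matrix; no temporary Mat is created
      Mat(const MatSum& s)
        : n_rows(s.a.n_rows), n_cols(s.a.n_cols), mem(s.a.mem.size())
      {
        for(std::size_t i=0; i<mem.size(); ++i)  mem[i] = s.a.mem[i] + s.b.mem[i];
      }

      // case (ii): assignment from the proxy writes straight into the
      // existing storage (size checks omitted for brevity)
      Mat& operator=(const MatSum& s)
      {
        for(std::size_t i=0; i<mem.size(); ++i)  mem[i] = s.a.mem[i] + s.b.mem[i];
        return *this;
      }
    };

    // operator+() does no arithmetic; it merely records the operands
    inline MatSum operator+(const Mat& a, const Mat& b)  { return MatSum{a, b}; }

    int main()
    {
      Mat A(3,3), B(3,3);
      Mat C = A + B;  // case (i): proxy evaluated inside the constructor
      Mat D(3,3);
      D = A + B;      // case (ii): proxy evaluated inside operator=, no temporary
      std::cout << C.mem[0] + D.mem[0] << std::endl;
      return 0;
    }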

    As for the other sources of speed-ups, they can be quite difficult to grasp
    unless one has a good understanding of C++ templates. It would probably take
    another set of lecture notes to explain how to efficiently evaluate the
    following expression:

    X = 0.1*A + 0.2*B;

    The above expression will be quite slow under IT++, as at least 2 temporary
    matrices are unnecessarily generated.

    With regards,

    Conrad

     
  • Here are some timing results on my machine.

    Intel Core2 Duo CPU T7250 @ 2 GHz, 2 MB cache

    Using IT++ as provided with Fedora 12.

    Using Armadillo 0.9.10 RPM downloaded from
    http://arma.sf.net, also on Fedora 12.

    The system has ATLAS installed.

    Compiled with G++ 4.4.3, using -O3

    I've extended the test programs to include a few more involved calculations,
    and also provide results for two scenarios:

    (i) out-of-cache, where size = 1000, N = 100

    (ii) in-cache, where size = 250, N = 400

    For size = 1000, each matrix is 1000x1000, which takes up 1000x1000x8 bytes =
    7.63 MB.

    For size = 250, each matrix is 250x250, which takes up 250x250x8 bytes = 0.48
    MB (hence 3 matrices fit inside the 2 MB cache).
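
    The extended test programs themselves were not posted; a hypothetical
    reconstruction of one of the added Armadillo timing loops, following the
    structure of the earlier listing, might look like:

    #include <iostream>
    #include <armadillo>

    int main()
    {
      int size = 1000;  // out-of-cache case; use size = 250, N = 400 for in-cache
      int N    = 100;

      arma::mat A = arma::randu<arma::mat>(size,size);
      arma::mat B = arma::randu<arma::mat>(size,size);
      arma::mat Z = arma::zeros<arma::mat>(size,size);

      arma::wall_clock timer;

      timer.tic();
      for(int i=0; i<N; ++i)
        Z = 0.1*arma::trans(A)*B + 0.5*A;
      std::cout << "armadillo: time taken for 0.1*trans(A)*B + 0.5*A = "
                << timer.toc() / double(N) << std::endl;

      return 0;
    }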


    (i) out-of-cache: size = 1000, N = 100

    IT++: time taken for addition = 0.012716

    IT++: time taken for multiplication = 0.279902

    IT++: time taken for 0.1*A2 + 0.2*B2 + 0.3*A2 = 0.0513902

    IT++: time taken for 0.1*transpose(A2)*B2 = 0.311981

    IT++: time taken for 0.1*transpose(A2)*B2 + 0.5*A2 = 0.325298

    armadillo: time taken for addition = 0.00826418

    armadillo: time taken for multiplication = 0.275846

    armadillo: time taken for 0.1*A2 + 0.2*B2 = 0.00849165

    armadillo: time taken for 0.1*trans(A2) * B2 = 0.275714

    armadillo: time taken for 0.1*trans(A2) * B2 + 0.5*A2 = 0.285077

    speedup of Armadillo (i.e. IT++ time divided by Armadillo time):

    addition = 1.5387

    multiplication = 1.0147

    0.1*A2 + 0.2*B2 = 6.0519

    0.1*trans(A2) * B2 = 1.1315

    0.1*trans(A2) * B2 + 0.5*A2 = 1.1411

    The same performance for multiplication is not surprising, given that both
    IT++ and Armadillo end up calling dgemm().


    (ii) in-cache: size = 250, N = 400

    IT++: time taken for addition = 0.000407762

    IT++: time taken for multiplication = 0.00564197

    IT++: time taken for 0.1*A2 + 0.2*B2 + 0.3*A2 = 0.00280852

    IT++: time taken for 0.1*transpose(A2)*B2 = 0.00647226

    IT++: time taken for 0.1*transpose(A2)*B2 + 0.5*A2 = 0.00804803

    armadillo: time taken for addition = 0.000250305

    armadillo: time taken for multiplication = 0.0047002

    armadillo: time taken for 0.1*A2 + 0.2*B2 = 0.00026047

    armadillo: time taken for 0.1*trans(A2) * B2 = 0.00468873

    armadillo: time taken for 0.1*trans(A2) * B2 + 0.5*A2 = 0.00589879

    speedup of Armadillo (i.e. IT++ time divided by Armadillo time):

    addition = 1.6291

    multiplication = 1.2004

    0.1*A2 + 0.2*B2 = 10.783

    0.1*trans(A2) * B2 = 1.3804

    0.1*trans(A2) * B2 + 0.5*A2 = 1.3644

    The above tests are obviously rather artificial. There can of course be
    further tests with other functionality (e.g. submatrix access); however, what
    really matters is the final performance of user programs. On average I've
    found that an IT++ user program converted to use Armadillo runs about twice as
    fast (ranging from about 1.5x to 3x).

     
  • Y. S.
    2010-06-24

    Hi Conrad,

    Thank you for your comments.

    Could you please share your benchmark code?

     
  • Do you mean the code for the timings that I previously posted, or other
    more general code?

    If the former, the evaluated math expressions are as given in the output.

    If the latter, I'd like to, but legally I'm constrained: the code is from a
    project that is internal to the company I work for, and I don't own the
    copyright.

     
  • Frank Withers
    2014-01-20

    I've been a long-time IT++ user, but I recently switched to Armadillo. For linear algebra, the function names and syntax in Armadillo are mostly the same as in IT++, although some arguments are switched around.

    I have done a few speed tests, using IT++ 4.3.1 and Armadillo 4.0.2. My machine has an Intel i7 processor @ 3.4 GHz, running Fedora 20 (Linux kernel 3.12 and gcc 4.8.2). I used the system-provided BLAS, LAPACK and ATLAS libraries, and the default installation for both IT++ and Armadillo (cmake . followed by make install).

    Used g++ -O3 when compiling.

    Time for svd() on a 1000x1000 random matrix:
    IT++: 5.377 sec
    Armadillo: 2.604 sec

    Time for inv() on a 1000x1000 random matrix:
    IT++: 1.184 sec
    Armadillo: 0.939 sec

    Time for eig_sym() on a 1000x1000 symmetric random matrix:
    IT++: 2.947 sec
    Armadillo: 1.608 sec
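
    For reference, a minimal sketch of how such timings can be taken with
    Armadillo's wall_clock (forming the symmetric input as 0.5*(A + trans(A)) is
    my assumption, not necessarily how the test above was done):

    #include <iostream>
    #include <armadillo>

    int main()
    {
      arma::mat A = arma::randu<arma::mat>(1000,1000);
      arma::mat S = 0.5 * (A + arma::trans(A));  // symmetric matrix for eig_sym()

      arma::mat U, V, Ainv, eigvec;
      arma::vec s, eigval;

      arma::wall_clock timer;

      timer.tic();
      arma::svd(U, s, V, A);
      std::cout << "svd():     " << timer.toc() << " sec" << std::endl;

      timer.tic();
      Ainv = arma::inv(A);
      std::cout << "inv():     " << timer.toc() << " sec" << std::endl;

      timer.tic();
      arma::eig_sym(eigval, eigvec, S);
      std::cout << "eig_sym(): " << timer.toc() << " sec" << std::endl;

      return 0;
    }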

    Armadillo also has useful linear algebra functions that seem to be missing in IT++, for example pinv(), svd_econ(), eig_pair(), etc. The handling of sub-matrices also seems to be more advanced (e.g. non-contiguous submatrices).

     
  • Bogdan Cristea
    2014-01-20

    Indeed, Armadillo seems to have better optimization when handling external libraries. However, IT++ comes with many signal processing algorithms built upon its API for external libraries, and MATLAB bindings are also available. Probably the best approach would be to find a way to merge these two projects.

     
    Last edit: Bogdan Cristea 2014-01-20