Performance: it++ vs armadillo

  1. 2010-06-13 16:43:00 PDT
    Is this benchmark correct? http://arma.sourceforge.net/speed.html
  2. 2010-06-14 21:54:29 PDT
    I have compared Armadillo with IT++ on openSUSE 11.2 (64-bit) with an AMD Athlon 64, with IT++ using the ACML library. Armadillo seems to have a slight advantage when adding matrices, but when multiplying them IT++ is faster by far. Could you verify this?

        armadillo: time taken for addition = 0.0152217
        armadillo: time taken for multiplication = 10.3283
        IT++: time taken for addition = 0.0181877
        IT++: time taken for multiplication = 0.308879

    Armadillo test program:

        #include <iostream>
        #include <armadillo>

        int main()
        {
          // matrix size and number of repetitions (hard-coded here)
          int size = 1000;
          int N = 100;

          // Armadillo
          arma::mat A = arma::rand(size,size);
          arma::mat B = arma::rand(size,size);
          arma::mat Z = arma::zeros(size,size);
          int i;

          arma::wall_clock timer;

          timer.tic();
          for(i=0; i<N; ++i)
            Z = A+B;  // or Z = A+B+C ... etc
          std::cout << "armadillo: time taken for addition = " << timer.toc() / double(N) << std::endl;

          timer.tic();
          for(i=0; i<N; ++i)
            Z = A*B;
          std::cout << "armadillo: time taken for multiplication = " << timer.toc() / double(N) << std::endl;
        }

    IT++ test program:

        #include <iostream>
        #include <itpp/itbase.h>
        #include <itpp/base/random.h>
        #include <itpp/base/timing.h>

        int main()
        {
          // matrix size and number of repetitions (hard-coded here)
          int size = 1000;
          int N = 100;

          // IT++
          itpp::mat A2 = itpp::randu(size,size);
          itpp::mat B2 = itpp::randu(size,size);
          itpp::mat Z2 = itpp::zeros(size,size);
          int i;

          itpp::Real_Timer timer;

          timer.tic();
          for(i=0; i<N; ++i)
            Z2 = A2+B2;
          std::cout << "IT++: time taken for addition = " << timer.toc() / double(N) << std::endl;

          timer.tic();
          for(i=0; i<N; ++i)
            Z2 = A2*B2;
          std::cout << "IT++: time taken for multiplication = " << timer.toc() / double(N) << std::endl;
        }
  3. 2010-06-14 23:50:19 PDT
    Hi chriteab & all,

    my performance results are for an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz on Kubuntu 10.04 Lucid Lynx 64 bit with the default fftw3, libblas3gf (1.2-2build1), liblapack3gf (3.2.1-2), libarmadillo0 (0.8.0-1), and IT++ from SVN:

        time ./itpp_perf
        IT++: time taken for addition = 0.00918638
        IT++: time taken for multiplication = 1.88509
        real 3m9.460s
        user 3m9.010s
        sys 0m0.090s

        time ./armadillo_perf
        armadillo: time taken for addition = 0.00815731
        armadillo: time taken for multiplication = 1.91089
        real 3m11.952s
        user 3m11.490s
        sys 0m0.070s

    After installing Armadillo 0.9.10 from the homepage, I get:

        g++ -O3 armadillo_perf.cpp -o armadillo_perf -larmadillo
        time ./armadillo_perf
        armadillo: time taken for addition = 0.0052694
        armadillo: time taken for multiplication = 1.36371
        real 2m16.941s
        user 2m16.650s
        sys 0m0.070s

    and, as recommended by the documentation (watch this: less optimization, better performance!):

        g++ -O1 armadillo_perf.cpp -o armadillo_perf -larmadillo
        time ./armadillo_perf
        armadillo: time taken for addition = 0.0051866
        armadillo: time taken for multiplication = 1.30982
        real 2m11.544s
        user 2m11.490s
        sys 0m0.020s

    /donludovico
  4. 2010-06-15 00:12:41 PDT
    As I understand it, the better performance of Armadillo comes from delayed evaluation. Hence, I do not expect a large advantage for plain addition and multiplication, but maybe for more complex calculations. I am not an expert on this topic (just an electrical engineer), but how about using Armadillo as a basis for the IT++ signal processing features (maybe switchable with the current implementation)? What is your opinion: do you think this could be implemented, and would it give a sensible performance improvement?

    /donludovico
  5. 2010-06-17 10:49:02 PDT
    CPU: Intel Core2 Duo E8200 2.66GHz
    OS: SLES11sp1 (2.6.32.12-0.7-default #1 SMP) x86_64
    ACML v4.4
    IT++ v4.0.7 (svn)
    armadillo v0.9.10

        ###################################
        IT++: time taken for addition = 0.0082791
        IT++: time taken for multiplication = 0.111379
        real 0m11.995s
        user 0m22.265s
        sys 0m0.284s
        ###################################
        armadillo: time taken for addition = 0.00572158
        armadillo: time taken for multiplication = 1.48905
        real 2m29.518s
        user 2m29.361s
        sys 0m0.012s
  6. 2010-06-23 04:33:13 PDT
    Hi everyone,

    There is probably some confusion as to why Armadillo gets such a wide range of timings for multiplication. There are several reasons: the choice of the underlying BLAS library, the optimisation level used, and the re-ordering of matrix multiplications by Armadillo.

    If BLAS or ATLAS is not installed, Armadillo will use a "better-than-nothing" built-in multiplication routine. The performance will be highly dependent on the optimisation level you've used for compiling your code (Armadillo is all templates, there's nothing pre-compiled). If you use no optimisation, it will be quite slow. If you use -O1 or -O2, for small to medium sized matrices the performance will be roughly on par with BLAS.

    If BLAS is installed and Armadillo is configured to use it, the matrix multiplication will be done by BLAS. Armadillo's configuration can be done manually (editing include/armadillo_bits/config.hpp) or via the included CMake based installation.

    If ATLAS is installed, Armadillo can also make use of that, either directly or indirectly. On many Linux installations ATLAS actually intercepts calls to BLAS and uses its own routines instead. Armadillo can also be made to use ATLAS explicitly, through ATLAS's CBLAS interface. Using ATLAS gives the fastest multiplication performance.

    When multiplying 3 or 4 matrices, Armadillo will try to re-order the multiplications in order to create the smallest possible intermediary matrices, which also gives a speedup.

    In general there are 5 main sources of speedups in Armadillo (when compared to IT++). Specifically:

      1. Preventing the C++ compiler from making matrix copies
      2. Using pre-allocated memory for small matrices
      3. Combining multiple operations (via recursive templates)
      4. Re-ordering of matrix multiplications
      5. Translating multiple expressions into one BLAS call

    #1 is the method which can significantly speed up a typical user program. The method is as follows.

    When constructing a matrix out of an expression, there are two options:

      (i) the target matrix object doesn't exist
      (ii) the target matrix object already exists

    For (i), we have user code along the lines of:

        mat C = A+B;

    For (ii), we have user code along the lines of:

        mat C;
        C = A+B;

    In the first case, the output of A+B is a matrix, as generated by operator+() within IT++. If the compiler is smart enough (and most are these days), it will use a technique known as "Return Value Optimization" (RVO) and avoid copying the generated matrix into C. Instead, C will be directly generated by operator+().

    In the second case, the compiler cannot use the RVO method. In IT++ this causes a lot of slowdowns. A temporary matrix is first generated by operator+(), which is then copied into C. At this stage twice as much memory has been used, and twice as much time has been taken.

    In Armadillo the output of operator+() is not a matrix, but a simple object which merely contains references to the two objects being added. Within Armadillo's Mat<> class, there is a constructor (as well as an operator=) which accepts this simple object. The simple object is then evaluated, which causes the Mat<> class to add the two matrices referenced by the object. At optimisation time, the compiler is smart enough to figure out that it can actually throw out the simple object. The resultant machine code looks like the simple object never existed.

    I've described this in a set of lecture notes: http://itee.uq.edu.au/~conrad/misc/sanderson_templates_lecture_uqcomp7305.pdf

    As for the other sources of speedups, they can be quite difficult to grasp unless one has a good understanding of C++ templates. It would probably take another set of lecture notes to explain how to efficiently evaluate the following expression:

        X = 0.1*A + 0.2*B;

    The above expression will be quite slow under IT++, as there are at least 2 temporary matrices being unnecessarily generated.

    With regards,
    Conrad
  7. 2010-06-23 10:28:04 PDT
    Here are some timing results on my machine.

    Intel Core2 Duo CPU T7250 @ 2 GHz, 2 MB cache
    Using IT++ as provided with Fedora 12.
    Using Armadillo 0.9.10 RPM downloaded from http://arma.sf.net, also on Fedora 12.
    The system has ATLAS installed.
    Compiled with G++ 4.4.3, using -O3

    I've extended the test programs to include a few more involved calculations, and also provide results for two scenarios:

      (i) out-of-cache, where size = 1000, N = 100
      (ii) in-cache, where size = 250, N = 400

    For size = 1000, each matrix is 1000x1000, which takes up 1000x1000x8 bytes = 7.63 MB. For size = 250, each matrix is 250x250, which takes up 250x250x8 bytes = 0.48 MB (hence 3 matrices fit inside the 2 MB cache).

    ---
    (i) out-of-cache: size = 1000, N = 100

        IT++: time taken for addition = 0.012716
        IT++: time taken for multiplication = 0.279902
        IT++: time taken for 0.1*A2 + 0.2*B2 + 0.3*A2 = 0.0513902
        IT++: time taken for 0.1*transpose(A2)*B2 = 0.311981
        IT++: time taken for 0.1*transpose(A2)*B2 + 0.5*A2 = 0.325298

        armadillo: time taken for addition = 0.00826418
        armadillo: time taken for multiplication = 0.275846
        armadillo: time taken for 0.1*A2 + 0.2*B2 = 0.00849165
        armadillo: time taken for 0.1*trans(A2) * B2 = 0.275714
        armadillo: time taken for 0.1*trans(A2) * B2 + 0.5*A2= 0.285077

    speedup of Armadillo (i.e. IT++ time divided by Armadillo time):

        addition = 1.5387
        multiplication = 1.0147
        0.1*A2 + 0.2*B2 = 6.0519
        0.1*trans(A2) * B2 = 1.1315
        0.1*trans(A2) * B2 + 0.5*A2= 1.1411

    The same performance for multiplication is not surprising, given that both IT++ and Armadillo end up calling dgemm().

    ---
    (ii) in-cache: size = 250, N = 400

        IT++: time taken for addition = 0.000407762
        IT++: time taken for multiplication = 0.00564197
        IT++: time taken for 0.1*A2 + 0.2*B2 + 0.3*A2 = 0.00280852
        IT++: time taken for 0.1*transpose(A2)*B2 = 0.00647226
        IT++: time taken for 0.1*transpose(A2)*B2 + 0.5*A2 = 0.00804803

        armadillo: time taken for addition = 0.000250305
        armadillo: time taken for multiplication = 0.0047002
        armadillo: time taken for 0.1*A2 + 0.2*B2 = 0.00026047
        armadillo: time taken for 0.1*trans(A2) * B2 = 0.00468873
        armadillo: time taken for 0.1*trans(A2) * B2 + 0.5*A2= 0.00589879

    speedup of Armadillo (i.e. IT++ time divided by Armadillo time):

        addition = 1.6291
        multiplication = 1.2004
        0.1*A2 + 0.2*B2 = 10.783
        0.1*trans(A2) * B2 = 1.3804
        0.1*trans(A2) * B2 + 0.5*A2= 1.3644

    The above tests are obviously rather artificial. There can of course be further tests with other functionality (e.g. submatrix access); however, what really matters is the final performance of user programs. On average I've found that an IT++ user program converted to use Armadillo runs about twice as fast (ranging from about 1.5x to 3x).
  8. 2010-06-23 17:02:21 PDT
    Hi Conrad,

    Thank you for your comments. Could you please share your benchmark code?
  9. 2010-06-23 18:40:05 PDT
    Do you mean for the timings that I previously posted, or other more general code? If the former, the evaluated math expressions are as given in the output. If the latter, I'd like to, but legally I'm constrained: the code is from a project that is internal to the company I work for. I don't own the copyright.