Y. S.
2010-06-13
Is this benchmark correct?
Bogdan Cristea
2010-06-15
I have compared Armadillo with IT++ on openSUSE 11.2 (64-bit) with an AMD
Athlon 64. IT++ uses the ACML library.
Armadillo seems to have a slight advantage when adding matrices, but when
multiplying them IT++ has by far the best performance. Could you verify this?
armadillo: time taken for addition = 0.0152217
armadillo: time taken for multiplication = 10.3283
IT++: time taken for addition = 0.0181877
IT++: time taken for multiplication = 0.308879
#include <iostream>
#include <armadillo>

int main()
{
  // size and N are hard-coded here; they could also be read from the command line
  int size = 1000;
  int N = 100;
  //Armadillo
  arma::mat A = arma::rand(size,size);
  arma::mat B = arma::rand(size,size);
  arma::mat Z = arma::zeros(size,size);
  int i;
  arma::wall_clock timer;
  // average time of N matrix additions
  timer.tic();
  for(i=0; i<N; ++i)
    Z = A+B; // or Z = A+B+C ... etc
  std::cout << "armadillo: time taken for addition = " << timer.toc() / double(N) << std::endl;
  // average time of N matrix multiplications
  timer.tic();
  for(i=0; i<N; ++i)
    Z = A*B;
  std::cout << "armadillo: time taken for multiplication = " << timer.toc() / double(N) << std::endl;
  return 0;
}
#include <iostream>
#include <itpp/itbase.h>

int main()
{
  int size = 1000;
  int N = 100;
  //IT++
  itpp::mat A2 = itpp::randu(size,size);
  itpp::mat B2 = itpp::randu(size,size);
  itpp::mat Z2 = itpp::zeros(size,size);
  int i;
  itpp::Real_Timer timer;
  // average time of N matrix additions
  timer.tic();
  for(i=0; i<N; ++i)
    Z2 = A2+B2;
  std::cout << "IT++: time taken for addition = " << timer.toc() / double(N) << std::endl;
  // average time of N matrix multiplications
  timer.tic();
  for(i=0; i<N; ++i)
    Z2 = A2*B2;
  std::cout << "IT++: time taken for multiplication = " << timer.toc() / double(N) << std::endl;
  return 0;
}
Stephan Ludwig
2010-06-15
Hi chriteab & all,
my performance results are for Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz on
Kubuntu 10.04 Lucid Lynx 64 bit with default fftw3, libblas3gf (1.2-2build1),
liblapack3gf (3.2.1-2), libarmadillo0 (0.8.0-1), IT++ from SVN:
time ./itpp_perf
IT++: time taken for addition = 0.00918638
IT++: time taken for multiplication = 1.88509
real 3m9.460s
user 3m9.010s
sys 0m0.090s
time ./armadillo_perf
armadillo: time taken for addition = 0.00815731
armadillo: time taken for multiplication = 1.91089
real 3m11.952s
user 3m11.490s
sys 0m0.070s
After installing Armadillo 0.9.10 from the homepage, I get:
g++ -O3 armadillo_perf.cpp -o armadillo_perf -larmadillo
time ./armadillo_perf
armadillo: time taken for addition = 0.0052694
armadillo: time taken for multiplication = 1.36371
real 2m16.941s
user 2m16.650s
sys 0m0.070s
and as recommended by the documentation (watch this: less optimization, better
performance!)
g++ -O1 armadillo_perf.cpp -o armadillo_perf -larmadillo
time ./armadillo_perf
armadillo: time taken for addition = 0.0051866
armadillo: time taken for multiplication = 1.30982
real 2m11.544s
user 2m11.490s
sys 0m0.020s
/donludovico
Stephan Ludwig
2010-06-15
As I understand it, the better performance of Armadillo comes from delayed
evaluation. Hence, I do not expect a large performance advantage for plain
addition and multiplication, but maybe for more complex calculations.
I am not an expert on this topic (just an electrical engineer), but how about
using Armadillo as a basis for the IT++ signal processing features (maybe
switchable with the current implementation)?
What is your opinion: do you think this could be implemented, and would it be a
sensible performance improvement?
/donludovico
Y. S.
2010-06-17
CPU: Intel Core2 Duo E8200 2.66GHz
OS: SLES11sp1 (2.6.32.12-0.7-default #1 SMP) x86_64
ACML v4.4
IT++ v4.0.7 (svn)
armadillo v0.9.10
IT++: time taken for addition = 0.0082791
IT++: time taken for multiplication = 0.111379
real 0m11.995s
user 0m22.265s
sys 0m0.284s
armadillo: time taken for addition = 0.00572158
armadillo: time taken for multiplication = 1.48905
real 2m29.518s
user 2m29.361s
sys 0m0.012s
Conrad Sanderson
2010-06-23
Hi everyone,
There is probably some confusion as to why Armadillo gets such a wide range of
timings for multiplication. There are several reasons: the choice of the
underlying BLAS library, the optimisation level used, and the re-ordering of
matrix multiplications by Armadillo.
If BLAS or ATLAS is not installed, Armadillo will use a "better-than-nothing"
built-in multiplication routine. The performance will be highly dependent on
what optimisation level you've used for compiling your code (Armadillo is all
templates, there's nothing pre-compiled). If you use no optimisation, it will
be quite slow. If you use -O1 or -O2, for small to medium sized matrices the
performance will be roughly on par with BLAS.
If BLAS is installed and Armadillo is configured to use it, the matrix
multiplication will be done by BLAS. Armadillo's configuration can be done
manually (editing include/armadillo_bits/config.hpp) or via the included
CMake-based installation.
If ATLAS is installed, Armadillo can also make use of that: either directly or
indirectly. On many Linux installations ATLAS actually intercepts calls to
BLAS and uses its own routines instead. Armadillo can also be made to use
ATLAS explicitly, through ATLAS's CBLAS interface. Using ATLAS gives the
fastest multiplication performance.
When multiplying 3 or 4 matrices, Armadillo will try to re-order the
multiplications in order to create the smallest possible intermediary matrices
-- this also causes a speedup.
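To illustrate why the evaluation order matters, here is a minimal sketch that compares the two ways of evaluating A*B*C. It counts scalar multiplications of a textbook product and the size of the intermediate matrix. The names Dims, mul_cost, cost_left_first, cost_right_first and intermediate_elems_* are hypothetical helpers made up for this example; this is not Armadillo's actual heuristic or code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Armadillo): dimensions of a dense matrix.
struct Dims {
    std::uint64_t rows;
    std::uint64_t cols;
};

// Scalar multiplications in a textbook (m x k) * (k x n) product: m*k*n.
std::uint64_t mul_cost(Dims a, Dims b) {
    assert(a.cols == b.rows);  // inner dimensions must agree
    return a.rows * a.cols * b.cols;
}

// Elements in the intermediate matrix when A*B is formed first.
std::uint64_t intermediate_elems_left_first(Dims a, Dims b) {
    return a.rows * b.cols;
}

// Elements in the intermediate matrix when B*C is formed first.
std::uint64_t intermediate_elems_right_first(Dims b, Dims c) {
    return b.rows * c.cols;
}

// Cost of (A*B)*C: the intermediate A*B has size A.rows x B.cols.
std::uint64_t cost_left_first(Dims a, Dims b, Dims c) {
    Dims ab{a.rows, b.cols};
    return mul_cost(a, b) + mul_cost(ab, c);
}

// Cost of A*(B*C): the intermediate B*C has size B.rows x C.cols.
std::uint64_t cost_right_first(Dims a, Dims b, Dims c) {
    Dims bc{b.rows, c.cols};
    return mul_cost(b, c) + mul_cost(a, bc);
}
```

For A and B of size 1000x1000 and C of size 1000x1, forming B*C first needs a 1000-element intermediate and 2,000,000 multiplications, whereas forming A*B first needs a 1,000,000-element intermediate and 1,001,000,000 multiplications.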
In general there are 5 main sources of speedups in Armadillo (when compared to
IT++). Specifically:
method is as follows.
When constructing a matrix out of an expression, there are two options:
(i) the target matrix object doesn't exist
(ii) the target matrix object already exists
For (i), we have user code along the lines of:
mat C = A+B;
For (ii), we have user code along the lines of:
mat C;
C = A+B;
In the first case, the output of A+B is a matrix, as generated by operator+()
within IT++. If the compiler is smart enough (and most are these days), it
will use a technique known as "Return Value Optimization" (RVO) and avoid
copying the generated matrix into C. Instead, C will be directly generated by
operator+().
In the second case, the compiler cannot use the RVO method. In IT++ this
causes a lot of slowdowns. A temporary matrix is first generated by
operator+(), which is then copied into C. At this stage twice as much memory
has been used, and twice as much time has been taken.
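The difference between cases (i) and (ii) can be made visible with a toy class that counts copies. This is only a sketch: NaiveMat, g_copies and the two copies_when_* functions are hypothetical names for this example and are neither IT++ nor Armadillo code. The class deliberately declares only copy operations (no move operations), to mirror a C++03-era library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts element-wise copies of matrix data made by the toy class below.
int g_copies = 0;

// Hypothetical toy matrix. Declaring the copy operations suppresses the
// implicit move operations, so temporaries are copied, as in C++03.
struct NaiveMat {
    std::vector<double> data;

    explicit NaiveMat(std::size_t n, double v = 0.0) : data(n, v) {}
    NaiveMat(const NaiveMat& other) : data(other.data) { ++g_copies; }
    NaiveMat& operator=(const NaiveMat& other) {
        data = other.data;
        ++g_copies;
        return *this;
    }
};

// Eager addition: the result is a full matrix built inside operator+().
NaiveMat operator+(const NaiveMat& a, const NaiveMat& b) {
    assert(a.data.size() == b.data.size());
    NaiveMat result(a.data.size());
    for (std::size_t i = 0; i < result.data.size(); ++i)
        result.data[i] = a.data[i] + b.data[i];
    return result;  // compilers commonly elide the copy here (RVO)
}

// Case (i): the target does not exist yet.
int copies_when_constructing(std::size_t n) {
    NaiveMat A(n, 1.0), B(n, 2.0);
    g_copies = 0;
    NaiveMat C = A + B;  // C can be built in place by operator+()
    assert(C.data.size() == n);
    return g_copies;
}

// Case (ii): the target already exists.
int copies_when_assigning(std::size_t n) {
    NaiveMat A(n, 1.0), B(n, 2.0);
    NaiveMat C(n);
    g_copies = 0;
    C = A + B;  // a temporary holds A+B, then it is copied into C
    assert(C.data.size() == n);
    return g_copies;
}
```

With copy elision in effect, copies_when_constructing() typically reports no copies, while copies_when_assigning() reports one extra copy of the whole matrix.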
In Armadillo the output of operator+() is not a matrix, but a simple object
which merely contains references to the two objects being added. Within
Armadillo's Mat<> class, there is a constructor (as well as operator= ), which
accepts the above simple object. The simple object is then evaluated, which
causes the Mat<> class to add the two matrices given by the object. At
optimisation time, the compiler is smart enough to figure out that it can
actually throw out the simple object. The resultant machine code looks like
the simple object never existed.
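A stripped-down sketch of this idea follows. The names Mat and SumExpr are hypothetical and chosen for this example; Armadillo's real classes are more general, so this only illustrates the principle for a single addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Mat;  // forward declaration, needed by SumExpr

// Hypothetical "simple object" returned by operator+().
// It holds references to the operands and performs no arithmetic.
struct SumExpr {
    const Mat& lhs;
    const Mat& rhs;
};

struct Mat {
    std::vector<double> data;

    explicit Mat(std::size_t n, double v = 0.0) : data(n, v) {}

    // Case (i): construct the target directly from the expression.
    Mat(const SumExpr& e) : data(e.lhs.data.size()) { evaluate(e); }

    // Case (ii): assign the expression to an existing target.
    Mat& operator=(const SumExpr& e) {
        data.resize(e.lhs.data.size());
        evaluate(e);
        return *this;
    }

    // Adds the operands element by element, writing straight into this
    // matrix, so no temporary matrix is created in either case.
    void evaluate(const SumExpr& e) {
        assert(e.lhs.data.size() == e.rhs.data.size());
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = e.lhs.data[i] + e.rhs.data[i];
    }
};

// operator+() only records what should be added.
inline SumExpr operator+(const Mat& a, const Mat& b) {
    return SumExpr{a, b};
}
```

Both Mat C = A + B; and C = A + B; then run the same single loop over the elements. Because SumExpr and operator+() are trivial, an optimising compiler can usually inline them away.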
I've described this in a set of lecture notes:
http://itee.uq.edu.au/~conrad/misc/sanderson_templates_lecture_uqcomp7305.pdf
As for the other sources of speed-ups, they can be quite difficult to
grasp unless one has a good understanding of C++ templates. It would probably
take another set of lecture notes in order to explain how to efficiently
evaluate the following expression:
X = 0.1*A + 0.2*B;
The above expression will be quite slow under IT++, as there are at least 2
temporary matrices being unnecessarily generated.
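For reference, the loop that an efficient evaluation of such an expression boils down to can be written by hand. The function scaled_sum below is a hypothetical helper for illustration only; it says nothing about how either library arrives at this loop, only what the temporary-free result looks like.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes X = alpha*A + beta*B in a single pass.
// Each element is read once from A and B and written once to X, so no
// temporary for alpha*A, beta*B or their sum is ever created.
void scaled_sum(std::vector<double>& X,
                double alpha, const std::vector<double>& A,
                double beta, const std::vector<double>& B) {
    assert(A.size() == B.size());
    X.resize(A.size());
    for (std::size_t i = 0; i < X.size(); ++i)
        X[i] = alpha * A[i] + beta * B[i];
}
```

A call such as scaled_sum(X, 0.1, A, 0.2, B) corresponds to the expression above, with the matrices stored as flat arrays.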
With regards,
Conrad
Conrad Sanderson
2010-06-23
Here are some timing results on my machine.
Intel Core2 Duo CPU T7250 @ 2 GHz, 2 Mb cache
Using IT++ as provided with Fedora 12.
Using Armadillo 0.9.10 RPM downloaded from
http://arma.sf.net, also on Fedora 12.
The system has Atlas installed.
Compiled with G++ 4.4.3, using -O3
I've extended the test programs to include a few more involved calculations,
and also provide results for two scenarios:
(i) out-of-cache, where size = 1000, N = 100
(ii) in-cache, where size = 250, N = 400
For size = 1000, each matrix is 1000x1000, which takes up 1000x1000x8 bytes =
7.63 Mb.
For size = 250, each matrix is 250x250, which takes up 250x250x8 bytes = 0.48
Mb (hence 3 matrices fit inside the 2Mb cache).
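The footprint arithmetic above can be checked with a few lines of code. The helper names matrix_bytes, to_mib and matrices_in_cache are hypothetical, and the sketch assumes an 8-byte double and a cache of 2 x 1024 x 1024 bytes.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");

// Bytes needed to store a dense size x size matrix of doubles.
std::size_t matrix_bytes(std::size_t size) {
    return size * size * sizeof(double);
}

// Converts bytes to units of 1024*1024 bytes, the unit behind "7.63".
double to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Number of whole matrices of the given size that fit in the cache.
std::size_t matrices_in_cache(std::size_t size, std::size_t cache_bytes) {
    assert(size > 0);
    return cache_bytes / matrix_bytes(size);
}
```

For size = 1000 this gives 8,000,000 bytes per matrix, which is more than the whole cache. For size = 250 it gives 500,000 bytes, so the three matrices used in the benchmark (two operands and the result) fit in the cache together.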
(i) out-of-cache: size = 1000, N = 100
IT++: time taken for addition = 0.012716
IT++: time taken for multiplication = 0.279902
IT++: time taken for 0.1*A2 + 0.2*B2 + 0.3*A2 = 0.0513902
IT++: time taken for 0.1*transpose(A2)*B2 = 0.311981
IT++: time taken for 0.1*transpose(A2)*B2 + 0.5*A2 = 0.325298
armadillo: time taken for addition = 0.00826418
armadillo: time taken for multiplication = 0.275846
armadillo: time taken for 0.1*A2 + 0.2*B2 = 0.00849165
armadillo: time taken for 0.1*trans(A2) * B2 = 0.275714
armadillo: time taken for 0.1*trans(A2) * B2 + 0.5*A2 = 0.285077
speedup of Armadillo (i.e. IT++ time divided by Armadillo time):
addition = 1.5387
multiplication = 1.0147
0.1*A2 + 0.2*B2 = 6.0519
0.1*trans(A2) * B2 = 1.1315
0.1*trans(A2) * B2 + 0.5*A2 = 1.1411
The same performance for multiplication is not surprising, given that both
IT++ and Armadillo end up calling dgemm().
(ii) in-cache: size = 250, N = 400
IT++: time taken for addition = 0.000407762
IT++: time taken for multiplication = 0.00564197
IT++: time taken for 0.1*A2 + 0.2*B2 + 0.3*A2 = 0.00280852
IT++: time taken for 0.1*transpose(A2)*B2 = 0.00647226
IT++: time taken for 0.1*transpose(A2)*B2 + 0.5*A2 = 0.00804803
armadillo: time taken for addition = 0.000250305
armadillo: time taken for multiplication = 0.0047002
armadillo: time taken for 0.1*A2 + 0.2*B2 = 0.00026047
armadillo: time taken for 0.1*trans(A2) * B2 = 0.00468873
armadillo: time taken for 0.1*trans(A2) * B2 + 0.5*A2 = 0.00589879
speedup of Armadillo (i.e. IT++ time divided by Armadillo time):
addition = 1.6291
multiplication = 1.2004
0.1*A2 + 0.2*B2 = 10.783
0.1*trans(A2) * B2 = 1.3804
0.1*trans(A2) * B2 + 0.5*A2 = 1.3644
The above tests are obviously rather artificial. Further tests with other
functionality (e.g. submatrix access) are of course possible; however, what
really matters is the final performance of user programs. On average I've
found that an IT++ user program converted to use Armadillo runs about twice as
fast (ranges from about 1.5x to 3x).
Y. S.
2010-06-24
Hi Conrad,
Thank you for your comments.
Could you please share your benchmark code?
Conrad Sanderson
2010-06-24
Do you mean for the timings that I previously posted, or other more general
code?
If the former, the evaluated math expressions are as given in the output.
If the latter, I'd like to, however legally I'm constrained: the code is from
a project that is internal to the company I work for. I don't own the copyright.
Frank Withers
2014-01-20
I've been a long-time IT++ user, but I recently switched to Armadillo. For linear algebra, the function names and syntax in Armadillo are mostly the same as in IT++, except that some arguments are switched around.
I have done a few speed tests, using IT++ 4.3.1 and Armadillo 4.0.2. My machine has an Intel i7 processor @ 3.4 GHz, running Fedora 20 (Linux kernel 3.12 and gcc 4.8.2). I used the system-provided BLAS, LAPACK and ATLAS libraries, and the default installation for both IT++ and Armadillo (cmake . followed by make install).
I used g++ -O3 when compiling.
Time for svd() on a 1000x1000 random matrix:
IT++: 5.377 sec
Armadillo: 2.604 sec
Time for inv() on a 1000x1000 random matrix:
IT++: 1.184 sec
Armadillo: 0.939 sec
Time for eig_sym() on a 1000x1000 symmetric random matrix:
IT++: 2.947 sec
Armadillo: 1.608 sec
Armadillo also has useful linear algebra functions that seem to be missing in IT++, for example pinv(), svd_econ(), eig_pair(), etc. The handling of sub-matrices also seems to be more advanced (e.g. non-contiguous submatrices).
Bogdan Cristea
2014-01-20
Indeed, Armadillo seems to be better optimized when handling external libraries. However, IT++ comes with many signal processing algorithms built upon this API for external libraries, and MATLAB bindings are also available. Probably the best approach would be to find a way to merge these two projects.