2010-06-23 04:33:13 PDT
Hi everyone,
There is probably some confusion as to why Armadillo gets such a wide range of timings for multiplication. The are several reasons: the choice of the underlying BLAS library, the optimisation level used, and the matrix order re-ordering by Armadillo.
If BLAS or ATLAS is not installed, Armadillo will use a "better-than-nothing" built in multiplication routine. The performance will be highly dependant on what optimisation level you've used for compiling your code (Armadillo is all templates, there's nothing pre-compiled). If you use no optimisation, it will be quite slow. If you use -O1 or -O2, for small to medium sized matrices the performance will be roughly on par with BLAS.
If BLAS is installed and Armadillo is configured to use it, the matrix multplication will be done by BLAS. Armadillo's configuration can be done manually (editing include/armadillo_bits/config.hpp) or via the included CMake based installation.
If ATLAS is installed, Armadillo can also make use of that: either directly or indirectly. On many Linux installations ATLAS actually intercepts calls to BLAS and uses its own routines instead. Armadillo can also be made to use ATLAS explicitly, through ATLAS's CBLAS interface. Using ATLAS gives the fastest mutplication performance.
When multiplying 3 and 4 matrices, Armadillo will try to re-order the multplications in order to create the smallest possible intermediary matrices -- this also causes a speedup.
In general there are 5 main sources of speedups in Armadillo (when compared to IT++). Specifically:
1. Preventing the C++ compiler from making matrix copies
2. Using pre-allocated memory for small matrices
3. Combining multiple operations (via recursive templates)
4. Re-ordering of matrix multiplications
5. Translating multiple expressions into one BLAS call
#1 is the method which can significantly speed-up a typical user program. The method is as follows.
When constructing a matrix out of an expression, there are two options:
(i) the target matrix object doesn't exist
(ii) the target matrix object already exists
For (i), we have user code along the lines of:
mat C = A+B;
For (ii), we have user code along the lines of:
mat C;
C = A+B;
In the first case, the output of A+B is a matrix, as generated by operator+() within IT++. If the compiler is smart enough (and most are these days), it will use a technique known as "Return Value Optimization" (RVO) and avoid copying the generated matrix into C. Instead, C will be directly generated by operator+().
In the second case, the compiler cannot use the RVO method. In IT++ this causes a lot of slow downs. A temporary matrix is first generated by operator+(), which is then copied into C. At this stage twice as much memory was used, and twice as much time was taken.
In Armadillo the output of operator+() is not a matrix, but a simple object which merely contains references to the two objects being added. Within Armadillo's Mat<> class, there is a constructor (as well as operator= ), which accepts the above simple object. The simple object is then evaluated, which causes the Mat<> class to add the two matrices given by the object. At optimisation time, the compiler is smart enough to figure out that it can actually throw out the simple object. The resultant machine code looks like the simple object never existed.
I've described this in a set of lecture notes:
http://itee.uq.edu.au/~conrad/misc/sanderson_templates_lecture_uqcomp7305.pdf
As for the the other sources of speed-ups, they can be quite difficult to grasp unless one has a good understanding of C++ templates. It would probably take another set of lecture notes in order to explain how to efficiently evaluate the following expression:
X = 0.1*A + 0.2*B;
The above expression will be quite slow under IT++, as there are at least 2 temporary matrices being unnecessarily generated.
With regards,
Conrad