I successfully ported my code from uBLAS to ViennaCL, but the performance degraded.
uBLAS: Elapsed time: 3' 22.833 sec
ViennaCL: Elapsed time: 7' 38.710 sec
The code is running on a CPU with 24 cores and they are all reported 100% busy by top.
The differences in the code are minimal.
The core of the algorithm is this:
Vector gb1(numHidden);
# ifdef USE_UBLAS
noalias(gb1) = element_div(prod(w2, gb2), hprimeInv);
# else
gb1 = element_div(prod(w2, gb2), hprimeInv);
# endif
const size_t fsize = features.size();
#pragma omp parallel for collapse(2)
for (size_t i = 0; i < fsize; i++)
    for (int j = 0; j < (int)numHidden; j++)
        w1(features[i], j) -= gb1(j) * LR;
b1 -= gb1 * LR;
if (numLayers == 2) {
Matrix gwh(numHidden, numHidden);
Vector gbh(numHidden);
wh -= gwh * LR;
bh -= gbh * LR;
}
w2 -= gw2 * LR;
b2 -= gb2 * LR;
where the differences are hidden just in the definitions of Vector and Matrix:
#ifdef USE_UBLAS
typedef boost::numeric::ublas::vector<double> Vector;
typedef matrix<double> Matrix;
#else
typedef viennacl::vector<double> Vector;
typedef viennacl::matrix<double, viennacl::row_major> Matrix;
#endif
What is more disappointing is that neither version improves on the sequential build, which takes:
uBlas sequential (no omp): Elapsed time: 2' 15.350 sec
This is even more disappointing because another version of the algorithm, which uses no library at all, gained more than an order of magnitude by using code like this:
#pragma omp parallel for
for (j = 0; j < l; j++)
    data[j] = (Qfloat)(this->*kernel_function)(real_i, j);
Possibly the benefit here derives from the parallel computation of the function.
Should I conclude that I cannot expect benefits from OpenMP on simple array operations?
Thank you
_
Hi,
which backend are you using for the CPU? OpenMP or OpenCL? Which vector sizes are you using?
What is the type of LR? Did you define NDEBUG to get rid of all the assertions?
The CPU/OpenMP backend is still fairly new in ViennaCL; it is currently more efficient for the sparse case than for the dense case. For 'small' workloads it is indeed hard to get good performance with OpenMP, because the thread startup overhead needs to be amortized. As a rule of thumb, an operation should involve at least 10k-100k operations to see any notable gain.
Best regards,
Karli