Vienna vs uBLAS performance

  • Giuseppe Attardi

    I successfully ported my code from uBLAS to ViennaCL, but the performance degraded.

    uBLAS: Elapsed time: 3' 22.833 sec
    Vienna: Elapsed time: 7' 38.710 sec

    The code is running on a CPU with 24 cores and they are all reported 100% busy by top.

    The differences in the code are minimal.
    The core of the algorithm is this:

      Vector gb1(numHidden);
    # ifdef USE_UBLAS
      noalias(gb1) = element_div(prod(w2, gb2), hprimeInv);
    # else
      gb1 = element_div(prod(w2, gb2), hprimeInv);
    # endif

      const size_t fsize = features.size();
    #pragma omp parallel for collapse(2)
      for (size_t i = 0; i < fsize; i++)
        for (int j = 0; j < (const int)numHidden; j++)
          w1(features[i], j) -= gb1(j) * LR;

      b1 -= gb1 * LR;
      if (numLayers == 2) {
        Matrix gwh(numHidden, numHidden);
        Vector gbh(numHidden);
        wh -= gwh * LR;
        bh -= gbh * LR;
      }
      w2 -= gw2 * LR;
      b2 -= gb2 * LR;

    where the differences are hidden just in the definitions of Vector and Matrix:

    #ifdef USE_UBLAS
    typedef boost::numeric::ublas::vector<double>   Vector;
    typedef boost::numeric::ublas::matrix<double>   Matrix;
    #else
    typedef viennacl::vector<double>                Vector;
    typedef viennacl::matrix<double, viennacl::row_major>    Matrix;
    #endif
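
    To make the cost of the update loop concrete without either library, here is a minimal sketch on plain std::vector storage. DenseRowMajor and update_active_rows are hypothetical names, not uBLAS or ViennaCL API, and the sketch assumes that features holds no duplicate row indices. It parallelizes over the active rows only, so each thread sweeps one contiguous row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a row-major Matrix: rows * cols doubles in one buffer.
struct DenseRowMajor {
    std::size_t rows, cols;
    std::vector<double> data;
    DenseRowMajor(std::size_t r, std::size_t c, double init = 0.0)
        : rows(r), cols(c), data(r * c, init) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// Subtract lr * gb1 from every row of w1 whose index is listed in `features`.
// Assumption: `features` contains no duplicates; otherwise two threads could
// write to the same row at the same time.
void update_active_rows(DenseRowMajor& w1,
                        const std::vector<std::size_t>& features,
                        const std::vector<double>& gb1,
                        double lr)
{
    assert(gb1.size() <= w1.cols);
    const long fsize = static_cast<long>(features.size());
    const std::size_t numHidden = gb1.size();
#pragma omp parallel for
    for (long i = 0; i < fsize; ++i) {
        assert(features[i] < w1.rows);
        double* row = &w1.data[features[i] * w1.cols];  // start of the active row
        for (std::size_t j = 0; j < numHidden; ++j)
            row[j] -= gb1[j] * lr;                      // one multiply-subtract per element
    }
}
```

    The whole update touches fsize * numHidden elements with one multiply-subtract each, so it is likely limited by memory traffic rather than arithmetic. Whether element access through w1(i, j) adds further cost for a particular library type is worth checking in that library's documentation.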

    What is more disappointing is that neither version improves on the sequential version, which takes:

    uBLAS sequential (no omp): Elapsed time: 2' 15.350 sec

    This is even more disappointing since another version of the algorithm, which uses no library, got more than an order of magnitude improvement by using code like this:

    #pragma omp parallel for
             data = (Qfloat)(this->*kernel_function)(real_i,j);

    Possibly the benefit here derives from the parallel computation of the function.
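
    To illustrate that guess, here is a minimal sketch of a loop in which every iteration evaluates a comparatively expensive function. The Gaussian RBF kernel and the names rbf_kernel and kernel_row are assumptions made for the example; they are not the kernel_function from the code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical kernel: Gaussian RBF between two dense points of equal dimension.
double rbf_kernel(const std::vector<double>& a, const std::vector<double>& b, double gamma)
{
    assert(a.size() == b.size());
    double sq = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double d = a[k] - b[k];
        sq += d * d;
    }
    return std::exp(-gamma * sq);
}

// Fill one row of the kernel matrix: row[j] = K(points[i], points[j]).
// Iterations are independent, and each costs O(dimension) work plus an exp().
std::vector<double> kernel_row(const std::vector<std::vector<double> >& points,
                               std::size_t i, double gamma)
{
    assert(i < points.size());
    const long n = static_cast<long>(points.size());
    std::vector<double> row(points.size());
#pragma omp parallel for
    for (long j = 0; j < n; ++j)
        row[j] = rbf_kernel(points[i], points[j], gamma);
    return row;
}
```

    Here each iteration does work proportional to the point dimension for a single stored result, whereas an update such as b1 -= gb1 * LR does one multiply-subtract per stored element. That difference in work per memory access is one plausible reason the kernel loop scales better.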

    Should I conclude that I cannot expect any benefit from using OpenMP on simple array operations?

    Thank you

  • Karl Rupp

    Karl Rupp - 2013-05-07


    Which backend are you using for the CPU? OpenMP or OpenCL? Which vector sizes are you using?
    What is the type of LR? Did you define NDEBUG to get rid of all the assertions?

    The CPU/OpenMP backend is still fairly new in ViennaCL; it is currently more efficient for the sparse case than for the dense case. For 'small' workloads it is indeed hard to get good performance with OpenMP, because the thread startup overhead has to be amortized. As a rule of thumb, an operation should involve at least 10k-100k elementary operations to see any notable gain.
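
    One way to act on this rule of thumb in hand-written loops is the OpenMP if clause, which runs the loop on a single thread when its condition is false. The sketch below is an illustration only: scaled_subtract and kMinParallelWork are hypothetical names, and the cut-off of 10000 is an assumption taken from the lower end of the range above, not a measured value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed cut-off; the right value depends on the machine and should be measured.
const long kMinParallelWork = 10000;

// x -= alpha * g, run in parallel only when the vector is long enough
// for the thread start-up cost to be worth paying.
void scaled_subtract(std::vector<double>& x, const std::vector<double>& g, double alpha)
{
    assert(x.size() == g.size());
    const long n = static_cast<long>(x.size());
#pragma omp parallel for if (n >= kMinParallelWork)
    for (long i = 0; i < n; ++i)
        x[i] -= alpha * g[i];
}
```

    With this guard, a short vector update is executed by one thread, while long vectors still use all threads. Without OpenMP enabled the pragma is ignored and the function behaves as an ordinary sequential loop.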

    Best regards,

