
ViennaCL spMxV benchmarking

  • Peter Aaltonen

    Peter Aaltonen - 2016-06-13

    Hi! I've been trying to run some benchmarks using ViennaCL's linalg::prod method for sparse matrix-vector multiplication, but the performance I am getting so far has been a bit unexpected compared to the figures under the "Benchmark" section:

    spMV:

    webbase1M
    Throughput: 1.941 GFlops
    Bandwidth: 15.531 GB/s

    protein(pdb1HYS):
    Throughput: 2.028 GFlops
    Bandwidth: 12.168 GB/s

    The "throughput" results I get here are significantly lower than the results from here:
    http://viennacl.sourceforge.net/viennacl-benchmark-spmv.html

    I am using a Tesla K40c card with 32 GB of system memory on Ubuntu 14.04. The graphics card should be at least as powerful as the K20m used in the benchmark on the page above.

    The code I used to measure throughput is below; it is adapted from the example given for linalg::prod. I compute the "throughput" from the number of non-zeros in the sparse matrix and the time it takes for 1000 runs to complete, and I am using the compressed_matrix format, which is the fastest format in both graphs of the official benchmark.

    #include <cstdio>
    #include <iostream>
    #include <string>
    
    // let ViennaCL know about Boost.uBLAS types (as in the ViennaCL examples)
    #define VIENNACL_WITH_UBLAS 1
    
    #include <boost/numeric/ublas/matrix_sparse.hpp>
    
    #include "viennacl/compressed_matrix.hpp"
    #include "viennacl/vector.hpp"
    #include "viennacl/linalg/prod.hpp"
    #include "viennacl/io/matrix_market.hpp"
    #include "viennacl/tools/timer.hpp"
    
    #define iterations 1000
    
    template<typename ScalarType>
    void run_benchmark(const std::string mat_file)
    {
      viennacl::tools::timer timer;
      double exec_time;
    
      // read the matrix market file into a uBLAS sparse matrix on the host
      boost::numeric::ublas::compressed_matrix<ScalarType> ublas_matrix;
      if (!viennacl::io::read_matrix_market_file(ublas_matrix, mat_file))
      {
        std::cout << "Error reading Matrix file" << std::endl;
        return;
      }
      std::cout << "done reading matrix" << std::endl;
    
      viennacl::compressed_matrix<ScalarType> A;
      viennacl::vector<ScalarType> vect(ublas_matrix.size1());
      // element-wise initialization (one transfer per entry, but outside the timed loop)
      for (std::size_t i = 0; i < ublas_matrix.size1(); ++i)
        vect[i] = 1;
    
      // cpu to gpu:
      viennacl::copy(ublas_matrix, A);
    
      viennacl::vector<ScalarType> C(ublas_matrix.size1());
    
      timer.start();
      for (int i = 0; i < iterations; ++i)
      {
        C = viennacl::linalg::prod(A, vect);
      }
      viennacl::backend::finish();  // wait for outstanding device work before reading the timer
      exec_time = timer.get();
      double elapsed = exec_time;
    
      // 2 flops (multiply + add) per nonzero; bytes per nonzero: matrix value, vector value, column index
      double throughput = (double)iterations * A.nnz() / elapsed;
      double bandwidth = (2 * sizeof(ScalarType) + sizeof(int)) * throughput;
      printf("%30s\t %9lu NZ\t %8.3lf GFlops\t %8.3lf GB/s\t elapsed time %fs\n",
              mat_file.c_str(),
              (unsigned long)A.nnz(), 2 * throughput / 1.0e9, bandwidth / 1.0e9, elapsed);
    }
    
    int main(int argc, char ** argv)
    {
      if (argc < 2)
      {
        std::cout << "Usage: " << argv[0] << " <matrix-market-file>" << std::endl;
        return 1;
      }
      std::string filename(argv[1]);
      std::cout << "float" << std::endl;
      run_benchmark<float>(filename);
    
      return 0;
    }
    

    Is there something that I am doing incorrectly here? The throughput measurement code is from NVlabs' moderngpu package.

    Thank you!

     

    Last edit: Peter Aaltonen 2016-06-13
  • Karl Rupp

    Karl Rupp - 2016-06-13

    Hi,

    from your code snippet I suspect that you're not using the GPU backends. Did you define either VIENNACL_WITH_CUDA or VIENNACL_WITH_OPENCL and link accordingly?
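
    For reference (paths and file names below are placeholders, not your exact build), compiling a ViennaCL program against the two GPU backends typically looks like this:

    # CUDA backend: compile the source as a .cu file with nvcc
    nvcc mxv.cu -I/path/to/ViennaCL-1.7.1 -O3 -DVIENNACL_WITH_CUDA -o mxv-cuda
    
    # OpenCL backend: plain host compiler, link against the OpenCL library
    g++ mxv.cpp -I/path/to/ViennaCL-1.7.1 -O3 -DVIENNACL_WITH_OPENCL -lOpenCL -o mxv-opencl
    

    With the OpenCL backend you can also print viennacl::ocl::current_device().name() at runtime to double-check which device actually executes the kernels.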

    Best regards,
    Karli

     
    • Peter Aaltonen

      Peter Aaltonen - 2016-06-14

      Hi Karl,

      Thanks for the reply. I tried compiling with VIENNACL_WITH_CUDA=1, and also tried enabling the ENABLE_CUDA flag in CMakeCache.txt (which does get the CUDA backend compiled), and in either case I am getting about the same throughput and bandwidth numbers as shown above.

      Linking with OpenCL:

      $ ldd libviennacl.so                    
      ./libviennacl.so: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by ./libviennacl.so)
          libOpenCL.so.1 => /usr/local/cuda/lib64/libOpenCL.so.1 (0x00007ff125486000)
          ...
      
      $ ldd mxv
       ...
              libviennacl.so => ../ViennaCL-1.7.1/build/libviennacl/libviennacl.so (0x00007fae644ac000)
              libOpenCL.so.1 => /usr/local/cuda/lib64/libOpenCL.so.1 (0x00007fae642a6000)
      
      $ mxv webbase-1M.mtx
      ./mxv: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available
      
          webbase-1M.mtx     3105536 NZ       1.151 GFlops            6.906 GB/s   elapsed time 5.395928s
      

      Linking with CUDA:

      $ ldd libviennacl.so
      ...
          libcudart.so.7.5 => /usr/local/cuda/lib64/libcudart.so.7.5 (0x00007fdfc3e1e000)
      
      $ ldd mxv-cuda
       ...
          libcudart.so.7.5 => /usr/local/cuda/lib64/libcudart.so.7.5 (0x00007f9861b0f000)
      
      $ mxv-cuda webbase-1M.mtx
       webbase-1M.mtx          3105536 NZ       1.121 GFlops            6.726 GB/s   elapsed time 5.540465s
      

      I tested the same code on a different machine with a slightly less powerful graphics card and still got comparable results.

      Thanks in advance!

       

      Last edit: Peter Aaltonen 2016-06-14
  • Karl Rupp

    Karl Rupp - 2016-06-15

    Hi,

    I just reconfirmed that the results we get for the webbase-1M matrix are 11.5 GFLOPs when using ViennaCL with CUDA on a K40. To reproduce, compile the attached code with

    nvcc viennacl.cu -I/path/to/viennacl -o viennacl_cuda -O3 -DVIENNACL_WITH_CUDA -arch=sm_20
    

    (after adjusting paths) and run the executable, passing the matrix market file name as an argument. The main difference from your code is that it picks the median of the per-iteration execution times rather than averaging over all iterations. If you don't define VIENNACL_WITH_CUDA, the benchmark runs on the CPU and the results are in the 1 GFLOP regime (hence my suspicion that you didn't pass that flag correctly).
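
    In case it is useful, here is a minimal sketch of that timing scheme (this is not the attached file; it reuses the variable names from your snippet and assumes <vector> and <algorithm> are included):

    std::vector<double> times(iterations);
    viennacl::backend::finish();              // make sure setup transfers are finished
    for (int i = 0; i < iterations; ++i)
    {
      timer.start();
      C = viennacl::linalg::prod(A, vect);
      viennacl::backend::finish();            // wait for the kernel before stopping the clock
      times[i] = timer.get();
    }
    std::sort(times.begin(), times.end());
    double median_time = times[iterations / 2];
    double gflops = 2.0 * A.nnz() / median_time / 1.0e9;   // 1 multiply + 1 add per nonzero
    

    Taking the median also keeps the first iteration, which includes backend initialization and (for OpenCL) kernel compilation, from skewing the result.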

    Best regards,
    Karli

     
    • Peter Aaltonen

      Peter Aaltonen - 2016-06-15

      Hi Karl,

      I was able to replicate the results using your code. Thanks a lot for the help! I must've used incorrect compilation parameters the first time.

      Regards,
      Peter

       
