
ViennaCL spMxV benchmarking

  • Peter Aaltonen

    Peter Aaltonen - 2016-06-13

    Hi! I've been trying to run some benchmarks using ViennaCL's linalg::prod method for sparse matrix-vector multiplication, but the performance I am getting so far has been a bit unexpected compared to the figures under the "Benchmark" section:

    spMV:

    webbase1M
    Throughput: 1.941 GFlops
    Bandwidth: 15.531 GB/s

    protein(pdb1HYS):
    Throughput: 2.028 GFlops
    Bandwidth: 12.168 GB/s

    The "throughput" results I get here are significantly lower than the results from here:
    http://viennacl.sourceforge.net/viennacl-benchmark-spmv.html

    I am using a Tesla K40c card with 32 GB of system memory on Ubuntu 14.04. The graphics card should be at least as powerful as the K20m used in the benchmark on the page above.

    The code I used to measure throughput is below; it is adapted from the example given for linalg::prod. I compute the "throughput" from the number of non-zeros in the sparse matrix and the time it takes for 1000 runs to complete, and I am using the compressed_matrix format, which is the fastest format in both graphs of the official benchmark.

    #include <cstdio>
    #include <iostream>
    #include <string>
    
    // let ViennaCL know about Boost.uBLAS types (as in the ViennaCL examples)
    #define VIENNACL_WITH_UBLAS 1
    
    #include <boost/numeric/ublas/matrix_sparse.hpp>
    
    #include "viennacl/compressed_matrix.hpp"
    #include "viennacl/vector.hpp"
    #include "viennacl/linalg/prod.hpp"
    #include "viennacl/io/matrix_market.hpp"
    #include "viennacl/tools/timer.hpp"
    
    #define iterations 1000
    
    template<typename ScalarType>
    void run_benchmark(const std::string mat_file)
    {
      viennacl::tools::timer timer;
      double exec_time;
    
      // read the matrix market file into a uBLAS sparse matrix on the host
      boost::numeric::ublas::compressed_matrix<ScalarType> ublas_matrix;
      if (!viennacl::io::read_matrix_market_file(ublas_matrix, mat_file))
      {
        std::cout << "Error reading Matrix file" << std::endl;
        return;
      }
      std::cout << "done reading matrix" << std::endl;
    
      viennacl::compressed_matrix<ScalarType> A;
      viennacl::vector<ScalarType> vect(ublas_matrix.size1());
      // element-wise initialization (one transfer per entry, but outside the timed loop)
      for (std::size_t i = 0; i < ublas_matrix.size1(); ++i)
        vect[i] = 1;
    
      // cpu to gpu:
      viennacl::copy(ublas_matrix, A);
    
      viennacl::vector<ScalarType> C(ublas_matrix.size1());
    
      timer.start();
      for (int i = 0; i < iterations; ++i)
      {
        C = viennacl::linalg::prod(A, vect);
      }
      viennacl::backend::finish();  // wait for outstanding device work before reading the timer
      exec_time = timer.get();
      double elapsed = exec_time;
    
      // 2 flops (multiply + add) per nonzero; bytes per nonzero: matrix value, vector value, column index
      double throughput = (double)iterations * A.nnz() / elapsed;
      double bandwidth = (2 * sizeof(ScalarType) + sizeof(int)) * throughput;
      printf("%30s\t %9lu NZ\t %8.3lf GFlops\t %8.3lf GB/s\t elapsed time %fs\n",
              mat_file.c_str(),
              (unsigned long)A.nnz(), 2 * throughput / 1.0e9, bandwidth / 1.0e9, elapsed);
    }
    
    int main(int argc, char ** argv)
    {
      if (argc < 2)
      {
        std::cout << "Usage: " << argv[0] << " <matrix-market-file>" << std::endl;
        return 1;
      }
      std::string filename(argv[1]);
      std::cout << "float" << std::endl;
      run_benchmark<float>(filename);
    
      return 0;
    }
    

    Is there something that I am doing incorrectly here? The throughput measurement code is from NVlabs' moderngpu package.

    Thank you!

     

    Last edit: Peter Aaltonen 2016-06-13
  • Karl Rupp

    Karl Rupp - 2016-06-13

    Hi,

    from your code snippet I suspect that you're not using the GPU backends. Did you define either VIENNACL_WITH_CUDA or VIENNACL_WITH_OPENCL and link accordingly?
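
    For reference (paths and file names below are placeholders, not your exact build), compiling a ViennaCL program against the two GPU backends typically looks like this:

    # CUDA backend: compile the source as a .cu file with nvcc
    nvcc mxv.cu -I/path/to/ViennaCL-1.7.1 -O3 -DVIENNACL_WITH_CUDA -o mxv-cuda
    
    # OpenCL backend: plain host compiler, link against the OpenCL library
    g++ mxv.cpp -I/path/to/ViennaCL-1.7.1 -O3 -DVIENNACL_WITH_OPENCL -lOpenCL -o mxv-opencl
    

    With the OpenCL backend you can also print viennacl::ocl::current_device().name() at runtime to double-check which device actually executes the kernels.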

    Best regards,
    Karli

     
    • Peter Aaltonen

      Peter Aaltonen - 2016-06-14

      Hi Karl,

      Thanks for the reply. I tried compiling with VIENNACL_WITH_CUDA=1, and also tried enabling the ENABLE_CUDA flag in CMakeCache.txt (which does get the CUDA backend compiled), and in either case I am getting about the same throughput and bandwidth numbers as shown above.

      Linking with OpenCL:

      $ ldd libviennacl.so                    
      ./libviennacl.so: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available (required by ./libviennacl.so)
          libOpenCL.so.1 => /usr/local/cuda/lib64/libOpenCL.so.1 (0x00007ff125486000)
          ...
      
      $ ldd mxv
       ...
              libviennacl.so => ../ViennaCL-1.7.1/build/libviennacl/libviennacl.so (0x00007fae644ac000)
              libOpenCL.so.1 => /usr/local/cuda/lib64/libOpenCL.so.1 (0x00007fae642a6000)
      
      $ mxv webbase-1M.mtx
      ./mxv: /usr/local/cuda/lib64/libOpenCL.so.1: no version information available
      
          webbase-1M.mtx     3105536 NZ       1.151 GFlops            6.906 GB/s   elapsed time 5.395928s
      

      Linking with CUDA:

      $ ldd libviennacl.so
      ...
          libcudart.so.7.5 => /usr/local/cuda/lib64/libcudart.so.7.5 (0x00007fdfc3e1e000)
      
      $ ldd mxv-cuda
       ...
          libcudart.so.7.5 => /usr/local/cuda/lib64/libcudart.so.7.5 (0x00007f9861b0f000)
      
      $ mxv-cuda webbase-1M.mtx
       webbase-1M.mtx          3105536 NZ       1.121 GFlops            6.726 GB/s   elapsed time 5.540465s
      

      I tested the same code on a different machine with a slightly less powerful graphics card and still got comparable results.

      Thanks in advance!

       

      Last edit: Peter Aaltonen 2016-06-14
  • Karl Rupp

    Karl Rupp - 2016-06-15

    Hi,

    I just reconfirmed that the results we get for the webbase-1M matrix are 11.5 GFLOPs when using ViennaCL with CUDA on a K40. To reproduce, compile the attached code with

    nvcc viennacl.cu -I/path/to/viennacl -o viennacl_cuda -O3 -DVIENNACL_WITH_CUDA -arch=sm_20
    

    (after adjusting paths) and run the executable, passing the matrix market file name as an argument. The main difference from your code is that it picks the median of the per-iteration execution times rather than averaging over all iterations. If you don't define VIENNACL_WITH_CUDA, the benchmark runs on the CPU and the results are in the 1 GFLOP regime (hence my suspicion that you didn't pass that flag correctly).
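
    In case it is useful, here is a minimal sketch of that timing scheme (this is not the attached file; it reuses the variable names from your snippet and assumes <vector> and <algorithm> are included):

    std::vector<double> times(iterations);
    viennacl::backend::finish();              // make sure setup transfers are finished
    for (int i = 0; i < iterations; ++i)
    {
      timer.start();
      C = viennacl::linalg::prod(A, vect);
      viennacl::backend::finish();            // wait for the kernel before stopping the clock
      times[i] = timer.get();
    }
    std::sort(times.begin(), times.end());
    double median_time = times[iterations / 2];
    double gflops = 2.0 * A.nnz() / median_time / 1.0e9;   // 1 multiply + 1 add per nonzero
    

    Taking the median also keeps the first iteration, which includes backend initialization and (for OpenCL) kernel compilation, from skewing the result.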

    Best regards,
    Karli

     
    • Peter Aaltonen

      Peter Aaltonen - 2016-06-15

      Hi Karl,

      I was able to replicate the results using your code. Thanks a lot for the help! I must've used incorrect compilation parameters the first time.

      Regards,
      Peter

       
