
Is any version of ILU running fully on GPU?

2015-12-17 - 2016-01-19
  • Mohamed Aissa - 2015-12-17

    I tried many solvers on the GPU, and the bottleneck is the ILU preconditioner. For my work, the system matrix and right-hand side reside on the GPU and are computed with CUDA. Before starting any conversion from CUDA to OpenCL, I would like to make sure that at least the ILU version of Prof. Chow is running fully on the GPU.
    A side question: can I pass col, row, and value arrays as CUDA pointers to a ViennaCL compressed matrix?


    Last edit: Mohamed Aissa 2015-12-17
  • Karl Rupp - 2015-12-17

    Hi,
    you can use our parallel Chow-Patel ILU with OpenMP, CUDA, and OpenCL. In other words, you don't need to migrate from CUDA to OpenCL; you can just use your CUDA buffers. With both CUDA and OpenCL, the Chow-Patel ILU runs fully on the GPU.
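
    For reference, a minimal usage sketch of the preconditioner (not tested; A, b, and x are placeholder names, and the tag parameters follow the ViennaCL documentation):

      viennacl::linalg::chow_patel_tag chow_patel_config;
      chow_patel_config.sweeps(3);        // number of nonlinear sweeps of the factorization
      chow_patel_config.jacobi_iters(2);  // Jacobi iterations per triangular 'solve'

      viennacl::linalg::chow_patel_ilu_precond< viennacl::compressed_matrix<double> > chow_patel_ilu(A, chow_patel_config);
      x = viennacl::linalg::solve(A, b, viennacl::linalg::gmres_tag(), chow_patel_ilu);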

    If your col/row/value data is already in CSR format and you use OpenCL, you can use the provided constructor for OpenCL. With CUDA, I recommend copying the OpenCL-enabled constructor (compressed_matrix.hpp, around line 670) and adapting it to CUDA. The added code should look similar to the following (not tested):

      explicit compressed_matrix(unsigned int *mem_row_buffer, unsigned int *mem_col_buffer, NumericT *mem_elements,
                                 vcl_size_t rows, vcl_size_t cols, vcl_size_t nonzeros) :
        rows_(rows), cols_(cols), nonzeros_(nonzeros), row_block_num_(0)
      {
        row_buffer_.switch_active_handle_id(viennacl::CUDA_MEMORY);
        row_buffer_.cuda_handle() = mem_row_buffer;
        row_buffer_.cuda_handle().inc();             // prevents the user-provided memory from being deleted once the matrix object is destroyed
        row_buffer_.raw_size(sizeof(unsigned int) * (rows + 1));

        col_buffer_.switch_active_handle_id(viennacl::CUDA_MEMORY);
        col_buffer_.cuda_handle() = mem_col_buffer;
        col_buffer_.cuda_handle().inc();             // prevents the user-provided memory from being deleted once the matrix object is destroyed
        col_buffer_.raw_size(sizeof(unsigned int) * nonzeros);

        elements_.switch_active_handle_id(viennacl::CUDA_MEMORY);
        elements_.cuda_handle() = mem_elements;
        elements_.cuda_handle().inc();               // prevents the user-provided memory from being deleted once the matrix object is destroyed
        elements_.raw_size(sizeof(NumericT) * nonzeros);

        // generate block information for CSR-adaptive:
        generate_row_block_information();
      }
    

    If you replace 'unsigned int' with 'int', things should still work (for nonnegative indices the bit patterns are identical in two's complement).
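
    Equivalently, one could keep the unsigned int signature and cast at the call site (a hypothetical sketch, assuming the index buffers only hold nonnegative values; drow_offset, dcol, dval are placeholder device pointers):

      viennacl::compressed_matrix<double> A(reinterpret_cast<unsigned int*>(drow_offset),
                                            reinterpret_cast<unsigned int*>(dcol),
                                            dval, rows, cols, nonzeros);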

    Best regards,
    Karli

  • Mohamed Aissa - 2015-12-18

    Thanks for the prompt answer. I will test it.

  • Mohamed Aissa - 2015-12-21

    I used the constructor you suggested, with int instead of unsigned int. I call it this way:

      Matrix = new viennacl::compressed_matrix<double,4>(drow_offset, dcol, dval, NRow, NCol, num_entries);

    and the errors are of this kind:

      no operator "=" matches these operands
      operand types are: viennacl::backend::mem_handle::cuda_handle_type = int *

  • Karl Rupp - 2015-12-21

    Ah, a type mismatch. Please replace the respective lines of the form

    elements_.cuda_handle() = mem_elements;
    

    with

    elements_.cuda_handle().reset(reinterpret_cast<char*>(mem_elements));
    
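    Applied to all three buffers, the assignments in the constructor above become:

      row_buffer_.cuda_handle().reset(reinterpret_cast<char*>(mem_row_buffer));
      col_buffer_.cuda_handle().reset(reinterpret_cast<char*>(mem_col_buffer));
      elements_.cuda_handle().reset(reinterpret_cast<char*>(mem_elements));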
     
  • Mohamed Aissa - 2015-12-22

    I got a small code to work: it finds the correct solution of a 6x6 Ax=b system. However, the CUDA profiler nvvp shows some PCI-Express traffic in both directions during the solve. Is this intended (even though it is bad for performance), or is something wrong in the code below?

      #include <iostream>
      #include <cstdio>
      #include <cstdlib>
      #include <vector>
      #include <cuda.h>
      #include <cuda_runtime_api.h>
      #include "viennacl/compressed_matrix.hpp"
      #include "viennacl/vector.hpp"
      #include "viennacl/linalg/ilu.hpp"
      #include "viennacl/linalg/gmres.hpp"
      #include <unistd.h>

      int main(int argc, char* argv[]) {

        int NRow = 6, NCol = 6, num_entries = 9;

        // CSR row offsets
        int row_offset[7] = {0, 1, 3, 4, 6, 8, 9};
        // CSR column indices
        int col[9] = {0, 1, 2, 2, 2, 3, 4, 5, 5};
        // CSR values
        double val[9] = {2., 3., 6., 4., 5., 6.3, 1.5, 2.5, 4.5};
        // right-hand side
        double b[6] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0};

        std::vector<double> stl_vec(6);

        // copy the CSR arrays and the right-hand side to the GPU
        int    *drow_offset = NULL;
        int    *dcol        = NULL;
        double *dval        = NULL;
        double *drhs        = NULL;
        double *dsol        = NULL;

        cudaMalloc((void**)&drow_offset, 7 * sizeof(int));
        cudaMalloc((void**)&dcol,        9 * sizeof(int));
        cudaMalloc((void**)&dval,        9 * sizeof(double));
        cudaMalloc((void**)&drhs,        6 * sizeof(double));
        cudaMalloc((void**)&dsol,        6 * sizeof(double));

        cudaMemcpy(drow_offset, row_offset, 7 * sizeof(int),    cudaMemcpyHostToDevice);
        cudaMemcpy(dcol,        col,        9 * sizeof(int),    cudaMemcpyHostToDevice);
        cudaMemcpy(dval,        val,        9 * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(drhs,        b,          6 * sizeof(double), cudaMemcpyHostToDevice);

        // wrap the existing CUDA buffers without copying
        viennacl::compressed_matrix<double> Matrix;
        Matrix = viennacl::compressed_matrix<double>(drow_offset, dcol, dval, NRow, NCol, num_entries);
        viennacl::vector_base<double> Vector(drhs, viennacl::CUDA_MEMORY, NRow, 0, 1);
        viennacl::vector_base<double> Solution(dsol, viennacl::CUDA_MEMORY, NRow, 0, 1);

        // for profiling: separate the solve in the timeline
        usleep(1000 * 1000);

        //viennacl::linalg::chow_patel_ilu_precond< viennacl::compressed_matrix<double> > vcl_ilu0(Matrix, viennacl::linalg::chow_patel_tag());
        Solution = viennacl::linalg::solve(Matrix, Vector, viennacl::linalg::gmres_tag(1e-6, 20));

        usleep(1000 * 1000);

        viennacl::copy(Solution.begin(), Solution.end(), stl_vec.begin());
        for (int i = 0; i < 6; ++i) printf("%f \t", stl_vec[i]);
        printf("\n");
        return 0;
      }

  • Karl Rupp - 2016-01-19

    Sorry, SourceForge silently put your post into the moderation queue.
    Yes, PCI-Express communication in both directions is required for iterative solvers: at the very least, the host has to check for convergence in each iteration.
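
    A minimal sketch of why (placeholder names A, x, b, tol): the residual norm is computed on the GPU, but the resulting scalar has to cross PCI-Express so that the host-side loop can test it:

      viennacl::vector<double> r = viennacl::linalg::prod(A, x);  // A*x, computed on the GPU
      r = b - r;                                                  // residual, still on the GPU
      double res_norm = viennacl::linalg::norm_2(r);              // GPU reduction; the scalar crosses PCI-Express to the host
      if (res_norm < tol) { /* host-side convergence check */ }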
