
Is any version of ILU running fully on GPU?

2015-12-17 - 2016-01-19
  • Mohamed Aissa - 2015-12-17

    I tried many solvers on the GPU, and the bottleneck is the ILU preconditioner. For my work, the system matrix and right-hand side reside on the GPU and are computed with CUDA. Before starting any conversion from CUDA to OpenCL, I would like to make sure that at least the ILU version of Prof. Chow is running fully on the GPU.
    A side question: can I pass col, row, and value arrays as CUDA pointers to a ViennaCL compressed matrix?


    Last edit: Mohamed Aissa 2015-12-17
  • Karl Rupp - 2015-12-17

    Hi,
    you can use our parallel Chow-Patel ILU with OpenMP, CUDA, and OpenCL. In other words, you don't need to migrate from CUDA to OpenCL; you can just use your CUDA buffers. With both CUDA and OpenCL, the Chow-Patel ILU runs fully on the GPU.
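
    For reference, a minimal usage sketch of the preconditioner (not tested; A, b, and x are placeholder names, and the tag parameters follow the ViennaCL documentation):

      viennacl::linalg::chow_patel_tag chow_patel_config;
      chow_patel_config.sweeps(3);        // number of nonlinear sweeps of the factorization
      chow_patel_config.jacobi_iters(2);  // Jacobi iterations per triangular 'solve'

      viennacl::linalg::chow_patel_ilu_precond< viennacl::compressed_matrix<double> > chow_patel_ilu(A, chow_patel_config);
      x = viennacl::linalg::solve(A, b, viennacl::linalg::gmres_tag(), chow_patel_ilu);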

    If your col/row/value data is already in CSR format and you use OpenCL, you can use the provided constructor for OpenCL. With CUDA, I recommend copying the OpenCL-enabled constructor (compressed_matrix.hpp, around line 670) and adapting it to CUDA. The added code should look similar to the following (not tested):

      explicit compressed_matrix(unsigned int *mem_row_buffer, unsigned int *mem_col_buffer, NumericT *mem_elements,
                                 vcl_size_t rows, vcl_size_t cols, vcl_size_t nonzeros) :
        rows_(rows), cols_(cols), nonzeros_(nonzeros), row_block_num_(0)
      {
        row_buffer_.switch_active_handle_id(viennacl::CUDA_MEMORY);
        row_buffer_.cuda_handle() = mem_row_buffer;
        row_buffer_.cuda_handle().inc();             // prevents the user-provided memory from being deleted once the matrix object is destroyed
        row_buffer_.raw_size(sizeof(unsigned int) * (rows + 1));

        col_buffer_.switch_active_handle_id(viennacl::CUDA_MEMORY);
        col_buffer_.cuda_handle() = mem_col_buffer;
        col_buffer_.cuda_handle().inc();             // prevents the user-provided memory from being deleted once the matrix object is destroyed
        col_buffer_.raw_size(sizeof(unsigned int) * nonzeros);

        elements_.switch_active_handle_id(viennacl::CUDA_MEMORY);
        elements_.cuda_handle() = mem_elements;
        elements_.cuda_handle().inc();               // prevents the user-provided memory from being deleted once the matrix object is destroyed
        elements_.raw_size(sizeof(NumericT) * nonzeros);

        // generate block information for CSR-adaptive:
        generate_row_block_information();
      }
    

    If you replace 'unsigned int' with 'int', things should still work (for nonnegative indices the bit patterns are identical in two's complement).
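
    Equivalently, one could keep the unsigned int signature and cast at the call site (a hypothetical sketch, assuming the index buffers only hold nonnegative values; drow_offset, dcol, dval are placeholder device pointers):

      viennacl::compressed_matrix<double> A(reinterpret_cast<unsigned int*>(drow_offset),
                                            reinterpret_cast<unsigned int*>(dcol),
                                            dval, rows, cols, nonzeros);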

    Best regards,
    Karli

  • Mohamed Aissa - 2015-12-18

    Thanks for the prompt answer. I will test it.

  • Mohamed Aissa - 2015-12-21

    I used the constructor you suggested, with int instead of unsigned int. I call it this way:

      Matrix = new viennacl::compressed_matrix<double,4>(drow_offset, dcol, dval, NRow, NCol, num_entries);

    and the errors are of this kind:

      no operator "=" matches these operands
      operand types are: viennacl::backend::mem_handle::cuda_handle_type = int *

  • Karl Rupp - 2015-12-21

    Ah, a type mismatch. Please replace the respective lines of the form

    elements_.cuda_handle() = mem_elements;
    

    with

    elements_.cuda_handle().reset(reinterpret_cast<char*>(mem_elements));
    
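    Applied to all three buffers, the assignments in the constructor above become:

      row_buffer_.cuda_handle().reset(reinterpret_cast<char*>(mem_row_buffer));
      col_buffer_.cuda_handle().reset(reinterpret_cast<char*>(mem_col_buffer));
      elements_.cuda_handle().reset(reinterpret_cast<char*>(mem_elements));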
     
  • Mohamed Aissa - 2015-12-22

    I got a small code to work: it finds the correct solution of a 6x6 Ax=b system. However, the CUDA profiler nvvp shows some PCI-Express traffic in both directions during the solve. Is this intended (even though it is bad for performance), or is something wrong in the code below?

      #include <iostream>
      #include <cstdio>
      #include <cstdlib>
      #include <vector>
      #include <cuda.h>
      #include <cuda_runtime_api.h>
      #include "viennacl/compressed_matrix.hpp"
      #include "viennacl/vector.hpp"
      #include "viennacl/linalg/ilu.hpp"
      #include "viennacl/linalg/gmres.hpp"
      #include <unistd.h>

      int main(int argc, char* argv[]) {

        int NRow = 6, NCol = 6, num_entries = 9;

        // CSR row offsets
        int row_offset[7] = {0, 1, 3, 4, 6, 8, 9};
        // CSR column indices
        int col[9] = {0, 1, 2, 2, 2, 3, 4, 5, 5};
        // CSR values
        double val[9] = {2., 3., 6., 4., 5., 6.3, 1.5, 2.5, 4.5};
        // right-hand side
        double b[6] = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0};

        std::vector<double> stl_vec(6);

        // copy the CSR arrays and the right-hand side to the GPU
        int    *drow_offset = NULL;
        int    *dcol        = NULL;
        double *dval        = NULL;
        double *drhs        = NULL;
        double *dsol        = NULL;

        cudaMalloc((void**)&drow_offset, 7 * sizeof(int));
        cudaMalloc((void**)&dcol,        9 * sizeof(int));
        cudaMalloc((void**)&dval,        9 * sizeof(double));
        cudaMalloc((void**)&drhs,        6 * sizeof(double));
        cudaMalloc((void**)&dsol,        6 * sizeof(double));

        cudaMemcpy(drow_offset, row_offset, 7 * sizeof(int),    cudaMemcpyHostToDevice);
        cudaMemcpy(dcol,        col,        9 * sizeof(int),    cudaMemcpyHostToDevice);
        cudaMemcpy(dval,        val,        9 * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(drhs,        b,          6 * sizeof(double), cudaMemcpyHostToDevice);

        // wrap the existing CUDA buffers without copying
        viennacl::compressed_matrix<double> Matrix;
        Matrix = viennacl::compressed_matrix<double>(drow_offset, dcol, dval, NRow, NCol, num_entries);
        viennacl::vector_base<double> Vector(drhs, viennacl::CUDA_MEMORY, NRow, 0, 1);
        viennacl::vector_base<double> Solution(dsol, viennacl::CUDA_MEMORY, NRow, 0, 1);

        // for profiling: separate the solve in the timeline
        usleep(1000 * 1000);

        //viennacl::linalg::chow_patel_ilu_precond< viennacl::compressed_matrix<double> > vcl_ilu0(Matrix, viennacl::linalg::chow_patel_tag());
        Solution = viennacl::linalg::solve(Matrix, Vector, viennacl::linalg::gmres_tag(1e-6, 20));

        usleep(1000 * 1000);

        viennacl::copy(Solution.begin(), Solution.end(), stl_vec.begin());
        for (int i = 0; i < 6; ++i) printf("%f \t", stl_vec[i]);
        printf("\n");
        return 0;
      }

  • Karl Rupp - 2016-01-19

    Sorry, SourceForge silently put your post into the moderation queue.
    Yes, PCI-Express communication in both directions is required for iterative solvers: at the very least, the host has to check for convergence in each iteration.
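
    A minimal sketch of why (placeholder names A, x, b, tol): the residual norm is computed on the GPU, but the resulting scalar has to cross PCI-Express so that the host-side loop can test it:

      viennacl::vector<double> r = viennacl::linalg::prod(A, x);  // A*x, computed on the GPU
      r = b - r;                                                  // residual, still on the GPU
      double res_norm = viennacl::linalg::norm_2(r);              // GPU reduction; the scalar crosses PCI-Express to the host
      if (res_norm < tol) { /* host-side convergence check */ }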
