
custom kernel writing to a sparse matrix

  • suchismit

    suchismit - 2014-11-16

    hi folks,
    i was wondering if it is possible to write a custom opencl-based kernel which can write to a sparse viennacl/eigen matrix?

    i spent some time thinking about it and it seems improbable: kernel-based parallelism relies on the value of each cell of the output matrix being calculated independently of the others, while a sparse matrix maintains 3 lists of information for the nonzero entries, which makes the individual element updates dependent on each other (not the calculation of the individual entries itself, but writing them back to the output matrix, i.e. updating the 3 lists). :(
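
    for concreteness, this is what i mean by the 3 lists (a tiny 3-by-3 example, assuming the usual compressed sparse row (CSR) layout):

    // the sparse matrix
    //     [ 1 0 2 ]
    //     [ 0 0 3 ]
    //     [ 4 5 0 ]
    // is stored as the 3 arrays
    // values      = { 1, 2, 3, 4, 5 }   // the nonzero entries, row by row
    // col_indices = { 0, 2, 2, 0, 1 }   // the column of each nonzero
    // row_start   = { 0, 2, 3, 5 }      // row i occupies [row_start[i], row_start[i+1])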

    Question. what do i want to do exactly?
    Answer. calculate the entries of a million-by-million covariance matrix using kernel-based parallelism. problem :- storing a million-squared floats or doubles is out of the question (a dense million-by-million matrix of doubles would take 8 TB), hence sparse matrices. a million on each side is a typical value and it can be much more.

    i already have an implementation (see below) which does the calculation using eigen triplets and subsequently populates the sparse matrix, but it takes a longggg time. :(

    #include <Eigen/Sparse>           // Eigen::SparseMatrix, Eigen::Triplet
    #include <viennacl/linalg/cg.hpp> // viennacl::linalg::cg_tag / solve
    #include <vector>
    
    using namespace Eigen;
    
    // T is the scalar type (float or double)
    typedef SparseMatrix<T, Eigen::RowMajor> SMat;
    typedef Triplet<T> EigTriple;
    
    #define EPSILON 0.0000000001
    ...
    
    std::vector<EigTriple> triples;
    SMat Z(N, N);
    ...
    
    // collect a (row, column, value) triplet for every entry above the threshold
    for (size_t rindex = 0; rindex < N; rindex++) {
        for (size_t cindex = rindex; cindex < N; cindex++) {
            ...
            if (kval >= EPSILON) {
                triples.push_back(EigTriple(rindex, cindex, kval));
            }
            ...
        }
    }
    
    // build the sparse matrix in one pass from the collected triplets
    Z.setFromTriplets(triples.begin(), triples.end());
    triples.clear();
    ...
    
    viennacl::linalg::cg_tag cg(CG_TOLERANCE, CG_MAX_ITERATIONS);
    X = viennacl::linalg::solve(Z, YN, cg);
    

    Question. why do i want the covariance matrix?
    Answer. computing the covariance matrix is the first step in a series of calculations to solve a gaussian-process-based estimation problem.

    Question. can i control the level of sparsity of the covariance matrix?
    Answer. yes, there are hyperparameters involved which control the level of sparsity of the covariance matrix. densities of 0.1%, 1%, 5%, or 10% are all achievable through an appropriate choice of hyperparameters.

    please do comment and chip in. :)
    all suggestions welcome.

    warm regards,
    suchismit

     

    Last edit: suchismit 2014-11-16
  • Karl Rupp

    Karl Rupp - 2014-11-16

    Hi Suchismit,

    populating a sparse matrix on the GPU typically involves two steps: The first step is to determine the sparsity pattern, from which the respective arrays are allocated. The second step is to write the actual values into them. From your code snippet it seems that you can determine the sparsity pattern fairly easily (I assume you have some additional logic to avoid populating the full matrix). You can push such information to a viennacl::compressed_matrix<> by using the set() member function, which is also used in the viennacl::copy() routines to e.g. copy the data from an Eigen sparse matrix to a viennacl::compressed_matrix<>.
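
    A minimal sketch of that push step, assuming the set() overload that takes the three CSR arrays together with the dimensions and the nonzero count (N and nnz are placeholders for your matrix size and number of nonzeros):

    #include <viennacl/compressed_matrix.hpp>
    #include <vector>
    
    // the three CSR arrays, assembled on the host from the sparsity pattern
    std::vector<unsigned int> row_jumper(N + 1); // offset of each row's first nonzero
    std::vector<unsigned int> col_buffer(nnz);   // column index of each nonzero
    std::vector<double>       elements(nnz);     // the nonzero values themselves
    // ... fill the three arrays ...
    
    viennacl::compressed_matrix<double> vcl_K;
    vcl_K.set(&row_jumper[0], &col_buffer[0], &elements[0], N, N, nnz);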

    Another question is whether you need an assembled matrix at all. If you only need to be able to compute matrix-vector products, it may be sufficient to provide a routine which takes a vector and computes the result of the matrix-vector product without ever explicitly setting up a sparse matrix. If that is the case, this is the recommended method, because it is not only easier, but often also more efficient in terms of memory consumption. We don't have an example for such so-called 'matrix-free' methods in ViennaCL yet, but this will be addressed soon: https://github.com/viennacl/viennacl-dev/issues/74
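
    To illustrate the idea, here is a plain CPU sketch of such an operator (not wired into ViennaCL's solvers, since that integration is exactly what the issue above is about); it stores only the data matrix X and recomputes the covariance entries on the fly for every product:

    #include <cmath>
    #include <cstddef>
    #include <vector>
    
    struct CovarianceOperator
    {
        const std::vector<double> &X; // N-by-M data matrix, row-major
        std::size_t N, M;
        double ls, threshold;
    
        // y = K * x without ever storing K;
        // construct via e.g. CovarianceOperator op{X, N, M, ls, threshold};
        void apply(const std::vector<double> &x, std::vector<double> &y) const
        {
            for (std::size_t i = 0; i < N; ++i)
            {
                double sum = 0;
                for (std::size_t j = 0; j < N; ++j)
                {
                    double d2 = 0;
                    for (std::size_t m = 0; m < M; ++m)
                    {
                        double d = X[i*M + m] - X[j*M + m];
                        d2 += d * d;
                    }
                    double f = std::exp(-d2 / (2.0 * ls * ls));
                    if (f >= threshold)   // skip entries below the threshold
                        sum += f * x[j];
                }
                y[i] = sum;
            }
        }
    };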

    Best regards,
    Karli

     
  • suchismit

    suchismit - 2014-11-16

    hi karli,
    let me elaborate a little more on my requirement.
    i want to calculate the covariance matrix in a parallelized manner using device kernel code.

    suppose N = 1 million (# of data points)
    suppose M = 3 (# of features associated with each data point)
    
    X = data matrix (size N-by-M where each row is a data point)
    K = covariance matrix (size N-by-N)
    where K(i,j) = exp( -||X(i,:) - X(j,:)||^2 / (2*ls*ls) )  ----- (1)
    (the squared-exponential kernel with length scale ls)
    where X(i,:) and X(j,:) are the ith and jth data points.
    
    to calculate K using (1) we can use the following pseudo code,
    
    for i = 1 to N
        for j = 1 to N
            f = exp( -||X(i,:) - X(j,:)||^2 / (2*ls*ls) )
    
            if (f >= some threshold)
                K(i,j) = f
            else
                do not assign K(i,j), i.e. K(i,j) = 0
            end if
        end for
    end for
    

    two things to note regarding the above pseudo code :-
    i) the calculation of K(i,j) does not depend on the other entries and is thus highly parallelizable. hence the question: can we use some custom kernel code to do it on the gpu?
    
    ii) K can be really sparse, since the degree of sparsity is controlled by the threshold and ls values. we cannot expect to store K as a dense matrix when N >= a million.
    
    however, there is no existing sparse matrix operation that can be used to calculate the K(i,j); every element needs to be calculated individually.

    one more thing to point out here: if i can calculate K using kernel code on the gpu device, i do not need to copy K back to the host, since K will be used in a subsequent cg solve done with viennacl on the gpu device. :)
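
    something like the following, assuming K ends up in a viennacl::compressed_matrix so that the cg solve can consume it directly on the device (the tolerance and iteration count are placeholders):

    #include <viennacl/compressed_matrix.hpp>
    #include <viennacl/vector.hpp>
    #include <viennacl/linalg/cg.hpp>
    
    viennacl::compressed_matrix<double> vcl_K; // filled on the device, never copied back
    viennacl::vector<double> vcl_y(N);         // right-hand side, also on the device
    
    // the matrix-vector products inside cg and the result all stay on the gpu
    viennacl::vector<double> vcl_x =
        viennacl::linalg::solve(vcl_K, vcl_y, viennacl::linalg::cg_tag(1e-8, 1000));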

    warm regards,
    suchismit

     
  • Karl Rupp

    Karl Rupp - 2014-11-17

    Hi,
    two notes even before considering any GPU-acceleration:
    1.) Your computational complexity is currently N^2. Consider using data structures like oct-trees or kd-trees so that you don't have to deal with any-to-any comparisons for computing 'f' (see the sketch after the second note below).
    2.) Even if you want to stay with N^2 complexity, you can substantially reduce the computational effort by rearranging

    if (f >= some threshold)
    

    such that you don't have to evaluate the exponential for f at all. You can equivalently rewrite this into the much cheaper
    
    if ( ||X(i,:) - X(j,:)||^2 <= other_threshold )
    
    with other_threshold = -2*ls*ls*log(threshold) (a positive number, since threshold < 1).
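
    Regarding 1.), here is a small sketch of the idea with a uniform grid (a 'cell list'; simpler than a kd-tree, but with the same effect, and all names here are illustrative). With M = 3, bin the points into cubic cells whose edge length is the distance cutoff sqrt(other_threshold); point i then only needs to be compared against the points in its own cell and the 26 neighboring cells, since everything farther away fails the threshold anyway:

    #include <cmath>
    #include <cstddef>
    #include <map>
    #include <tuple>
    #include <vector>
    
    typedef std::tuple<long, long, long> Cell;
    
    // bin the N points (row-major N-by-3 array X) into cells of edge 'radius'
    std::map<Cell, std::vector<std::size_t> >
    build_cells(const std::vector<double> &X, std::size_t N, double radius)
    {
        std::map<Cell, std::vector<std::size_t> > cells;
        for (std::size_t i = 0; i < N; ++i)
            cells[std::make_tuple(long(std::floor(X[3*i + 0] / radius)),
                                  long(std::floor(X[3*i + 1] / radius)),
                                  long(std::floor(X[3*i + 2] / radius)))].push_back(i);
        return cells;
    }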
    

    If you want to run the whole calculation on the GPU, you will need the following steps:
    1.) Determine the number of nonzeros in each row
    2.) Populate the row- and column-arrays for the CSR format
    3.) Fill the entries-array in the CSR format by computing K(i,j) = f for each nonzero.

    You might get the best overall performance if you compute 1.) and 2.) on the host using good data structures, and only run the computationally intensive part 3.) on the GPU.
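
    For step 3.), a custom OpenCL kernel could look roughly as follows. This is only a sketch, not ViennaCL API: the kernel name, the one-work-item-per-row mapping and the use of float are all assumptions. X is the N-by-M data matrix in row-major order, and the row- and column-arrays are the ones produced in steps 1.) and 2.):

    __kernel void fill_csr_values(__global const float * X,                  // N-by-M data, row-major
                                  __global const unsigned int * row_start,   // CSR row offsets, N+1 entries
                                  __global const unsigned int * col_indices, // CSR column indices
                                  __global float * values,                   // CSR values to be filled
                                  unsigned int N,
                                  unsigned int M,
                                  float inv_two_ls2)                         // precomputed 1/(2*ls*ls)
    {
      // one work item per row, with a stride loop so any global size covers all rows
      for (unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
      {
        for (unsigned int k = row_start[i]; k < row_start[i+1]; ++k)
        {
          unsigned int j = col_indices[k];
          float d2 = 0;
          for (unsigned int m = 0; m < M; ++m)
          {
            float d = X[i*M + m] - X[j*M + m];
            d2 += d * d;
          }
          values[k] = exp(-d2 * inv_two_ls2);  // K(i,j) as in (1)
        }
      }
    }

    Such a kernel can then be compiled and launched through ViennaCL's custom kernel facilities (viennacl::ocl::current_context().add_program() plus viennacl::ocl::enqueue()), writing directly into the buffers backing a viennacl::compressed_matrix<>.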

    Best regards,
    Karli

     
