I'm interested in accelerating JAGS using accelerators such as GPUs. Can someone give me tips on where to start? I'm looking for compute-intensive parts of JAGS that could be ported. Please let me know how to approach this.
Well, that's an interesting question. I have looked into GPU programming a little. I have the book by David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors, and have taken the Coursera course on Heterogeneous Parallel Programming, which covers the first few chapters of the book.
My initial impressions are not very encouraging. GPU programming is a very low-level exercise that requires you to think about the hardware in a very concrete way. This is not much fun, but more importantly, it seems to make efficient MCMC rather difficult.
Efficient GPU programming requires (1) avoiding control divergence between threads in the same warp and (2) optimizing memory access by having threads in the same warp access adjacent memory locations. Furthermore, problems amenable to GPU programming tend to have a (multi-dimensional) array structure, both because of the way threads are indexed and because of constraint (2) above.
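To make (1) and (2) concrete, here is an illustrative CUDA sketch (the kernels are hypothetical, not JAGS code):

```cuda
// Illustrative only -- not JAGS code. Shows the two pitfalls above.

// BAD: threads in the same warp take different branches (control
// divergence), so the warp executes both paths one after the other.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)           // even/odd split within a warp
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }
}

// GOOD: every thread follows the same path, and thread i touches
// element i, so a warp's loads fall in adjacent memory locations
// (coalesced access).
__global__ void coalesced(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = 2.0f * x[i] + 1.0f;
}
```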
In the context of JAGS, these are quite serious constraints. You may find an application with a distribution defined on a large array that can be broken down into easily parallelizable chunks. For example, the cudaBayesreg package for R does MCMC on fMRI data.
You also need to think about transferring data between CPU and GPU memory. This has to be done at each iteration, so it may become a performance bottleneck. You would also need to re-implement the L'Ecuyer RNG on the GPU, as the current JAGS API assumes one stream of random numbers per thread.
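To illustrate the transfer cost, here is a hedged sketch of what each iteration might look like; `update_on_gpu` and `accept_reject_on_cpu` are hypothetical placeholders, not JAGS functions:

```cuda
// Illustrative host-side loop -- not JAGS code.
for (int iter = 0; iter < niter; ++iter) {
    // Push the current state of the chain to the device ...
    cudaMemcpy(d_state, h_state, nbytes, cudaMemcpyHostToDevice);

    // ... run the parallelizable part of the update ...
    update_on_gpu<<<nblocks, nthreads>>>(d_state, n);

    // ... and pull the result back for the sequential part.
    cudaMemcpy(h_state, d_state, nbytes, cudaMemcpyDeviceToHost);
    accept_reject_on_cpu(h_state, n);
}
// Two PCIe transfers per iteration: unless the state is large, the
// copies can easily dominate the kernel time.
```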
For a list of resources on GPU programming for statistics, see the GPUSS home page. Note that many of the links concern population Monte Carlo, which is inherently easier to parallelize.
I'm not any kind of GPU programming expert, but are you aware of OpenCL? My understanding is that it provides a way to program against compute resources, including CPUs and GPUs, in a hardware-agnostic way (although not necessarily at a higher level of abstraction). This might allow something like JAGS to be modified to run very quickly without necessarily sacrificing portability.
Thanks for your comments.
I agree that the GPU architecture is not well suited to accelerating JAGS in general. But I see BLAS GEMM and some LAPACK functions being used. Do these calls involve matrices large enough to benefit from GPU acceleration?
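For reference, offloading a GEMM call would look roughly like this with cuBLAS (a sketch only, assuming column-major matrices as in the reference BLAS, with error checks omitted):

```cuda
#include <cublas_v2.h>

// Sketch: C = alpha * A * B + beta * C on the GPU via cuBLAS.
// d_A, d_B, d_C are device pointers already holding the matrices;
// A is m x k, B is k x n, C is m x n, all column-major.
void gpu_dgemm(cublasHandle_t handle,
               int m, int n, int k,
               const double *d_A, const double *d_B, double *d_C)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m,   /* lda = m */
                        d_B, k,   /* ldb = k */
                &beta,  d_C, m);  /* ldc = m */
}
```

Whether this pays off depends entirely on the matrix sizes; for the small matrices typical of many models, the host-to-device transfers would dominate.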
Also, there are several CUDA-based random number generators available. Could we use one from the NVIDIA cuRAND library (https://developer.nvidia.com/curand)?
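cuRAND does include L'Ecuyer's MRG32k3a generator. As a hedged sketch of its device API, each thread could be given its own subsequence of the stream, e.g.:

```cuda
#include <curand_kernel.h>

// Sketch using the cuRAND device API with MRG32k3a (L'Ecuyer's
// combined multiple-recursive generator). curand_init places each
// thread on its own subsequence of the stream.
__global__ void setup(curandStateMRG32k3a_t *states,
                      unsigned long long seed, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        curand_init(seed, /* subsequence = */ i, /* offset = */ 0,
                    &states[i]);
}

__global__ void draw_uniform(curandStateMRG32k3a_t *states,
                             double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = curand_uniform_double(&states[i]);
}
```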
I'm actually interested in profiling JAGS on a sample model to see which parts can be GPU-accelerated. It would be great if you could point me to sections of the code (such as the sgemm calls) that are compute-intensive and suited to GPU acceleration.
Thanks for your help! Any comments are welcome.