Cristiano Bozza - 2012-10-28

GPU Interruptible Kernels

The problem

I am currently porting many of my algorithms (mostly for scientific data processing) to GPUs. In a few months, having tested several NVIDIA boards ranging from ION to GeForce GTX 590, 620 and Tesla C2050, I realized that I recurrently face the following two problems:
1) too many parameters/variables in a kernel;
2) kernel execution time is too long and the launch times out.
The former is particularly annoying when you develop your code in 32-bit and then switch to 64-bit. All pointers grow in size, and suddenly the kernel that used to work needs to be reshaped.
The latter issue is normally more worrying, because for some algorithms you can't really predict how many iterations they will need. This happens when loops depend on data, such as for ordering algorithms, or for superstructure recognition (e.g. image analysis). Indeed, one wants to use the GPU because it provides a lot of processing power for a problem that is intrinsically computationally intensive.
The actual timeout, in terms of loop iterations, depends on the board and on its available power. In any case, you can always produce a dataset that exceeds the usual timeouts.

The solution

If you want your code to be portable across several versions of the CUDA driver and on several boards, you have to be prepared to interrupt and relaunch kernels before they run out of time. There must be a mechanism that allows the kernel to save its computation state across termination and relaunching. A specific memory area is allocated to each thread of each block to store its status, as would occur in a cooperative multithreading environment. The kernel itself decides when and where it is safe to "pause" the computation if needed. The memory area for local variables also solves the problem of storing many variables: data stored there is not accessed as fast as registers, but careful design can keep the overhead low.
Each interruptible kernel receives precisely two parameters: a pointer to the general arguments of the call and a pointer to the memory area of local variables. The kernel should use a macro (implemented as a #define) for initialization, and soon after check whether its task was already completed in a previous launch or whether there is still work to do. In the latter case, execution should jump to where it was interrupted. The interruption point is stored in a member of the struct that contains the local variables. Jumping to the interruption point is obtained with a "goto" statement, hence one must take care that all variables that are not stored in the struct are properly initialized. All kernel exit points must also use a proper macro to record the status of the execution and whether a relaunch is needed.
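As a rough illustration of this mechanism, here is a CPU-side C++ sketch (not the library's code; the Status fields, the run_slice function and the step budget are invented for the example) of a task that keeps its loop state in a status struct, pauses when a step budget runs out, and resumes via goto on the next call:

```cpp
#include <cassert>

// Everything persistent lives in the status struct; nothing survives
// in local variables across "launches".
struct Status {
    int resume_at;   // 0 = fresh start, >0 = interrupt point to resume from
    bool done;       // set when the whole task has completed
    int i;           // persistent loop counter
    int acc;         // persistent partial result
};

// Returns true when the task is finished; pauses after 'budget' steps.
bool run_slice(Status* ps, int maxcount, int budget) {
    if (ps->done) return true;                 // already completed earlier
    if (ps->resume_at == 1) goto resume1;      // jump back into the loop
    ps->acc = 0;
    for (ps->i = 0; ps->i < maxcount; ps->i++) {
        if (budget-- == 0) {                   // out of time: save and pause
            ps->resume_at = 1;
            return false;
        }
resume1:
        ps->acc += ps->i;                      // "do some work"
    }
    ps->done = true;
    return true;
}
```

Note that the goto deliberately skips the loop initialization: since the counter lives in the struct, the loop picks up exactly where it stopped, which is the same trick the kernel macros rely on.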
While a CUDA kernel can usually be launched by the <<< >>> syntax, this library provides a template class that wraps three elements in a single unit:
1) Argument struct;
2) Status struct (including local variables);
3) kernel function to be executed.
Execution is managed by the "Launch" method, which allows specifying a maximum number of interrupt points to be hit before execution stops, and which takes care that not too many launches are required. This offers a way to monitor lengthy operations without risking a driver reset. "Launch" also allows limiting the number of launches.
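The relaunch policy can be pictured with the following CPU-side C++ sketch (all names here, such as fake_launch and launch_until_done, are invented for illustration and are not the library's API): the host keeps reissuing bounded launches until every thread reports completion or a launch cap is hit.

```cpp
#include <cassert>

struct ThreadStatus { bool done; int progress; };

// Stand-in for one bounded kernel launch: each unfinished thread
// advances its work by at most 'step' units before stopping.
void fake_launch(ThreadStatus* ts, int n, int total_work, int step) {
    for (int t = 0; t < n; t++) {
        if (ts[t].done) continue;
        ts[t].progress += step;
        if (ts[t].progress >= total_work) ts[t].done = true;
    }
}

// Returns the number of launches used, or -1 if max_launches was
// exceeded before all threads completed.
int launch_until_done(ThreadStatus* ts, int n, int total_work,
                      int step, int max_launches) {
    for (int launches = 1; launches <= max_launches; launches++) {
        fake_launch(ts, n, total_work, step);
        bool all_done = true;
        for (int t = 0; t < n; t++) all_done = all_done && ts[t].done;
        if (all_done) return launches;
    }
    return -1;  // hit the launch cap before completing
}
```

The point of the cap is exactly the trade-off described above: short launches keep the driver watchdog happy and give the host a chance to report progress, while the cap bounds the total relaunch overhead.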

The files

This library consists of a single C++ header named gpu_interruptible_kernels.h. It comes with a file called "example.cu" that shows a trivial case of a kernel that needs to be reshaped for safe execution. In this example only one breakpoint is used, but many more can be used.

How to use the library

1) Include the C++ header in your project.
2) Define which kernels need to be reshaped to become interruptible.
3) Each reshaped kernel will have the following elements:
a) _IKGPU_PROLOG(pargs, pstatus) as the first instruction.
b) initialization of variables that are forgotten across launches.
c) for each interrupt checkpoint, a call to the _IKGPU_RESUMEFROM(at, pstatus) macro. The at parameter defines the interrupt checkpoint to resume execution from.
d) every place where the kernel has its persistent variables in a consistent state is eligible to become an interrupt point. However, interrupt points increase thread divergence and add overhead, so try not to place too many. Interrupt points are marked by _IKGPU_INTERRUPT(at, pstatus, pargs).
e) an _IKGPU_END(pstatus) macro must guard every exit point of the kernel.

A typical kernel would look like:


__global__ void mykernel(myarg * pargs, mystatus * pstatus)
{
    _IKGPU_PROLOG(pargs, pstatus);
    _IKGPU_RESUMEFROM(1, pstatus);
    _IKGPU_RESUMEFROM(2, pstatus);
    for (pstatus->i = 0; pstatus->i < pargs->maxcount; pstatus->i++)
    {
        _IKGPU_INTERRUPT(1, pstatus, pargs);
        /* do some work */
        for (pstatus->j = 0; pstatus->j < pargs->maxcount; pstatus->j++)
        {
            _IKGPU_INTERRUPT(2, pstatus, pargs);
            /* do some work */
        }
    }
    _IKGPU_END(pstatus);
}
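For intuition only, macros of this kind could plausibly be built around labels and goto, roughly as in the following CPU-side C++ sketch (the IK_* macros, the budget parameter and the KStatus struct are invented here and do not reproduce the library's actual _IKGPU_* definitions):

```cpp
#include <cassert>

// Hypothetical expansions: RESUMEFROM jumps to a per-checkpoint label,
// INTERRUPT saves the checkpoint id and returns when the step budget
// runs out, END marks the task as completed.
#define IK_PROLOG(ps)         if ((ps)->done) return
#define IK_RESUMEFROM(at, ps) if ((ps)->resume_at == (at)) goto ik_label_##at
#define IK_INTERRUPT(at, ps, budget) \
    do { if ((budget)-- <= 0) { (ps)->resume_at = (at); return; } } while (0); \
    ik_label_##at:
#define IK_END(ps)            (ps)->done = true

struct KStatus { bool done; int resume_at; int i; int acc; };

// CPU stand-in for an interruptible kernel: sums 0..maxcount-1,
// pausing whenever the step budget is exhausted.
void kernel_like(KStatus* ps, int maxcount, int budget) {
    IK_PROLOG(ps);
    IK_RESUMEFROM(1, ps);
    for (ps->i = 0; ps->i < maxcount; ps->i++) {
        IK_INTERRUPT(1, ps, budget);
        ps->acc += ps->i;      // "do some work"
    }
    IK_END(ps);
}
```

This also makes the rule from the usage list concrete: the goto lands past the loop initialization, so anything the resumed code reads must live in the status struct.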

In order to use such kernels, a class is needed to handle their first launch and possible relaunches. The class should be something like:

IntKernel<myarg, mystatus, mykernel> Launcher;

and argument initialization would look like:


Launcher.Arguments.maxcount = 100;

The kernel would be launched by the following line:

Launcher.Launch(iblocks, ithreads);

Additional parameters are available for the Launch method to control the maximum number of iterations and the frequency of interrupts. More information is available in the distributed files.

Have fun!


Last edit: Cristiano Bozza 2012-10-29