
Technical Developments

Introduction

Recent developments in high-performance graphics hardware have given rise to fast parallel processors capable of several hundred gigaflops. This hardware is relatively inexpensive and easy to install in most workstations, and it offers a multifold increase in potential computational power. However, it is not easily programmable, which is a concern for its use in scientific computing.

Imported from wikispaces

In early 2007, GPU manufacturers such as NVIDIA and AMD/ATI released compute coprocessors that expose certain functionality of the graphics device through new device models accessible via common low-level programming languages. The AMD/ATI Firestream GPUs are programmable through Brook+, an open-source C-based language, and allow access to the Compute Abstraction Layer (CAL), a low-level interface to the GPU. On the NVIDIA side, recent advances in its own C-based programming model (Compute Unified Device Architecture, or CUDA) have yielded promising results in terms of speedup and programmability over previous generations. We have chosen the latter architecture for development and use NVIDIA graphics boards (G80 GPU series) for testing.

While NVIDIA’s NVCC compiler for CUDA compiles both host and kernel C code, it is best suited to producing optimized programs for the GPU. This is ideal for developing small applications but not for larger and more complex programs. In this context, GPUs should be regarded as compute coprocessors that process large data sets supplied by a CPU host cluster. To address this issue, we developed a middleware library that adds GPU functionality to high-level languages such as Fortran 9x. The library contains functions that allow these languages to manipulate data on the GPU directly, perform many common mathematical operations, and support problem-specific tasks. It thereby serves as a layer of abstraction between the low-level CUDA model and the higher-level applications used in scientific computing.

What is DevObject

DevObject is a middleware library that provides an abstraction layer over the C-based CUDA architecture for Fortran 9x users in scientific computing. It is designed to prioritize ease of use for Fortran 9x users and to establish an efficient communication channel between the CPU and GPU hardware. The DevObject application model is divided into three layers: the Fortran interface, the C interface, and the CUDA kernels. The Fortran interface is visible to the user; the C environment is not, and it is responsible for managing the DevObject library itself. Typical Fortran function calls in DevObject are therefore wrappers around C code, which in turn is supported by the CUDA architecture. Kernel functions, as well as the CUBLAS and CUFFT library functions supplied by NVIDIA, are made accessible to the Fortran user through this process.

Framework

The upper level of the DevObject FORTRAN framework defines a data structure (devVar, or device variable) that encapsulates the parameters associated with device memory: addresses, data dimensions, alignment information, and memory allocation status. In practice, device variables, which are stored on the FORTRAN host side, allow data to migrate between the GPU and CPU in vector and matrix storage formats. Disparities between the memory formats on the host and device are addressed in the next section.
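To make the idea of a device variable concrete, the following is a minimal sketch of a host-side descriptor written in C. The type name `dev_var`, its field names, and the two helper functions are hypothetical and chosen only for illustration; the actual devVar layout is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side descriptor for one block of device memory.
   All names are illustrative; this is not the DevObject definition. */
typedef struct {
    uintptr_t dev_ptr;    /* device memory address, opaque on the host   */
    int       rank;       /* 1 = vector, 2 = matrix, 3 = 3-D array       */
    size_t    dims[3];    /* extent along each dimension (unused = 1)    */
    size_t    elem_size;  /* bytes per element, e.g. 4 for single prec.  */
    int       aligned;    /* nonzero if the dimensions meet alignment    */
    int       allocated;  /* nonzero once device memory is reserved      */
} dev_var;

/* Number of elements described by the variable. */
static size_t dev_var_count(const dev_var *v)
{
    assert(v != NULL && v->rank >= 1 && v->rank <= 3);
    size_t n = 1;
    for (int i = 0; i < v->rank; ++i)
        n *= v->dims[i];
    return n;
}

/* Size in bytes of the device buffer the variable describes. */
static size_t dev_var_bytes(const dev_var *v)
{
    return dev_var_count(v) * v->elem_size;
}
```

Keeping the descriptor on the host means that the Fortran side only has to pass a small record around, while the data itself stays on the device.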

DevObject FORTRAN implements a number of wrapper functions that interface with the C-level code. The C level includes the function calls used to access particular CUDA kernels, as well as loader functions that open external modules (cubin files), which allow users to develop and load their own CUDA functions.

The DevObject library also has direct access to the released single-precision CUBLAS and CUFFT library functions. Similarly, DevObject uses a combination of FORTRAN wrappers and device variables to make successive calls to these libraries.

Memory Model

One notable disparity between FORTRAN and C is their use of different storage formats for multi-dimensional arrays in linear memory. FORTRAN’s column-major format conflicts with the row-major storage format used by C and CUDA. The solution presented itself in the CUBLAS library: the library was originally developed to interface with FORTRAN and consequently uses the same column-major, 1-based indexing format. DevObject integrates these CUBLAS functions into its memory management process.
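The difference between the two storage formats can be sketched with two offset functions in C. The function names are hypothetical; the formulas are the standard row-major and column-major mappings, with the column-major version taking 1-based indices as FORTRAN does.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element (i, j) in a rows x cols array stored
   row-major (C convention); i and j are 0-based. */
static size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return i * cols + j;
}

/* Linear offset of element (i, j) in a rows x cols array stored
   column-major (FORTRAN convention); i and j are 1-based. */
static size_t col_major_offset_1based(size_t i, size_t j,
                                      size_t rows, size_t cols)
{
    assert(i >= 1 && i <= rows && j >= 1 && j <= cols);
    return (j - 1) * rows + (i - 1);
}
```

For a 3 x 4 array, the FORTRAN element A(2,3) sits at offset 7 in column-major storage, while the same element addressed from C as a[1][2] in a row-major array sits at offset 6. A buffer copied verbatim from one convention to the other is therefore read as the transpose unless the library accounts for it.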

DevObject’s memory model is optimized for fast processing on the GPU device. Several aspects of CUDA’s memory architecture are taken into account so that proper data alignment, grid/block/thread sizes, and shared memory constraints are met prior to kernel execution. In terms of the memory model, DevObject forces the dimensions of data elements on the device to align with multiples of 256 for vectors, 16x16 for 2-D arrays, and 8x8x4 for 3-D arrays. No data padding is used, due to the large overhead cost of realignment between memory transfers.
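The alignment rule can be expressed as a set of simple checks. The sketch below is written in C with hypothetical function names, and it assumes that each dimension must be a multiple of the corresponding tile extent and that the extent of 4 applies to the third dimension; the text does not state this ordering explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Vectors: length must be a positive multiple of 256. */
static int is_aligned_1d(size_t n)
{
    return n > 0 && n % 256 == 0;
}

/* 2-D arrays: both extents must be positive multiples of 16. */
static int is_aligned_2d(size_t nx, size_t ny)
{
    return nx > 0 && ny > 0 && nx % 16 == 0 && ny % 16 == 0;
}

/* 3-D arrays: extents must be positive multiples of 8, 8 and 4.
   Assumption: the extent of 4 belongs to the third dimension. */
static int is_aligned_3d(size_t nx, size_t ny, size_t nz)
{
    return nx > 0 && ny > 0 && nz > 0 &&
           nx % 8 == 0 && ny % 8 == 0 && nz % 4 == 0;
}
```

Because no padding is applied, a caller in this sketch would reject or resize data that fails the check rather than have it silently extended.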

The device variable encapsulates the parameters and attributes of the data structure transferred between host and device. The user can instantiate device variables in FORTRAN code once the library is loaded. DevObject’s memory functions allow device variables to track newly allocated space on the GPU, pass data between FORTRAN arrays and device memory, and free device variables or device memory when more space is needed.

Runtime and Driver API

The DevObject library uses the CUDA driver and runtime APIs concurrently. The low-level driver API can either link to generated host code or execute external cubin objects on the device. The high-level runtime API can only link to host code for execution. Apart from these differences, both APIs have the same capabilities in terms of kernel speed, interoperability with third-party APIs such as OpenGL and DirectX, and execution model.

The use of the two CUDA APIs addresses concerns regarding DevObject’s development and ease of use. Compared to the driver API, the runtime API offers a simplified kernel execution procedure, with implicit initialization, context management, and module management handled through its configuration syntax. Furthermore, all kernel host code generated by the NVCC compiler is based on the CUDA runtime, so driver API code that links to kernel code compiled on the host may not be streamlined. The driver API is language independent but requires explicit configuration and kernel parameters for kernel launches. Additionally, it does not provide kernel or software emulation for debugging purposes.


DevObject’s use of both APIs gives it a flexible framework for managing and launching its own internal functions while providing options for external development and incorporation of custom CUDA functions. The runtime API manages all the core and ancillary functionality related to the CUBLAS/CUFFT libraries, memory control, and data scaling. The ancillary branch includes additional integrated libraries such as the CUDA Data Parallel Primitives (CUDPP) and other matrix-based operations.

The driver API is used to load and execute cubin objects that are written externally and imported by the user. Module management of multiple cubin objects is done at the C level and remains independent of any FORTRAN processes. The DevObject library also contains a number of generic function invokers, callable from FORTRAN, that pre-compute the kernel block shape, pass the necessary arguments to the kernel object, and launch the grid for execution. User-developed CUDA functions that follow a particular parameter format are thereby executable by our kernel caller.
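The pre-computation of the block shape performed by a generic invoker might look like the following host-side sketch in C. The names `shape3`, `launch_shape`, and `plan_launch` are hypothetical, and the sketch assumes that the thread-block shapes match the alignment tiles named earlier (256, 16x16, and 8x8x4, each of which contains 256 threads). Hardware limits on grid dimensions are not checked.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t x, y, z; } shape3;
typedef struct { shape3 block; shape3 grid; } launch_shape;

/* Integer division rounded up. */
static size_t ceil_div(size_t n, size_t d)
{
    return (n + d - 1) / d;
}

/* Choose a block shape from the array rank and derive the grid that
   covers dims[]. Assumed block shapes: 256, 16x16, 8x8x4. */
static launch_shape plan_launch(int rank, const size_t dims[3])
{
    static const shape3 blocks[3] = {
        { 256, 1, 1 }, { 16, 16, 1 }, { 8, 8, 4 }
    };
    assert(rank >= 1 && rank <= 3);

    launch_shape s;
    s.block  = blocks[rank - 1];
    s.grid.x = ceil_div(dims[0], s.block.x);
    s.grid.y = rank >= 2 ? ceil_div(dims[1], s.block.y) : 1;
    s.grid.z = rank == 3 ? ceil_div(dims[2], s.block.z) : 1;
    return s;
}
```

For a 32 x 48 matrix this yields a 16x16 block and a 2 x 3 grid. With dimensions that already satisfy the alignment rule, the rounded-up division is exact, so no thread falls outside the data.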

Work Cycle

The DevObject work-cycle begins by compiling and linking our library with any existing FORTRAN code. Before the library can be used, its initialization function must be called; this selects the GPU device to use and creates the driver API context. Once the library is loaded, DevObject subroutines and functions can be called from any FORTRAN environment.

Device variables, which are DevObject’s functional link between host and device, are created by the user. The data type, dimension, and size of each device variable are specified by user parameters, and its storage is dynamically allocated in GPU global memory. Any number of device variables can be created in this fashion while memory is available on the GPU. DevObject does not garbage-collect unused device variables or device memory, so an explicit deallocation subroutine must be called to free up additional space.
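The consequence of having no garbage collection can be illustrated with a small bookkeeping model in C. This is only a host-side sketch with hypothetical names (`dev_pool`, `dev_handle`, `pool_alloc`, `pool_free`); it counts bytes against a fixed capacity and does not touch a real GPU or reproduce DevObject's allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of GPU global memory as a byte budget. */
typedef struct { size_t capacity; size_t in_use; } dev_pool;

/* Hypothetical stand-in for a device variable's allocation state. */
typedef struct { size_t bytes; int allocated; } dev_handle;

/* Reserve bytes for h; returns 1 on success, 0 if h is already
   allocated or the pool does not have enough free space. */
static int pool_alloc(dev_pool *p, dev_handle *h, size_t bytes)
{
    if (h->allocated || bytes > p->capacity - p->in_use)
        return 0;
    p->in_use   += bytes;
    h->bytes     = bytes;
    h->allocated = 1;
    return 1;
}

/* Explicitly release h; nothing is reclaimed unless this is called. */
static void pool_free(dev_pool *p, dev_handle *h)
{
    if (!h->allocated)
        return;
    p->in_use   -= h->bytes;
    h->bytes     = 0;
    h->allocated = 0;
}
```

In this model, a second allocation that does not fit keeps failing until the caller explicitly frees an earlier one, which mirrors the requirement to call the deallocation subroutine.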

Once the device variables are allocated, host data can be quickly copied onto the device. The work-cycle then consists primarily of calling successive CUDA kernels to perform operations on the GPU data. The advantage of this method lies in the faster parallel computation of large data sets, which results from asynchronous kernel execution and from avoiding any memory copies between the host and device during this time.