Name | Modified | Size | Downloads / Week |
---|---|---|---|
cume_base.cu | 2022-12-04 | 48 Bytes | |
exception.cpp | 2022-12-04 | 1.5 kB | |
exception.h | 2022-12-04 | 2.1 kB | |
cume_pinned_array.h | 2022-12-04 | 3.8 kB | |
cume_variable.h | 2022-12-04 | 1.8 kB | |
cume_zero_copy_array.h | 2022-12-04 | 3.8 kB | |
cume_kernel.cu | 2022-12-04 | 5.7 kB | |
cume_kernel.h | 2022-12-04 | 10.2 kB | |
cume_matrix.h | 2022-12-04 | 4.2 kB | |
cume_devices.cu | 2022-12-04 | 2.0 kB | |
cume_devices.h | 2022-12-04 | 1.5 kB | |
cume_gpu_timer.cu | 2022-12-04 | 682 Bytes | |
cume_gpu_timer.h | 2022-12-04 | 1.1 kB | |
cume_base.h | 2022-12-04 | 6.9 kB | |
cume_cpu_timer.cpp | 2022-12-04 | 1.4 kB | |
cume_cpu_timer.h | 2022-12-04 | 1.5 kB | |
cume.h | 2022-12-04 | 804 Bytes | |
cume_array.h | 2022-12-04 | 4.4 kB | |
Totals: 18 Items | | 53.6 kB | 0 |

=====================================================================
CUME (CUda Made Easy) version 2.0 - 2023
Author: Jean-Michel RICHER
email: jean-michel.richer@univ-angers.fr
http://www.info.univ-angers.fr/pub/richer/
=====================================================================

--------------
What is CUME ?
--------------

CUME stands for CUda Made Easy and aims to simplify the writing of
CUDA C/C++ code.

Version 1.0 was created in 2015.
Version 2.0 was created in 2022.

------------------------------
What do you mean by simplify ?
------------------------------

The major drawbacks of the CUDA API are:

1) the return value of each call to a CUDA function must be checked
   in order to determine whether an error occurred before, during or
   after the call to the kernel

2) there is no automatic way to get the "global thread index" (gtid)
   of each thread: the formula for the gtid depends on the grid and
   block configuration

3) allocating, freeing and copying arrays is a boring and unproductive
   task

4) the notion of host and device memory is troublesome when you have
   to call cudaMemcpy(adr1, adr2, size, cudaMemcpyDeviceToHost or
   cudaMemcpyHostToDevice)

5) some configuration parameters are hardware dependent: for example,
   when defining the block of a kernel call you must not exceed the
   maximum number of threads per block allowed by the device or you
   will get an error. It is therefore necessary to check that the
   size of the block respects the constraint of the hardware

----------------------------------------
To answer all these drawbacks we propose
----------------------------------------

1) a cuda_check macro instruction that is used to check the return
   value of every call to the CUDA API, and other macro instructions
   that simplify calls to the CUDA API (see file src/cume_base.h)

2) a Kernel class that helps set up or determine the size of the grid
   and block parameters. This class can also be used with the macro
   instruction kernel_call (see file src/cume_kernel.h) in order to
   automatically get the right formula to obtain the global thread
   index

3) an Array class that handles data in host and device memory
   (see the file src/cume_array.h)

4) push and pop methods: push transfers data from host to device
   memory and pop transfers data from device to host memory

--------------------
Comparison with CUDA
--------------------

To compare the CUME version with the plain CUDA API we provide three
versions of the sum of vectors using different techniques:

- examples/compute/cuda_vector_sum.cu : CUDA API
- examples/compute/cume_vector_sum_no_array.cu : first version using
  CUME without the use of Array
- examples/compute/cume_vector_sum_array.cu : second version using
  CUME with the use of the Array class and Resource

We also provide the file tests/test_kernel_configs.cu that tests the
different formulae used to obtain the global thread index.

--------------------
What you can do
--------------------

Type 'make' to generate binaries
Type 'make run_tests' to run a simple test
Type 'make run_examples' to run all the examples
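
To make the "global thread index" problem concrete, here is a minimal host-side C++ sketch of two of the usual formulae, plus the block-size check mentioned among the drawbacks. It is not CUME code. The struct Dim3 and the functions gtid_1d_1d, gtid_2d_2d, blocks_for and block_fits are hypothetical names chosen for illustration. In a real kernel the four index arguments would be the CUDA built-ins gridDim, blockIdx, blockDim and threadIdx, and the thread limit would typically be read from the maxThreadsPerBlock field of cudaDeviceProp.

```cpp
// Host-side stand-in for CUDA's dim3 / uint3 built-in types
// (assumption: illustration only, not part of CUME).
struct Dim3 {
    unsigned x, y, z;
};

// 1D grid of 1D blocks:
//   gtid = blockIdx.x * blockDim.x + threadIdx.x
constexpr unsigned gtid_1d_1d(Dim3 block_idx, Dim3 block_dim, Dim3 thread_idx) {
    return block_idx.x * block_dim.x + thread_idx.x;
}

// 2D grid of 2D blocks: linearise the block inside the grid (row-major),
// multiply by the number of threads per block, then add the linearised
// position of the thread inside its block.
constexpr unsigned gtid_2d_2d(Dim3 grid_dim, Dim3 block_idx,
                              Dim3 block_dim, Dim3 thread_idx) {
    return (block_idx.y * grid_dim.x + block_idx.x)   // block rank in grid
               * (block_dim.x * block_dim.y)          // threads per block
           + (thread_idx.y * block_dim.x + thread_idx.x);  // rank in block
}

// Number of blocks needed to cover n elements (ceiling division).
constexpr unsigned blocks_for(unsigned n, unsigned threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// True if the block does not exceed the per-block thread limit of the
// device; the limit is passed in because this sketch does not query the GPU.
constexpr bool block_fits(Dim3 block_dim, unsigned max_threads_per_block) {
    return block_dim.x * block_dim.y * block_dim.z <= max_threads_per_block;
}
```

For example, with blocks of 256 threads, thread 5 of block 2 gets gtid_1d_1d = 2 * 256 + 5 = 517, and blocks_for(1000, 256) gives 4 blocks. The two gtid formulae differ although they serve the same purpose, which is why a macro that picks the right one from the kernel configuration is convenient.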