<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to tutorial</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>Recent changes to tutorial</description><atom:link href="https://sourceforge.net/p/cume/wiki/tutorial/feed" rel="self" type="application/rss+xml"/><language>en</language><lastBuildDate>Mon, 25 Sep 2017 06:40:32 -0000</lastBuildDate><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v17
+++ v18
@@ -76,7 +76,7 @@

 &lt;h5&gt;c) transfer of data between host and device memory&lt;/h5&gt;

-Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; methods of the array to respectively transfer data from host to device, and device to host memory.
+Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pull&lt;/code&gt; methods of the array to respectively transfer data from host to device, and device to host memory.

 &lt;h5&gt;d) redefinition of operator &amp;amp;&lt;/h5&gt;

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Mon, 25 Sep 2017 06:40:32 -0000</pubDate><guid>https://sourceforge.net6acbbd4664f028cd6e27f0da8c99cdf8c513814c</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v16
+++ v17
@@ -38,7 +38,7 @@

 1. the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
-2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory
+2. and the &lt;code&gt;cume_pull&lt;/code&gt; function to transfer data from device to host memory

 &lt;code&gt;
     int *cpu_array = new int \[100\];
@@ -47,7 +47,7 @@
     cume_push(gpu_array, cpu_array, int, 100);
     ... call kernel
     // cume_push(destination in host memory, source in device memory, type, nbr_items)
-    cume_pop(cpu_array, gpu_array, int, 100);    
+    cume_pull(cpu_array, gpu_array, int, 100);    
 &lt;/code&gt;

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Mon, 25 Sep 2017 06:21:19 -0000</pubDate><guid>https://sourceforge.netf0de65a9a385079c3bbb8e065b8b05aacfef0f48</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v15
+++ v16
@@ -8,6 +8,7 @@

 The &lt;code&gt;cume_new_var&lt;/code&gt; and &lt;code&gt;cume_new_array&lt;/code&gt; macro instructions help allocate memory on the device:

+&lt;code&gt;
     // we allocate one integer in the device memory
     int *gpu_integer;
     // cume_new_var(pointer, type) 
@@ -17,7 +18,8 @@
     int *gpu_array;
     // cume_new_array(pointer, type, nbr_items)
     cume_new_array(gpu_array, int, 100);
-    
+&lt;/code&gt;
+
 Another interesting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

     int *gpu_array;
@@ -38,7 +40,7 @@
 1. the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
 2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory

-
+&lt;code&gt;
     int *cpu_array = new int \[100\];
     cume_new_array(gpu_array, int, 100);
     // cume_push(destination in device memory, source in host memory, type, nbr_items)
@@ -46,7 +48,7 @@
     ... call kernel
     // cume_push(destination in host memory, source in device memory, type, nbr_items)
     cume_pop(cpu_array, gpu_array, int, 100);    
-
+&lt;/code&gt;

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Sat, 19 Sep 2015 12:54:53 -0000</pubDate><guid>https://sourceforge.netcaf0a25f9332586599de867986f73aecc83a2e73</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v14
+++ v15
@@ -8,25 +8,27 @@

 The &lt;code&gt;cume_new_var&lt;/code&gt; and &lt;code&gt;cume_new_array&lt;/code&gt; macro instructions help allocate memory on the device:

-    // create one integer
-    int *some_integer;
-    cume_new_var(some_integer, int);
+    // we allocate one integer in the device memory
+    int *gpu_integer;
+    // cume_new_var(pointer, type) 
+    cume_new_var(gpu_integer, int);

-    // create array of 100 integers
-    int *gpu_tab;
-    cume_new_array(gpu_tab, int, 100);
+    // allocate an array of 100 integers in device memory
+    int *gpu_array;
+    // cume_new_array(pointer, type, nbr_items)
+    cume_new_array(gpu_array, int, 100);

 Another interesting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

-    int *gpu_tab;
-    cume_new_array_zero(gpu_tab, int, 100);
+    int *gpu_array;
+    cume_new_array_zero(gpu_array, int, 100);

 &lt;h5&gt;b) memory deallocation&lt;/h5&gt;

 Use the &lt;code&gt;cume_free&lt;/code&gt; macro instruction:

-    cume_free(gpu_tab);
+    cume_free(gpu_array);

 &lt;h5&gt;b) memory transfer&lt;/h5&gt;

@@ -37,11 +39,13 @@
 2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory

-    int *cpu_tab = new int \[100\];
-    int *gpu_tab = cume_malloc(100 * sizeof(int));
-    cume_push(gpu_tab, cpu_tab, int, 100);
+    int *cpu_array = new int \[100\];
+    cume_new_array(gpu_array, int, 100);
+    // cume_push(destination in device memory, source in host memory, type, nbr_items)
+    cume_push(gpu_array, cpu_array, int, 100);
     ... call kernel
-    cume_pop(cpu_tab, gpu_tab, int, 100);    
+    // cume_push(destination in host memory, source in device memory, type, nbr_items)
+    cume_pop(cpu_array, gpu_array, int, 100);    

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;
@@ -70,7 +74,7 @@

 &lt;h5&gt;c) transfer of data between host and device memory&lt;/h5&gt;

-Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; method of the array to respectively transfer data from host to device, and device to host memory.
+Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; methods of the array to respectively transfer data from host to device, and device to host memory.

 &lt;h5&gt;d) redefinition of operator &amp;amp;&lt;/h5&gt;

@@ -90,9 +94,9 @@
 + REQUIRED_THREADS is the number of threads you need
 + GRID_TYPE is one of the constants: GRID_1, GRID_X, GRID_XY, GRID_XYZ, GRID_GUESS
 + BLOCK_TYPE is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
-+ parameters are the size of the grid and blocks following the GRID_DEFINITION and BLOCK_DEFINITION
++ parameters are the size of the grid and blocks following the GRID_TYPE and BLOCK_TYPE

-The different constants GRID_1, GRIX_X, .... have the following meaning:
+The different constants GRID_1, GRID_X, .... have the following meaning:
 + GRID_1 : a grid with 1 block
 + GRID_X : a 1D grid with several blocks on x axis (dimGrid.x &amp;gt;= 1)
 + GRID_XY : a 2D grid with several blocks on x  and y axis 
@@ -100,12 +104,19 @@

 The last constant GRID_GUESS can be combined with one of GRID_X and GRID_XY to let the Kernel class determine the correct dimension as a function of the number of required threads and the number of blocks.

-For example, if we need to work with 1024 threads with a grid of 2 x 16 and a block of 32, then we will write
+The different constants BLOCK_1, BLOCK_X, .... have the following meaning:
++ BLOCK_1 : a block with 1 thread
++ BLOCK_X : a 1D block with several threads on x axis (dimBlock.x &amp;gt;= 1)
++ BLOCK_XY : a 2D block with several threads on x  and y axis 
++ BLOCK_XYZ : a 3D block with several threads on x, y  and z axis  
+
+
+For example, if we need to work with 1024 threads using a grid of 2 x 16 blocks where each block has 32 threads, then we will write

     Kernel k(1024);
     k.configure(GRID_XY, BLOCK_X, 2, 16, 32)

-If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing that you want a 1D block of 32 threads, then use the following code:
+If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing that you want a grid of 1D blocks where each block has 32 threads, then use the following code:

     Kernel k(1027);
     k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32)
@@ -120,7 +131,7 @@
 + kernel_call_no_resource: call the kernel with No Resource
 + kernel_call: call kernel With Resource (preferred)

-The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically retrieved from the Resource using the &lt;code&gt;get_global_tid&lt;/code&gt; function.
+The difference between No Resource and With Resource is that a Kernel::Resource data structure will be passed as the first argument of the kernel and the global thread index formula will be automatically retrieved from the Resource using the &lt;code&gt;get_global_tid()&lt;/code&gt; function.

 Let's compare the two methods:

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Fri, 13 Mar 2015 12:25:24 -0000</pubDate><guid>https://sourceforge.netd780b978f72cd5d2ee10e68c0d9addf04954b175</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v13
+++ v14
@@ -105,26 +105,28 @@
     Kernel k(1024);
     k.configure(GRID_XY, BLOCK_X, 2, 16, 32)

-If you need 1027 threads and want the Kernel class to determine the size of the grid for you, then use the following code:
+If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing that you want a 1D block of 32 threads, then use the following code:

     Kernel k(1027);
-    k.configure(GRID_GUESS, BLOCK_X, 32)
+    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32)

 The grid will then be defined as type GRID_X with size 33.

-&lt;h2&gt;4. How to get global thread index inside the kernel&lt;/h2&gt;
+&lt;h2&gt;4. How to get the global thread index inside the kernel&lt;/h2&gt;

 Once you have defined the size of grid and block you can call the kernel using one of the two macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;

 + kernel_call_no_resource: call the kernel with No Resource
-+ kernel_call: call kernel With Resource
++ kernel_call: call kernel With Resource (preferred)

-The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.
+The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically retrieved from the Resource using the &lt;code&gt;get_global_tid&lt;/code&gt; function.

 Let's compare the two methods:

 &lt;h5&gt;a) call with no Resource&lt;/h5&gt;
+
+In this example you will need to use the gtid formula that corresponds to the organization of threads into grid and blocks. If we use a 1D grid composed of 1D blocks, then we will use the &lt;code&gt;cume_gtid_x_x()&lt;/code&gt; macro instruction:

     __global__ void kernel_sum(int *a, int *b, int *c, int size) {
         // **************************************************************
@@ -138,11 +140,21 @@
     }

     Kernel k(SIZE);
-    k.configure(GRID_GUESS, BLOCK_X, 32);
+    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
     kernel_call_no_resource(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());
+
+If you later switch to a 1D grid with 2D blocks, then you will need to replace the line
+   
+    int gtid = cume_gtid_x_x();
+
+with
+
+    int gtid = cume_gtid_x_xy();

 &lt;h5&gt;b) call with Resource&lt;/h5&gt;
+
+In this case, by using the &lt;code&gt;res-&amp;gt;get_global_tid()&lt;/code&gt; function you will automatically get the right formula.

     __global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
         // **************************************************************
@@ -157,6 +169,7 @@
     }

     Kernel k(SIZE);
-    k.configure(GRID_GUESS, BLOCK_X, 32);
+    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
     kernel_call(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());
-        
+
+If you later switch to a 1D grid with 2D blocks, you won't need to modify your code inside the kernel.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:37:04 -0000</pubDate><guid>https://sourceforge.net2ccf604e154bccb401b17019eef25bb939abc50b</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v12
+++ v13
@@ -16,7 +16,7 @@
     int *gpu_tab;
     cume_new_array(gpu_tab, int, 100);

-Another intersting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.
+Another interesting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

     int *gpu_tab;
     cume_new_array_zero(gpu_tab, int, 100);
@@ -92,6 +92,14 @@
 + BLOCK_TYPE is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
 + parameters are the size of the grid and blocks following the GRID_DEFINITION and BLOCK_DEFINITION

+The different constants GRID_1, GRIX_X, .... have the following meaning:
++ GRID_1 : a grid with 1 block
++ GRID_X : a 1D grid with several blocks on x axis (dimGrid.x &amp;gt;= 1)
++ GRID_XY : a 2D grid with several blocks on x  and y axis 
++ GRID_XYZ : a 3D grid with several blocks on x, y  and z axis  
+
+The last constant GRID_GUESS can be combined with one of GRID_X and GRID_XY to let the Kernel class determine the correct dimension as a function of the number of required threads and the number of blocks.
+
 For example, if we need to work with 1024 threads with a grid of 2 x 16 and a block of 32, then we will write

     Kernel k(1024);
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:27:46 -0000</pubDate><guid>https://sourceforge.netbbf7549b1a64b1e42507fb13879cffa6609003b4</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v11
+++ v12
@@ -10,11 +10,11 @@

     // create one integer
     int *some_integer;
-    cume_malloc(some_integer, int);
+    cume_new_var(some_integer, int);

     // create array of 100 integers
     int *gpu_tab;
-    cume_malloc(gpu_tab, int, 100);
+    cume_new_array(gpu_tab, int, 100);

 Another intersting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:23:41 -0000</pubDate><guid>https://sourceforge.netb0f03199cbc3a7b397548361afbfeaf051e82e0f</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v10
+++ v11
@@ -2,35 +2,47 @@

 &lt;h2&gt;1. How to use CUME functions&lt;/h2&gt;

-The &lt;code&gt;cume_base.h&lt;/code&gt; file introduces a set of functions to simplify the use of the CUDA API.
+The &lt;code&gt;cume_base.h&lt;/code&gt; file introduces a set of macro instructions to simplify the use of the CUDA API for memory allocation.

 &lt;h5&gt;a) memory allocation&lt;/h5&gt;

-The &lt;code&gt;cume_malloc&lt;/code&gt; and &lt;code&gt;cume_free&lt;/code&gt; template functions help allocate and free memory on the device:
+The &lt;code&gt;cume_new_var&lt;/code&gt; and &lt;code&gt;cume_new_array&lt;/code&gt; macro instructions help allocate memory on the device:

-    int *gpu_tab = cume_malloc(100 * sizeof(int));
-    ...
-    cume_free(tab);
+    // create one integer
+    int *some_integer;
+    cume_malloc(some_integer, int);
+
+    // create array of 100 integers
+    int *gpu_tab;
+    cume_malloc(gpu_tab, int, 100);

-Another intersting function is &lt;code&gt;cume_malloc_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_malloc&lt;/code&gt; but initializes the memory with zero bytes.
+Another intersting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

-    int *gpu_tab = cume_malloc_zero(100 * sizeof(int));
+    int *gpu_tab;
+    cume_new_array_zero(gpu_tab, int, 100);

+
+&lt;h5&gt;b) memory deallocation&lt;/h5&gt;
+
+Use the &lt;code&gt;cume_free&lt;/code&gt; macro instruction:
+
+    cume_free(gpu_tab);

 &lt;h5&gt;b) memory transfer&lt;/h5&gt;

 We use:

-* the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
-+ and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory
-
+1. the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
+2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory

     int *cpu_tab = new int \[100\];
     int *gpu_tab = cume_malloc(100 * sizeof(int));
-    cume_push(gpu_tab, cpu_tab, 100*sizeof(int));
-    
+    cume_push(gpu_tab, cpu_tab, int, 100);
+    ... call kernel
+    cume_pop(cpu_tab, gpu_tab, int, 100);    
+

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:22:08 -0000</pubDate><guid>https://sourceforge.net6484f1c19f7d6b8f15d59cbe1cfc2625cac42951</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v9
+++ v10
@@ -60,43 +60,47 @@

 Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; method of the array to respectively transfer data from host to device, and device to host memory.

-&lt;h2&gt;3. How to use the KernelConfig class to call a kernel&lt;/h2&gt;
+&lt;h5&gt;d) redefinition of operator &amp;amp;&lt;/h5&gt;
+
+The operator&amp;amp; has been overloaded and returns the address of data in the device memory.
+
+&lt;h2&gt;3. How to use the Kernel class to call a kernel&lt;/h2&gt;

 This is the most interesting class of CUME that is used to set up grid and block dimensions and call the kernel.

 First you must define the size of the grid and block:

-    KernelConfig kcfg(REQUIRED_THREADS)
-    kcfg.setup(GRID_DEFINITION, BLOCK_DEFINITION, parameters)
+    Kernel k(REQUIRED_THREADS)
+    k.configure(GRID_TYPE, BLOCK_TYPE, parameters)

 where:

 + REQUIRED_THREADS is the number of threads you need
-+ GRID_DEFINITION is one of the constants: GRID_1, GRID_X, GRID_XY, GRID_XYZ, GRID_GUESS
-+ BLOCK_DEFINITION is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
++ GRID_TYPE is one of the constants: GRID_1, GRID_X, GRID_XY, GRID_XYZ, GRID_GUESS
++ BLOCK_TYPE is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
 + parameters are the size of the grid and blocks following the GRID_DEFINITION and BLOCK_DEFINITION

 For example, if we need to work with 1024 threads with a grid of 2 x 16 and a block of 32, then we will write

-    KernelConfig kcfg(1024, GRID_XY, BLOCK_X, 2, 16, 32)
+    Kernel k(1024);
+    k.configure(GRID_XY, BLOCK_X, 2, 16, 32)

-If you need 1027 threads and want the KernelConfig class to determine the size of the grid for you, then use the following code:
+If you need 1027 threads and want the Kernel class to determine the size of the grid for you, then use the following code:

-    KernelConfig kcfg(1027, GRID_GUESS, BLOCK_X, 32)
+    Kernel k(1027);
+    k.configure(GRID_GUESS, BLOCK_X, 32)

 The grid will then be defined as type GRID_X with size 33.

 &lt;h2&gt;4. How to get global thread index inside the kernel&lt;/h2&gt;

-Once you have defined the size of grid and block you can call the kernel using one of the four macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;
+Once you have defined the size of grid and block you can call the kernel using one of the two macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;

-+ CUME_KERNEL_RUN_NR: call the kernel with No Resource
-+ CUME_KERNEL_RUN_NR_TIMER: call the kernel with No Resource and use a timer to display the execution time of the kernel on the device
-+ CUME_KERNEL_RUN_WR: call kernel With Resource
-+ CUME_KERNEL_RUN_WR_TIMER: call kernel With Resource and use a timer to display the execution time of the kernel on the device
++ kernel_call_no_resource: call the kernel with No Resource
++ kernel_call: call kernel With Resource

-The difference between NR (No Resource) and WR (With Resource) is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.
+The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.

 Let's compare the two methods:

@@ -113,14 +117,14 @@
         }
     }

-    KernelConfig kcfg(SIZE);
-    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
-    CUME_KERNEL_RUN_NR(kernel_sum, kcfg, 
-        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size()););
+    Kernel k(SIZE);
+    k.configure(GRID_GUESS, BLOCK_X, 32);
+    kernel_call_no_resource(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());
+

 &lt;h5&gt;b) call with Resource&lt;/h5&gt;

-    __global__ void kernel_sum(KernelConfig::Resource *res, int *a, int *b, int *c, int size) {
+    __global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
         // **************************************************************
         // automatically get global thread index as a function of the kernel
         // type: no need to wonder which formula to use
@@ -132,11 +136,7 @@
         }
     }

-    KernelConfig kcfg(SIZE);
-    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
-    CUME_KERNEL_RUN_WR_TIMER(kernel_sum, kcfg, 
-        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size());
+    Kernel k(SIZE);
+    k.configure(GRID_GUESS, BLOCK_X, 32);
+    kernel_call(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());

-        
-
-    
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Tue, 10 Mar 2015 09:32:38 -0000</pubDate><guid>https://sourceforge.net04342114fb754ae49590374a54418f5c45051ac7</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v8
+++ v9
@@ -66,7 +66,8 @@

 First you must define the size of the grid and block:

-    KernelConfig kcfg(REQUIRED_THREADS, GRID_DEFINITION, BLOCK_DEFINITION, parameters)
+    KernelConfig kcfg(REQUIRED_THREADS)
+    kcfg.setup(GRID_DEFINITION, BLOCK_DEFINITION, parameters)

 where:

@@ -88,4 +89,54 @@

 &lt;h2&gt;4. How to get global thread index inside the kernel&lt;/h2&gt;

-Once you have defined the 
+Once you have defined the size of grid and block you can call the kernel using one of the four macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;
+
++ CUME_KERNEL_RUN_NR: call the kernel with No Resource
++ CUME_KERNEL_RUN_NR_TIMER: call the kernel with No Resource and use a timer to display the execution time of the kernel on the device
++ CUME_KERNEL_RUN_WR: call kernel With Resource
++ CUME_KERNEL_RUN_WR_TIMER: call kernel With Resource and use a timer to display the execution time of the kernel on the device
+
+The difference between NR (No Resource) and WR (With Resource) is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.
+
+Let's compare the two methods:
+
+&lt;h5&gt;a) call with no Resource&lt;/h5&gt;
+
+    __global__ void kernel_sum(int *a, int *b, int *c, int size) {
+        // **************************************************************
+        // get global thread index with cume macro instruction
+        // **************************************************************
+        int gtid = cume_gtid_x_x();     
+        
+        if (gtid &amp;lt; size) {
+                c[gtid] = a[gtid] + b[gtid];
+        }
+    }
+
+    KernelConfig kcfg(SIZE);
+    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
+    CUME_KERNEL_RUN_NR(kernel_sum, kcfg, 
+        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size()););
+
+&lt;h5&gt;b) call with Resource&lt;/h5&gt;
+
+    __global__ void kernel_sum(KernelConfig::Resource *res, int *a, int *b, int *c, int size) {
+        // **************************************************************
+        // automatically get global thread index as a function of the kernel
+        // type: no need to wonder which formula to use
+        // **************************************************************
+        int gtid = res-&amp;gt;get_global_tid();
+        
+        if (gtid &amp;lt; size) {
+                c[gtid] = a[gtid] + b[gtid];
+        }
+    }
+
+    KernelConfig kcfg(SIZE);
+    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
+    CUME_KERNEL_RUN_WR_TIMER(kernel_sum, kcfg, 
+        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size());
+        
+        
+
+    
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 04 Mar 2015 11:29:08 -0000</pubDate><guid>https://sourceforge.netdf4f0c5dacdd60059d70fe2779f7afb0e7f49813</guid></item></channel></rss>