<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to tutorial</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>Recent changes to tutorial</description><atom:link href="https://sourceforge.net/p/cume/wiki/tutorial/feed" rel="self" type="application/rss+xml"/><language>en</language><lastBuildDate>Mon, 25 Sep 2017 06:40:32 -0000</lastBuildDate><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v17
+++ v18
@@ -76,7 +76,7 @@

 &lt;h5&gt;c) transfer of data between host and device memory&lt;/h5&gt;

-Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; methods of the array to respectively transfer data from host to device, and device to host memory.
+Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pull&lt;/code&gt; methods of the array to respectively transfer data from host to device, and device to host memory.

 &lt;h5&gt;d) redefinition of operator &amp;amp;&lt;/h5&gt;

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Mon, 25 Sep 2017 06:40:32 -0000</pubDate><guid>https://sourceforge.net6acbbd4664f028cd6e27f0da8c99cdf8c513814c</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v16
+++ v17
@@ -38,7 +38,7 @@

 1. the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
-2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory
+2. and the &lt;code&gt;cume_pull&lt;/code&gt; function to transfer data from device to host memory

 &lt;code&gt;
     int *cpu_array = new int \[100\];
@@ -47,7 +47,7 @@
     cume_push(gpu_array, cpu_array, int, 100);
     ... call kernel
     // cume_push(destination in host memory, source in device memory, type, nbr_items)
-    cume_pop(cpu_array, gpu_array, int, 100);    
+    cume_pull(cpu_array, gpu_array, int, 100);    
 &lt;/code&gt;

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Mon, 25 Sep 2017 06:21:19 -0000</pubDate><guid>https://sourceforge.netf0de65a9a385079c3bbb8e065b8b05aacfef0f48</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v15
+++ v16
@@ -8,6 +8,7 @@

 The &lt;code&gt;cume_new_var&lt;/code&gt; and &lt;code&gt;cume_new_array&lt;/code&gt; macro instructions help allocate memory on the device:

+&lt;code&gt;
     // we allocate one integer in the device memory
     int *gpu_integer;
     // cume_new_var(pointer, type) 
@@ -17,7 +18,8 @@
     int *gpu_array;
     // cume_new_array(pointer, type, nbr_items)
     cume_new_array(gpu_array, int, 100);
-    
+&lt;/code&gt;
+
 Another interesting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

     int *gpu_array;
@@ -38,7 +40,7 @@
 1. the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
 2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory

-
+&lt;code&gt;
     int *cpu_array = new int \[100\];
     cume_new_array(gpu_array, int, 100);
     // cume_push(destination in device memory, source in host memory, type, nbr_items)
@@ -46,7 +48,7 @@
     ... call kernel
     // cume_push(destination in host memory, source in device memory, type, nbr_items)
     cume_pop(cpu_array, gpu_array, int, 100);    
-
+&lt;/code&gt;

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Sat, 19 Sep 2015 12:54:53 -0000</pubDate><guid>https://sourceforge.netcaf0a25f9332586599de867986f73aecc83a2e73</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v14
+++ v15
@@ -8,25 +8,27 @@

 The &lt;code&gt;cume_new_var&lt;/code&gt; and &lt;code&gt;cume_new_array&lt;/code&gt; macro instructions help allocate memory on the device:

-    // create one integer
-    int *some_integer;
-    cume_new_var(some_integer, int);
+    // we allocate one integer in the device memory
+    int *gpu_integer;
+    // cume_new_var(pointer, type) 
+    cume_new_var(gpu_integer, int);

-    // create array of 100 integers
-    int *gpu_tab;
-    cume_new_array(gpu_tab, int, 100);
+    // allocate an array of 100 integers in device memory
+    int *gpu_array;
+    // cume_new_array(pointer, type, nbr_items)
+    cume_new_array(gpu_array, int, 100);

 Another interesting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

-    int *gpu_tab;
-    cume_new_array_zero(gpu_tab, int, 100);
+    int *gpu_array;
+    cume_new_array_zero(gpu_array, int, 100);

 &lt;h5&gt;b) memory deallocation&lt;/h5&gt;

 Use the &lt;code&gt;cume_free&lt;/code&gt; macro instruction:

-    cume_free(gpu_tab);
+    cume_free(gpu_array);

 &lt;h5&gt;b) memory transfer&lt;/h5&gt;

@@ -37,11 +39,13 @@
 2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory

-    int *cpu_tab = new int \[100\];
-    int *gpu_tab = cume_malloc(100 * sizeof(int));
-    cume_push(gpu_tab, cpu_tab, int, 100);
+    int *cpu_array = new int \[100\];
+    cume_new_array(gpu_array, int, 100);
+    // cume_push(destination in device memory, source in host memory, type, nbr_items)
+    cume_push(gpu_array, cpu_array, int, 100);
     ... call kernel
-    cume_pop(cpu_tab, gpu_tab, int, 100);    
+    // cume_push(destination in host memory, source in device memory, type, nbr_items)
+    cume_pop(cpu_array, gpu_array, int, 100);    

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;
@@ -70,7 +74,7 @@

 &lt;h5&gt;c) transfer of data between host and device memory&lt;/h5&gt;

-Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; method of the array to respectively transfer data from host to device, and device to host memory.
+Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; methods of the array to respectively transfer data from host to device, and device to host memory.

 &lt;h5&gt;d) redefinition of operator &amp;amp;&lt;/h5&gt;

@@ -90,9 +94,9 @@
 + REQUIRED_THREADS is the number of threads you need
 + GRID_TYPE is one of the constants: GRID_1, GRID_X, GRID_XY, GRID_XYZ, GRID_GUESS
 + BLOCK_TYPE is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
-+ parameters are the size of the grid and blocks following the GRID_DEFINITION and BLOCK_DEFINITION
++ parameters are the size of the grid and blocks following the GRID_TYPE and BLOCK_TYPE

-The different constants GRID_1, GRIX_X, .... have the following meaning:
+The different constants GRID_1, GRID_X, .... have the following meaning:
 + GRID_1 : a grid with 1 block
 + GRID_X : a 1D grid with several blocks on x axis (dimGrid.x &amp;gt;= 1)
 + GRID_XY : a 2D grid with several blocks on x  and y axis 
@@ -100,12 +104,19 @@

 The last constant GRID_GUESS can be combined with one of GRID_X and GRID_XY to let the Kernel class determine the correct dimension as a function of the number of required threads and the number of blocks.

-For example, if we need to work with 1024 threads with a grid of 2 x 16 and a block of 32, then we will write
+The different constants BLOCK_1, BLOCK_X, .... have the following meaning:
++ BLOCK_1 : a block with 1 thread
++ BLOCK_X : a 1D block with several threads on x axis (dimBlock.x &amp;gt;= 1)
++ BLOCK_XY : a 2D block with several threads on x  and y axis 
++ BLOCK_XYZ : a 3D block with several threads on x, y  and z axis  
+
+
+For example, if we need to work with 1024 threads using a grid of 2 x 16 blocks where each block has 32 threads, then we will write

     Kernel k(1024);
     k.configure(GRID_XY, BLOCK_X, 2, 16, 32)

-If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing that you want a 1D block of 32 threads, then use the following code:
+If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing that you want a grid of 1D blocks where each block has 32 threads, then use the following code:

     Kernel k(1027);
     k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32)
@@ -120,7 +131,7 @@
 + kernel_call_no_resource: call the kernel with No Resource
 + kernel_call: call kernel With Resource (preferred)

-The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically retrieved from the Resource using the &lt;code&gt;get_global_tid&lt;/code&gt; function.
+The difference between No Resource and With Resource is that a Kernel::Resource data structure will be passed as the first argument of the kernel and the global thread index formula will be automatically retrieved from the Resource using the &lt;code&gt;get_global_tid()&lt;/code&gt; function.

 Let's compare the two methods:

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Fri, 13 Mar 2015 12:25:24 -0000</pubDate><guid>https://sourceforge.netd780b978f72cd5d2ee10e68c0d9addf04954b175</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v13
+++ v14
@@ -105,26 +105,28 @@
     Kernel k(1024);
     k.configure(GRID_XY, BLOCK_X, 2, 16, 32)

-If you need 1027 threads and want the Kernel class to determine the size of the grid for you, then use the following code:
+If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing that you want a 1D block of 32 threads, then use the following code:

     Kernel k(1027);
-    k.configure(GRID_GUESS, BLOCK_X, 32)
+    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32)

 The grid will then be defined as type GRID_X with size 33.

-&lt;h2&gt;4. How to get global thread index inside the kernel&lt;/h2&gt;
+&lt;h2&gt;4. How to get the global thread index inside the kernel&lt;/h2&gt;

 Once you have defined the size of grid and block you can call the kernel using one of the two macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;

 + kernel_call_no_resource: call the kernel with No Resource
-+ kernel_call: call kernel With Resource
++ kernel_call: call kernel With Resource (preferred)

-The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.
+The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically retrieved from the Resource using the &lt;code&gt;get_global_tid&lt;/code&gt; function.

 Let's compare the two methods:

 &lt;h5&gt;a) call with no Resource&lt;/h5&gt;
+
+In this example you will need to use the gtid formula that corresponds to the organization of threads into grid and blocks. If we use a 1D grid composed of 1D blocks, then we will use the &lt;code&gt;cume_gtid_x_x()&lt;/code&gt; macro instruction:

     __global__ void kernel_sum(int *a, int *b, int *c, int size) {
         // **************************************************************
@@ -138,11 +140,21 @@
     }

     Kernel k(SIZE);
-    k.configure(GRID_GUESS, BLOCK_X, 32);
+    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
     kernel_call_no_resource(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());
+
+If you later switch to a 1D grid with 2D blocks, then you will need to replace the line
+   
+    int gtid = cume_gtid_x_x();
+
+with
+
+    int gtid = cume_gtid_x_xy();

 &lt;h5&gt;b) call with Resource&lt;/h5&gt;
+
+In this case, by using the &lt;code&gt;res-&amp;gt;get_global_tid()&lt;/code&gt; function you will automatically get the right formula.

     __global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
         // **************************************************************
@@ -157,6 +169,7 @@
     }

     Kernel k(SIZE);
-    k.configure(GRID_GUESS, BLOCK_X, 32);
+    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
     kernel_call(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());
-        
+
+If you later switch to a 1D grid with 2D blocks, you won't need to modify your code inside the kernel.
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:37:04 -0000</pubDate><guid>https://sourceforge.net2ccf604e154bccb401b17019eef25bb939abc50b</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v12
+++ v13
@@ -16,7 +16,7 @@
     int *gpu_tab;
     cume_new_array(gpu_tab, int, 100);

-Another intersting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.
+Another interesting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

     int *gpu_tab;
     cume_new_array_zero(gpu_tab, int, 100);
@@ -92,6 +92,14 @@
 + BLOCK_TYPE is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
 + parameters are the size of the grid and blocks following the GRID_DEFINITION and BLOCK_DEFINITION

+The different constants GRID_1, GRIX_X, .... have the following meaning:
++ GRID_1 : a grid with 1 block
++ GRID_X : a 1D grid with several blocks on x axis (dimGrid.x &amp;gt;= 1)
++ GRID_XY : a 2D grid with several blocks on x  and y axis 
++ GRID_XYZ : a 3D grid with several blocks on x, y  and z axis  
+
+The last constant GRID_GUESS can be combined with one of GRID_X and GRID_XY to let the Kernel class determine the correct dimension as a function of the number of required threads and the number of blocks.
+
 For example, if we need to work with 1024 threads with a grid of 2 x 16 and a block of 32, then we will write

     Kernel k(1024);
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:27:46 -0000</pubDate><guid>https://sourceforge.netbbf7549b1a64b1e42507fb13879cffa6609003b4</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v11
+++ v12
@@ -10,11 +10,11 @@

     // create one integer
     int *some_integer;
-    cume_malloc(some_integer, int);
+    cume_new_var(some_integer, int);

     // create array of 100 integers
     int *gpu_tab;
-    cume_malloc(gpu_tab, int, 100);
+    cume_new_array(gpu_tab, int, 100);

 Another intersting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:23:41 -0000</pubDate><guid>https://sourceforge.netb0f03199cbc3a7b397548361afbfeaf051e82e0f</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v10
+++ v11
@@ -2,35 +2,47 @@

 &lt;h2&gt;1. How to use CUME functions&lt;/h2&gt;

-The &lt;code&gt;cume_base.h&lt;/code&gt; file introduces a set of functions to simplify the use of the CUDA API.
+The &lt;code&gt;cume_base.h&lt;/code&gt; file introduces a set of macro instructions to simplify the use of the CUDA API for memory allocation.

 &lt;h5&gt;a) memory allocation&lt;/h5&gt;

-The &lt;code&gt;cume_malloc&lt;/code&gt; and &lt;code&gt;cume_free&lt;/code&gt; template functions help allocate and free memory on the device:
+The &lt;code&gt;cume_new_var&lt;/code&gt; and &lt;code&gt;cume_new_array&lt;/code&gt; macro instructions help allocate memory on the device:

-    int *gpu_tab = cume_malloc(100 * sizeof(int));
-    ...
-    cume_free(tab);
+    // create one integer
+    int *some_integer;
+    cume_malloc(some_integer, int);
+
+    // create array of 100 integers
+    int *gpu_tab;
+    cume_malloc(gpu_tab, int, 100);

-Another intersting function is &lt;code&gt;cume_malloc_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_malloc&lt;/code&gt; but initializes the memory with zero bytes.
+Another intersting function is &lt;code&gt;cume_new_array_zero&lt;/code&gt; which has the same behavior as &lt;code&gt;cume_new_array&lt;/code&gt; but initializes the memory with zero bytes.

-    int *gpu_tab = cume_malloc_zero(100 * sizeof(int));
+    int *gpu_tab;
+    cume_new_array_zero(gpu_tab, int, 100);

+
+&lt;h5&gt;b) memory deallocation&lt;/h5&gt;
+
+Use the &lt;code&gt;cume_free&lt;/code&gt; macro instruction:
+
+    cume_free(gpu_tab);

 &lt;h5&gt;b) memory transfer&lt;/h5&gt;

 We use:

-* the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
-+ and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory
-
+1. the &lt;code&gt;cume_push&lt;/code&gt; function to transfer data from host to device memory
+2. and the &lt;code&gt;cume_pop&lt;/code&gt; function to transfer data from device to host memory

     int *cpu_tab = new int \[100\];
     int *gpu_tab = cume_malloc(100 * sizeof(int));
-    cume_push(gpu_tab, cpu_tab, 100*sizeof(int));
-    
+    cume_push(gpu_tab, cpu_tab, int, 100);
+    ... call kernel
+    cume_pop(cpu_tab, gpu_tab, int, 100);    
+

 &lt;h2&gt;2. How to use the CUME Array class to handle arrays&lt;/h2&gt;

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 11 Mar 2015 16:22:08 -0000</pubDate><guid>https://sourceforge.net6484f1c19f7d6b8f15d59cbe1cfc2625cac42951</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v9
+++ v10
@@ -60,43 +60,47 @@

 Use the &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pop&lt;/code&gt; method of the array to respectively transfer data from host to device, and device to host memory.

-&lt;h2&gt;3. How to use the KernelConfig class to call a kernel&lt;/h2&gt;
+&lt;h5&gt;d) redefinition of operator &amp;amp;&lt;/h5&gt;
+
+The operator&amp;amp; has been overloaded and returns the address of data in the device memory.
+
+&lt;h2&gt;3. How to use the Kernel class to call a kernel&lt;/h2&gt;

 This is the most interesting class of CUME that is used to set up grid and block dimensions and call the kernel.

 First you must define the size of the grid and block:

-    KernelConfig kcfg(REQUIRED_THREADS)
-    kcfg.setup(GRID_DEFINITION, BLOCK_DEFINITION, parameters)
+    Kernel k(REQUIRED_THREADS)
+    k.configure(GRID_TYPE, BLOCK_TYPE, parameters)

 where:

 + REQUIRED_THREADS is the number of threads you need
-+ GRID_DEFINITION is one of the constants: GRID_1, GRID_X, GRID_XY, GRID_XYZ, GRID_GUESS
-+ BLOCK_DEFINITION is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
++ GRID_TYPE is one of the constants: GRID_1, GRID_X, GRID_XY, GRID_XYZ, GRID_GUESS
++ BLOCK_TYPE is one of the constants: BLOCK_1, BLOCK_X, BLOCK_XY, BLOCK_XYZ
 + parameters are the size of the grid and blocks following the GRID_DEFINITION and BLOCK_DEFINITION

 For example, if we need to work with 1024 threads with a grid of 2 x 16 and a block of 32, then we will write

-    KernelConfig kcfg(1024, GRID_XY, BLOCK_X, 2, 16, 32)
+    Kernel k(1024);
+    k.configure(GRID_XY, BLOCK_X, 2, 16, 32)

-If you need 1027 threads and want the KernelConfig class to determine the size of the grid for you, then use the following code:
+If you need 1027 threads and want the Kernel class to determine the size of the grid for you, then use the following code:

-    KernelConfig kcfg(1027, GRID_GUESS, BLOCK_X, 32)
+    Kernel k(1027);
+    k.configure(GRID_GUESS, BLOCK_X, 32)

 The grid will then be defined as type GRID_X with size 33.

 &lt;h2&gt;4. How to get global thread index inside the kernel&lt;/h2&gt;

-Once you have defined the size of grid and block you can call the kernel using one of the four macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;
+Once you have defined the size of grid and block you can call the kernel using one of the two macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;

-+ CUME_KERNEL_RUN_NR: call the kernel with No Resource
-+ CUME_KERNEL_RUN_NR_TIMER: call the kernel with No Resource and use a timer to display the execution time of the kernel on the device
-+ CUME_KERNEL_RUN_WR: call kernel With Resource
-+ CUME_KERNEL_RUN_WR_TIMER: call kernel With Resource and use a timer to display the execution time of the kernel on the device
++ kernel_call_no_resource: call the kernel with No Resource
++ kernel_call: call kernel With Resource

-The difference between NR (No Resource) and WR (With Resource) is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.
+The difference between No Resource and With Resource is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.

 Let's compare the two methods:

@@ -113,14 +117,14 @@
         }
     }

-    KernelConfig kcfg(SIZE);
-    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
-    CUME_KERNEL_RUN_NR(kernel_sum, kcfg, 
-        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size()););
+    Kernel k(SIZE);
+    k.configure(GRID_GUESS, BLOCK_X, 32);
+    kernel_call_no_resource(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());
+

 &lt;h5&gt;b) call with Resource&lt;/h5&gt;

-    __global__ void kernel_sum(KernelConfig::Resource *res, int *a, int *b, int *c, int size) {
+    __global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
         // **************************************************************
         // automatically get global thread index as a function of the kernel
         // type: no need to wonder which formula to use
@@ -132,11 +136,7 @@
         }
     }

-    KernelConfig kcfg(SIZE);
-    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
-    CUME_KERNEL_RUN_WR_TIMER(kernel_sum, kcfg, 
-        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size());
+    Kernel k(SIZE);
+    k.configure(GRID_GUESS, BLOCK_X, 32);
+    kernel_call(kernel_sum, k, &amp;amp;a, &amp;amp;b, &amp;amp;c, a.get_size());

-        
-
-    
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Tue, 10 Mar 2015 09:32:38 -0000</pubDate><guid>https://sourceforge.net04342114fb754ae49590374a54418f5c45051ac7</guid></item><item><title>tutorial modified by Jean-Michel Richer</title><link>https://sourceforge.net/p/cume/wiki/tutorial/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v8
+++ v9
@@ -66,7 +66,8 @@

 First you must define the size of the grid and block:

-    KernelConfig kcfg(REQUIRED_THREADS, GRID_DEFINITION, BLOCK_DEFINITION, parameters)
+    KernelConfig kcfg(REQUIRED_THREADS)
+    kcfg.setup(GRID_DEFINITION, BLOCK_DEFINITION, parameters)

 where:

@@ -88,4 +89,54 @@

 &lt;h2&gt;4. How to get global thread index inside the kernel&lt;/h2&gt;

-Once you have defined the 
+Once you have defined the size of grid and block you can call the kernel using one of the four macro instructions defined in &lt;code&gt;cume_kernel.h&lt;/code&gt;
+
++ CUME_KERNEL_RUN_NR: call the kernel with No Resource
++ CUME_KERNEL_RUN_NR_TIMER: call the kernel with No Resource and use a timer to display the execution time of the kernel on the device
++ CUME_KERNEL_RUN_WR: call kernel With Resource
++ CUME_KERNEL_RUN_WR_TIMER: call kernel With Resource and use a timer to display the execution time of the kernel on the device
+
+The difference between NR (No Resource) and WR (With Resource) is that a data structure called Resource will be passed as an argument of the kernel and the global thread index formula will be automatically obtained from the Resource.
+
+Let's compare the two methods:
+
+&lt;h5&gt;a) call with no Resource&lt;/h5&gt;
+
+    __global__ void kernel_sum(int *a, int *b, int *c, int size) {
+        // **************************************************************
+        // get global thread index with cume macro instruction
+        // **************************************************************
+        int gtid = cume_gtid_x_x();     
+        
+        if (gtid &amp;lt; size) {
+                c[gtid] = a[gtid] + b[gtid];
+        }
+    }
+
+    KernelConfig kcfg(SIZE);
+    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
+    CUME_KERNEL_RUN_NR(kernel_sum, kcfg, 
+        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size()););
+
+&lt;h5&gt;b) call with Resource&lt;/h5&gt;
+
+    __global__ void kernel_sum(KernelConfig::Resource *res, int *a, int *b, int *c, int size) {
+        // **************************************************************
+        // automatically get global thread index as a function of the kernel
+        // type: no need to wonder which formula to use
+        // **************************************************************
+        int gtid = res-&amp;gt;get_global_tid();
+        
+        if (gtid &amp;lt; size) {
+                c[gtid] = a[gtid] + b[gtid];
+        }
+    }
+
+    KernelConfig kcfg(SIZE);
+    kcfg.set_config(KernelConfig::GRID_GUESS, KernelConfig::BLOCK_X, 32);
+    CUME_KERNEL_RUN_WR_TIMER(kernel_sum, kcfg, 
+        a.get_daddr(), b.get_daddr(), c.get_daddr(), a.get_size());
+        
+        
+
+    
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Jean-Michel Richer</dc:creator><pubDate>Wed, 04 Mar 2015 11:29:08 -0000</pubDate><guid>https://sourceforge.netdf4f0c5dacdd60059d70fe2779f7afb0e7f49813</guid></item></channel></rss>