From: Philippe Tillet <phil.tillet@gm...>  2012-11-15 13:11:56

Hi Karl,

Here is the minimal code reproducing the bug:

#define VIENNACL_WITH_OPENCL

#include <iostream>
#include "CL/cl.hpp"
#include "viennacl/ocl/utils.hpp"
#include "viennacl/vector.hpp"
#include "viennacl/linalg/inner_prod.hpp"

int main(){
  unsigned int size = 1000;
  std::vector<float> v1(size);
  std::vector<float> v2(size);
  viennacl::vector<float> vcl_v1(size);
  viennacl::vector<float> vcl_v2(size);
  for (std::size_t i=0; i<vcl_v1.size(); ++i) {
    v1[i] = float(i);
    v2[i] = float(i);
  }
  viennacl::copy(v1,vcl_v1);
  viennacl::copy(v2,vcl_v2);
  viennacl::scalar<float> res;
  res = viennacl::linalg::inner_prod(vcl_v1,vcl_v2);
  viennacl::ocl::get_queue().finish();
  return 0;
}

Also, I have benchmarked the inner product as it is for size 1,000,000 against the inner product of the generator (it is based on ViennaCL's inner product, but the summation step is slightly different) and saw that ViennaCL's implementation was about 10 times slower at this size. If I have time I will try to see where this comes from. My current guess is the sum kernel. There is a line, "if(option>0) ... else ...", which is strongly discouraged in the OpenCL guides I have read; quoting from the AMD guide (below). I'll double-check my toy benchmark to be sure, though.

==============
Use predication rather than control flow. Predication allows the GPU to execute both paths of execution in parallel, which can be faster than attempting to minimize the work through clever control flow. The reason for this is that if no memory operation exists in a ?: operator (also called a ternary operator), this operation is translated into a single cmov_logical instruction, which is executed in a single cycle. An example of this is:

if (A > B) {
  C += D;
} else {
  C -= D;
}

Replace this with:

int factor = (A > B) ? 1 : -1;
C += factor * D;

In the first block of code, this translates into an IF/ELSE/ENDIF sequence of conditional code, each taking ~8 cycles.
If divergent, this code executes in ~36 clocks; otherwise, in ~28 clocks. A branch not taken costs four cycles (one instruction slot); a branch taken adds four slots of latency to fetch instructions from the instruction cache, for a total of 16 clocks. Since the execution mask is saved, then modified, then restored for the branch, ~12 clocks are added when divergent, ~8 clocks when not. In the second block of code, the ?: operator executes in the vector units, so no extra CF instructions are generated. Since the instructions are sequentially dependent, this block of code executes in 12 cycles, for a 1.3x speed improvement. To see this, the first cycle is the (A > B) comparison, the result of which is input to the second cycle, which is the cmov_logical factor, bool, 1, -1. The final cycle is a MAD instruction: mad C, factor, D, C. If the ratio between conditional code and ALU instructions is low, this is a good pattern to remove the control flow.
====================

Basically, option==0 ? a : b would probably offer better performance.

Regards,
Philippe

2012/11/15 Karl Rupp <rupp@...>

> Hi,
>
> thanks, Philippe. Any hints on how to reproduce this? inner_prod is now
> split into a final reduction on GPU and CPU, so maybe one of them is buggy.
>
> Also, since it only affects the developer version, please use the
> developer list in order not to confuse users of the stable versions.
>
> Thanks and best regards,
> Karli
>
> On 11/15/2012 06:27 AM, Philippe Tillet wrote:
>> Hello,
>>
>> While doing some tests I came across a bug in the sourceforge repository.
>> When defining #define VIENNACL_USE_OPENCL, the inner product returns an
>> ocl::invalid_arg_size exception!
>>
>> No idea where this is coming from :p
From: Karl Rupp <rupp@iu...>  2012-11-15 15:01:36

Hi,

I've checked the code and fortunately the bug does not show up in my latest branch (I'll soon push that to sourceforge).

As for the performance regression: I've compared my current branch against the generator (vector size 10^6, 10 runs):

* Single precision:
  GTX 285 inner_prod() time: 0.001483
  GTX 285 generator time:    0.001473
  HD 7970 inner_prod() time: 0.010
  HD 7970 generator time:    0.010

* Double precision:
  GTX 285 inner_prod() time: 0.002252
  GTX 285 generator time:    0.002237
  HD 7970 inner_prod() time: 0.013855
  HD 7970 generator time:    0.013184

Thus, the difference is within the accuracy of the timer in all four cases. It is, however, interesting to observe that the HD 7970 shows rather poor performance despite its higher memory bandwidth. I think I know how to fix this.

> My current guess is the sum kernel. There is a line, "if(option>0)
> ... else ...", which is strongly discouraged in the OpenCL guides I
> have read; quoting from the AMD guide (below).
> I'll double-check my toy benchmark to be sure, though.
>
> ==============
> (...)
> ====================
>
> Basically, option==0 ? a : b would probably offer better performance.

The guide is certainly right in terms of raw cycles. However, the sum kernel is HEAVILY dominated by kernel launch overhead anyway, as it just sums ~128 entries. Also, the if ... else ... construct does not really matter for any of the vector kernels: first, there is no thread divergence, and second, this is only a startup dispatch.
(Yes, I have benchmarks confirming that claim ;) )

The reason for the option-thing is that basically all of the following operations are dealt with in the same kernel:

v1  = v2 * a + v3 * b
v1  = v2 / a + v3 * b
v1  = v2 * a + v3 / b
v1  = v2 / a + v3 / b
v1 -= v2 * a + v3 * b
v1 -= v2 / a + v3 * b
v1 -= v2 * a + v3 / b
v1 -= v2 / a + v3 / b
v1  = v2 * a - v3 * b
v1  = v2 / a - v3 * b
v1  = v2 * a - v3 / b
v1  = v2 / a - v3 / b
v1 -= v2 * a - v3 * b
v1 -= v2 / a - v3 * b
v1 -= v2 * a - v3 / b
v1 -= v2 / a - v3 / b

In addition, one needs to distinguish between a and b being CPU or GPU scalars, and eventually take an additional (-a) and (-b) coming from the GPU into account. In a naive approach one would end up with at least 64 kernels, while the option-thing allows the use of only four kernels.

Best regards,
Karli