From: Karl R. <ru...@iu...> - 2012-11-15 15:01:36
Hi,

I've checked the code, and fortunately the bug does not show up in my latest branch (I'll soon push that to SourceForge). As for the performance regression, I've compared my current branch against the generator (vector size 10^6, 10 runs):

* Single precision:
    GTX 285 inner_prod() time: 0.001483
    GTX 285 generator time:    0.001473
    HD 7970 inner_prod() time: 0.010
    HD 7970 generator time:    0.010

* Double precision:
    GTX 285 inner_prod() time: 0.002252
    GTX 285 generator time:    0.002237
    HD 7970 inner_prod() time: 0.013855
    HD 7970 generator time:    0.013184

Thus, the difference is within the accuracy of the timer in all four cases. It is, however, interesting to observe that the HD 7970 shows rather poor performance despite its higher memory bandwidth - I think I know how to fix this.

> My current guess is from the sum kernel. There is a line, "if(option>0)
> ... else ...", which is greatly discouraged in the OpenCL guides I have
> read, quoted from the AMD guide (quoted below).
> I'll double-check my toy benchmark to be sure, though.
>
> ==============
> (...)
> ====================
>
> Basically, option==0?a:b would probably offer better performance.

The guide is certainly right in terms of raw cycles. However, the sum kernel is HEAVILY dominated by kernel launch overhead anyway, as it just sums ~128 entries. Also, the if ... else ... construct does not really matter for any of the vector kernels: first, there is no thread divergence, and second, this is only a startup dispatch.
(Yes, I have benchmarks confirming that claim ;-) )

The reason for the option argument is that basically all of the following operations are handled by the same kernel:

v1 =  v2 * a + v3 * b
v1 =  v2 / a + v3 * b
v1 =  v2 * a + v3 / b
v1 =  v2 / a + v3 / b
v1 = -v2 * a + v3 * b
v1 = -v2 / a + v3 * b
v1 = -v2 * a + v3 / b
v1 = -v2 / a + v3 / b
v1 =  v2 * a - v3 * b
v1 =  v2 / a - v3 * b
v1 =  v2 * a - v3 / b
v1 =  v2 / a - v3 / b
v1 = -v2 * a - v3 * b
v1 = -v2 / a - v3 * b
v1 = -v2 * a - v3 / b
v1 = -v2 / a - v3 / b

In addition, one needs to distinguish between a and b being CPU or GPU scalars, and eventually take an additional (-a) and (-b) coming from the GPU into account. A naive approach would end up with at least 64 kernels, while the option argument allows using only four.

Best regards,
Karli