From: Karl R. <ru...@iu...> - 2012-11-15 15:01:36
Hi,

I've checked the code, and fortunately the bug does not show up in my latest branch (I'll soon push that to SourceForge). As for the performance regression, I've compared my current branch against the generator (vector size 10^6, 10 runs):

* Single precision:
    GTX 285 inner_prod() time: 0.001483
    GTX 285 generator time:    0.001473
    HD 7970 inner_prod() time: 0.010
    HD 7970 generator time:    0.010

* Double precision:
    GTX 285 inner_prod() time: 0.002252
    GTX 285 generator time:    0.002237
    HD 7970 inner_prod() time: 0.013855
    HD 7970 generator time:    0.013184

Thus, the difference is within the accuracy of the timer in all four cases. It is, however, interesting to observe that the HD 7970 shows rather poor performance despite its higher memory bandwidth - I think I know how to fix this.

> My current guess is from the sum kernel. There is a line, "if(option>0)
> ... else ...", which is greatly discouraged in the OpenCL guides I have
> read, quoted from the AMD guide (quoted below).
> I'll double-check my toy benchmark to be sure, though.
>
> ==============
> (...)
> ====================
>
> Basically, option==0?a:b would probably offer better performance.

The guide is certainly right in terms of raw cycles. However, the sum kernel is HEAVILY dominated by kernel launch overhead anyway, as it just sums ~128 entries. Also, the if ... else ... construct does not really matter for any of the vector kernels: first, there is no thread divergence, and second, this is only a startup dispatch.
(Yes, I have benchmarks confirming that claim ;-) )

The reason for the option argument is that basically all of the following operations are handled by the same kernel:

v1 =  v2 * a + v3 * b
v1 =  v2 / a + v3 * b
v1 =  v2 * a + v3 / b
v1 =  v2 / a + v3 / b
v1 = -v2 * a + v3 * b
v1 = -v2 / a + v3 * b
v1 = -v2 * a + v3 / b
v1 = -v2 / a + v3 / b
v1 =  v2 * a - v3 * b
v1 =  v2 / a - v3 * b
v1 =  v2 * a - v3 / b
v1 =  v2 / a - v3 / b
v1 = -v2 * a - v3 * b
v1 = -v2 / a - v3 * b
v1 = -v2 * a - v3 / b
v1 = -v2 / a - v3 / b

In addition, one needs to distinguish between a and b being CPU or GPU scalars, and eventually take an additional (-a) and (-b) coming from the GPU into account. A naive approach would end up with at least 64 kernels, while the option argument allows using only four.

Best regards,
Karli