From: Philippe T. <phi...@gm...> - 2012-08-20 12:37:37
Hello,

2012/8/20 Karl Rupp <ru...@iu...>:
> Hello,
>
>> For now, there is absolutely no data locality control, which means I
>> end up with relatively slow performance for now (250 GFLOPs on GTX 260
>> + Tesla C2050), which is still a bit faster than using only one GPU.
>> I'll now focus on improving data locality... If it is possible to
>> implement such a locality control, doing lazy transfers might just be
>> a pretty effective approach :)
>
> Considering the additional functionality in the OpenCL 1.2 standard, I
> think that this locality control is available (soon).
>
>> Has anybody done (or read in a paper) a comparison of the performance
>> of images and buffers?
>
> There is no general recommendation possible, because the different SDKs
> have (you guessed it) different behavior. As there is currently no
> support for the image type, I would refrain from using images right
> now. The matrix-matrix multiplication kernels for buffers are pretty
> efficient right now, so I don't think it makes sense to spend many
> hours on images just to get a 10 percent performance increase.
>
>> Also, now that I have written a fully multi-threaded program, I have
>> started to see some little weaknesses in the backend's design when it
>> comes to multi-threading.
>> I think it would be a better idea to add a device argument (or even a
>> queue argument, but that may be a bit overkill) to all the functions
>> that enqueue something. For example:
>>
>>   mutex_.lock();
>>   viennacl::ocl::switch_device( device_of_my_thread );
>>   prod(A, B, C);
>>   mutex_.unlock();
>>
>> does not really have the expected behavior, as prod(A, B, C) does not
>> always return immediately. Indeed, it only returns once the data have
>> been migrated to the device (yes, clEnqueueNDRangeKernel does that...).
>> In my very specific case, I bypassed this issue by doing something
>> like:
>>
>>   transfer_memory(device_of_my_thread, A);
>>   transfer_memory(device_of_my_thread, B);
>>   transfer_memory(device_of_my_thread, C);
>>   mutex_.lock();
>>   viennacl::ocl::switch_device( device_of_my_thread );
>>   prod(A, B, C);
>>   mutex_.unlock();
>>
>> By doing the transfers outside the mutex, the enqueueing returns
>> immediately and the other threads do not have to wait forever. This is
>> also a possible cause of why I had so much trouble before, but I don't
>> feel like git-reverting to double-check :p
>
> As the memory transfers prior to the computations seem to be
> reasonable, the second option is more appealing in my view.
>
>> Plus, the standard now guarantees that all the API functions are
>> thread-safe, except for clSetKernelArg when called on the same kernel
>> object by two different threads. But from what I have seen, ViennaCL
>> uses different kernel objects, so there shouldn't be any problem :p
>> Adding an argument to almost all the existing functions is probably a
>> bit harsh, but do you think it is necessary to solve this
>> multi-threading issue?
>
> The problem seems to be the convenience layer in the OpenCL backend. I
> don't see a simple way to extend the API suitably. A nice-to-have is
> something like
>
>   exec_on_device( A = prod(B, C), device_id );
>
> but in order to still provide the simple
>
>   A = prod(B, C);
>
> one would have to execute all operations in the destructor of some
> temporary proxy object generated by operator=(). However, throwing an
> exception in a destructor is a no-go...
>
> The other option is to implicitly bind the computation to the device
> the lvalue resides on. Thus, the computation of A = prod(B, C) would be
> handled on the device where A is currently located. However, as OpenCL
> might move memory buffers around, I don't think this is the cleanest
> solution either...

Oh, yes! I like (love!)
the last option :) It should be possible to have one context per device
(now that the actual data is stored in RAM, it doesn't matter :p). It
should then be possible in viennacl::ocl::enqueue(KernelType const & k)
to do something like:

  assert(all the handles are on the same device);
  clEnqueueNDRangeKernel( viennacl::ocl::get_queue( handle.device() ).handle().get(),
                          blablabla... );

It is rather clean and only requires a few lines of code to be changed
in the current codebase.

> Best regards,
> Karli

Best regards,
Phil