From: Karl R. <ru...@iu...> - 2012-08-20 11:59:44
Hello,

> For now, there is absolutely no data locality control, which means I end
> up with relatively slow performance for now (250 GFLOPs on GTX260 +
> Tesla C2050), which is still a bit faster than using only 1 GPU. I'll
> now focus on improving data locality... If it is possible to implement
> such a locality control, doing lazy transfers might just be a pretty
> effective approach :)

Considering the additional functionality in the OpenCL 1.2 standard, I think that this locality control is (or soon will be) available.

> Has anybody done (or read in a paper) a comparison of the performance
> of images and buffers?

There is no general recommendation possible, because the different SDKs show (you guessed it) different behavior. As there is currently no support for the image type, I would refrain from using images right now. The matrix-matrix multiplication kernels for buffers are already quite efficient, so I don't think it makes sense to spend many hours playing with images just to get a 10 percent performance increase.

> Also, now that I have done a fully multi-threaded program, I have
> started to see some little weaknesses in the backend's design when it
> comes to multi-threading.
> I think it would be a better idea to add a device argument (or even a
> queue argument, but that may be a bit overkill) to all the functions
> that enqueue something. For example:
>
>   mutex_.lock();
>   viennacl::ocl::switch_device( device_of_my_thread );
>   prod(A,B,C);
>   mutex_.unlock();
>
> does not really have the expected behavior, as prod(A,B,C) does not
> always return immediately. Indeed, it only returns once the data have
> been migrated to the device. (Yes, clEnqueueNDRangeKernel does that...)
> In my very specific case, I bypassed this issue by doing something like
>
>   transfer_memory(device_of_my_thread,A);
>   transfer_memory(device_of_my_thread,B);
>   transfer_memory(device_of_my_thread,C);
>   mutex_.lock();
>   viennacl::ocl::switch_device( device_of_my_thread );
>   prod(A,B,C);
>   mutex_.unlock();
>
> By doing the transfers outside the mutex, the enqueueing returns
> immediately and the other threads do not have to wait forever. This is
> also a possible cause of why I had so much trouble before, but I don't
> feel like git-reverting for double-checking :p

As the memory transfers prior to the computations seem reasonable, the second option is the more appealing one in my view.

> Plus, the standard now guarantees that all API functions are
> thread-safe, except for clSetKernelArg when called on the same kernel
> object by two different threads. But from what I have seen, ViennaCL
> uses different kernel objects, so there shouldn't be any problem :p
> Maybe adding an argument to almost all the existing functions is
> probably a bit harsh, but do you think it is necessary to solve that
> multi-threading issue?

The problem seems to be the convenience layer in the OpenCL backend. I don't see a simple way to extend the API suitably. A nice-to-have would be something like

  exec_on_device( A = prod(B, C), device_id );

but in order to still provide the simple

  A = prod(B, C);

one would have to trigger all operations in the destructor of some temporary proxy object generated by operator=(). However, throwing an exception in a destructor is a no-go...

The other option is to implicitly bind the computation to the device the lvalue resides on. Thus, the computation of A = prod(B, C) would be handled on the device where A is currently located. However, as OpenCL might move memory buffers around, I don't think this is the cleanest solution either...

Best regards,
Karli