From: Philippe T. <phi...@gm...> - 2012-08-20 12:37:37
Hello,

2012/8/20 Karl Rupp <ru...@iu...>:
> Hello,
>
>> For now, there is absolutely no data locality control, which means I
>> end up with relatively slow performance for now (250 GFLOPs on GTX 260
>> + Tesla C2050), which is still a bit faster than using only one GPU.
>> I'll now focus on improving data locality... If it is possible to
>> implement such a locality control, doing lazy transfers might just be
>> a pretty effective approach :)
>
> Considering the additional functionality in the OpenCL 1.2 standard, I
> think that this locality control is available (soon).
>
>> Has anybody done (or read in a paper) a comparison of the performance
>> of images and buffers?
>
> There is no general recommendation possible, because the different SDKs
> have (you guessed it) different behavior. As there is currently no
> support for the image type, I would refrain from using images right
> now. The matrix-matrix multiplication kernels for buffers are pretty
> efficient right now, so I don't think it makes sense to spend many
> hours on images just to get a 10 percent performance increase.
>
>> Also, now that I have written a fully multi-threaded program, I have
>> started to see some little weaknesses in the backend's design when it
>> comes to multi-threading.
>> I think it would be a better idea to add a device argument (or even a
>> queue argument, but that may be a bit overkill) to all the functions
>> that enqueue something. For example:
>>
>>   mutex_.lock();
>>   viennacl::ocl::switch_device( device_of_my_thread );
>>   prod(A, B, C);
>>   mutex_.unlock();
>>
>> does not really have the expected behavior, as prod(A, B, C) does not
>> always return immediately. Indeed, it only returns once the data have
>> been migrated to the device (yes, clEnqueueNDRangeKernel does that...).
>> In my very specific case, I bypassed this issue by doing something
>> like:
>>
>>   transfer_memory(device_of_my_thread, A);
>>   transfer_memory(device_of_my_thread, B);
>>   transfer_memory(device_of_my_thread, C);
>>   mutex_.lock();
>>   viennacl::ocl::switch_device( device_of_my_thread );
>>   prod(A, B, C);
>>   mutex_.unlock();
>>
>> By doing the transfers outside the mutex, the enqueueing returns
>> immediately and the other threads do not have to wait forever. This is
>> also a possible cause of why I had so much trouble before, but I don't
>> feel like git-reverting to double-check :p
>
> As the memory transfers prior to the computations seem to be
> reasonable, the second option is more appealing in my view.
>
>> Plus, the standard now guarantees that all the API functions are
>> thread-safe, except for clSetKernelArg when called on the same kernel
>> object by two different threads. But from what I have seen, ViennaCL
>> uses different kernel objects, so there shouldn't be any problem :p
>> Adding an argument to almost all the existing functions is probably a
>> bit harsh, but do you think it is necessary to solve this
>> multi-threading issue?
>
> The problem seems to be the convenience layer in the OpenCL backend. I
> don't see a simple way to extend the API suitably. A nice-to-have is
> something like
>
>   exec_on_device( A = prod(B, C), device_id );
>
> but in order to still provide the simple
>
>   A = prod(B, C);
>
> one would have to execute all operations in the destructor of some
> temporary proxy object generated by operator=(). However, throwing an
> exception in a destructor is a no-go...
>
> The other option is to implicitly bind the computation to the device
> the lvalue resides on. Thus, the computation of A = prod(B, C) would be
> handled on the device where A is currently located. However, as OpenCL
> might move memory buffers around, I don't think this is the cleanest
> solution either...

Oh, yes! I like (love!)
the last option :) It should be possible to have one context per device
(now that the actual data is stored in RAM, it doesn't matter :p). It
should then be possible in viennacl::ocl::enqueue(KernelType const & k)
to do something like:

  assert(all the handles are on the same device);
  clEnqueueNDRangeKernel( viennacl::ocl::get_queue( handle.device() ).handle().get(),
                          blablabla... );

It is rather clean and only requires a few lines of code to be changed
in the current codebase.

> Best regards,
> Karli

Best regards,
Phil