From: Philippe T. <phi...@gm...> - 2012-08-02 14:10:04
Hello everybody! I have (finally!) been able to make the multi-GPU code work in parallel on all devices. For now I achieve 313 GFLOPs on a GTX 470 + Tesla C2050 with a 9600x9600 matrix (each GPU performs ~200 GFLOPs individually).

Here are the main concurrency problems I ran into, so that everybody knows about them :)

OpenCL implementations usually don't like it when the same handle has to be dealt with in two different queues. In my case, clEnqueueNDRangeKernel did not return until the buffer became available for transfer (which is weird, since this function is supposed to return immediately!). It was therefore necessary to call the copy constructor of any matrix that had to be used by two different kernels.

Even then, both the original and the copy live on the same device: the one associated with the queue passed to clEnqueueCopyBuffer. So once all the copies are done, the handles still have to be migrated to the appropriate devices. NVIDIA is not OpenCL 1.2 compliant yet, so the new API call clEnqueueMigrateMemObjects is not available. For now a "trick" is necessary:

    float* tmp = new float[size];
    clEnqueueReadBuffer(...);  // read the buffer into tmp
    delete[] tmp;

This transfers the handles to the host, which is enough as long as the host does not have to execute a kernel itself (in that case everything ends up being serialized too :( ). The drawback is that a useless copy is performed when the handle already resides on the host. The tasks are not yet arranged so as to minimize the number of copies performed; doing so should increase performance.

I fear I have also caused memory leaks on the device, as the NVIDIA Visual Profiler won't run... I'll have to debug this :p

Best regards,
Philippe
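For reference, the buffer-duplication step could be sketched roughly as below. This is only an illustration under my assumptions, not the actual library code: the function name and the decision to block with clFinish are mine, and real code would of course share the context/queue setup with the rest of the application.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Duplicate 'src' so that a second queue can work on its own handle
 * instead of sharing one (which, as described above, can make
 * clEnqueueNDRangeKernel block). The copy is enqueued on 'queue', so
 * both buffers initially live on the device associated with that queue. */
cl_mem duplicate_buffer(cl_context ctx, cl_command_queue queue,
                        cl_mem src, size_t bytes)
{
    cl_int err;
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    err = clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes,
                              0, NULL, NULL);
    if (err != CL_SUCCESS) {
        clReleaseMemObject(dst);
        return NULL;
    }
    clFinish(queue); /* ensure the copy completed before another queue uses dst */
    return dst;
}
```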
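And the migration "trick" itself, spelled out as a complete function (again a sketch under my assumptions; the helper name is mine, and I use malloc/free rather than new/delete to keep it plain C):

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Workaround for the missing clEnqueueMigrateMemObjects on pre-1.2
 * implementations: a blocking read into a throwaway host allocation
 * forces the data off its current device, so a kernel enqueued on a
 * queue of another device can then pick the buffer up there. The read
 * data itself is discarded; only the side effect matters. */
cl_int migrate_via_host(cl_command_queue queue, cl_mem buf, size_t bytes)
{
    void *tmp = malloc(bytes);
    if (!tmp)
        return CL_OUT_OF_HOST_MEMORY;
    /* CL_TRUE = blocking read: returns only once the data is on the host */
    cl_int err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes,
                                     tmp, 0, NULL, NULL);
    free(tmp);
    return err;
}
```

Note the drawback mentioned above applies here too: if the buffer is already on the host, this still performs a useless device-to-host copy.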