From: Karl R. <ru...@iu...> - 2012-08-05 16:51:48
Hi,

> I have (finally!) been able to make the multi-GPU thing work in
> parallel on all devices. I achieve for now 313 GFlops on GTX470 + Tesla
> C2050 on a 9600*9600 matrix (each GPU performs ~200 GFlops individually).

Very nice! :-)

> The OpenCL implementations usually don't like it when the same handle
> has to be dealt with in two different queues. In my case,
> clEnqueueNDRangeKernel didn't return until the buffer became available
> for transfer (which is weird, as this function is somewhat guaranteed
> to return immediately!). Therefore, it was necessary to call the copy
> constructor of the matrices which had to be used in two different
> kernels.

Does the problem only show up with two queues on two different devices, or is the same true with two queues on one device?

> Even doing that, both the original and the copy matrix live on the same
> device: the one associated with the queue used in clEnqueueCopyBuffer.
> Therefore, once all the copies have been done, it is necessary to
> migrate all the handles to the appropriate device. For now, NVidia is
> not OpenCL 1.2 compatible, so it's not possible to use the new API call
> clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick":
>
>   float* tmp = new float[size];
>   clEnqueueReadBuffer(...);  // read into tmp
>   delete[] tmp;
>
> This transfers the handles to the host, which is enough as long as the
> host does not have to execute a kernel (in that case everything ends up
> being serialized too :( )

Since OpenCL 1.2 'solves' the problem and we have a workaround for OpenCL 1.1, let's just accept the current situation. It might even be the case that clEnqueueCopyBuffer() internally does the same thing: copy data between GPUs via CPU RAM.

One more remark on the workaround code: you probably want to use a std::vector<> in order to avoid issues with deallocation.
The raw pointer can be extracted via &(tmp[0]).

> For now, the tasks are not arranged so as to minimize the number of
> copies performed; doing so should increase performance.

The typical assumption in the benchmarks reported by various institutions is that the data is already set up on the devices accordingly. 313 GFLOPs (general note: this counts an fmad as one operation, so the marketing value would be 626 GFLOPs) is pretty good already.

> I fear I have also caused memory leaks on the device, as
> NVidiaVisualProfiler won't work... I'll have to debug this :p

No, not necessarily. I've encountered similar problems with the installation on the testing machine, yet I could reproduce the issue with a rather simple, clean OpenCL code. If I remember correctly, the issue was related to some static handles and their point of destruction by the compiler. So it's not necessarily your fault ;-)

Best regards,
Karli