From: Philippe T. <phi...@gm...> - 2012-08-02 14:10:04
Hello everybody! I have (finally!) been able to make the multi-GPU code work in parallel on all devices. For now I achieve 313 GFLOPs on a GTX 470 + Tesla C2050 with a 9600x9600 matrix (each GPU performs ~200 GFLOPs individually).

Here are the main concurrency problems I ran into, so that everybody knows about them :)

OpenCL implementations usually don't like it when the same handle has to be dealt with in two different queues. In my case, clEnqueueNDRangeKernel did not return until the buffer became available for transfer (which is weird, since this function is supposed to return immediately!). It was therefore necessary to call the copy constructor of any matrix that had to be used by two different kernels.

Even then, both the original and the copy live on the same device: the one associated with the queue passed to clEnqueueCopyBuffer. So once all the copies are done, the handles still have to be migrated to the appropriate devices. NVIDIA is not OpenCL 1.2 compliant yet, so the new API call clEnqueueMigrateMemObjects is not available. For now a "trick" is necessary:

    float* tmp = new float[size];
    clEnqueueReadBuffer(...);  // read the buffer into tmp
    delete[] tmp;

This transfers the handles to the host, which is enough as long as the host does not have to execute a kernel itself (in that case everything ends up being serialized too :( ). The drawback is that a useless copy is performed when the handle already resides on the host. The tasks are not yet arranged so as to minimize the number of copies performed; doing so should increase performance.

I fear I have also caused memory leaks on the device, as the NVIDIA Visual Profiler won't run... I'll have to debug this :p

Best regards,
Philippe
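For reference, the buffer-duplication step could be sketched roughly as below. This is only an illustration under my assumptions, not the actual library code: the function name and the decision to block with clFinish are mine, and real code would of course share the context/queue setup with the rest of the application.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Duplicate 'src' so that a second queue can work on its own handle
 * instead of sharing one (which, as described above, can make
 * clEnqueueNDRangeKernel block). The copy is enqueued on 'queue', so
 * both buffers initially live on the device associated with that queue. */
cl_mem duplicate_buffer(cl_context ctx, cl_command_queue queue,
                        cl_mem src, size_t bytes)
{
    cl_int err;
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    err = clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes,
                              0, NULL, NULL);
    if (err != CL_SUCCESS) {
        clReleaseMemObject(dst);
        return NULL;
    }
    clFinish(queue); /* ensure the copy completed before another queue uses dst */
    return dst;
}
```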
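And the migration "trick" itself, spelled out as a complete function (again a sketch under my assumptions; the helper name is mine, and I use malloc/free rather than new/delete to keep it plain C):

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Workaround for the missing clEnqueueMigrateMemObjects on pre-1.2
 * implementations: a blocking read into a throwaway host allocation
 * forces the data off its current device, so a kernel enqueued on a
 * queue of another device can then pick the buffer up there. The read
 * data itself is discarded; only the side effect matters. */
cl_int migrate_via_host(cl_command_queue queue, cl_mem buf, size_t bytes)
{
    void *tmp = malloc(bytes);
    if (!tmp)
        return CL_OUT_OF_HOST_MEMORY;
    /* CL_TRUE = blocking read: returns only once the data is on the host */
    cl_int err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes,
                                     tmp, 0, NULL, NULL);
    free(tmp);
    return err;
}
```

Note the drawback mentioned above applies here too: if the buffer is already on the host, this still performs a useless device-to-host copy.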