From: Karl R. <ru...@iu...> - 2012-08-05 16:51:48
Hi,

> I have (finally!) been able to make the multi-GPU thing work in
> parallel on all devices. I achieve for now 313 GFlops on GTX470 + Tesla
> C2050 on a 9600*9600 matrix (each GPU performs ~200 GFlops individually).

Very nice! :-)

> The OpenCL implementations usually don't like it when the same handle
> has to be dealt with in two different queues. In my case,
> clEnqueueNDRangeKernel didn't return until the buffer became available
> for transfer (which is weird, as this function is somewhat guaranteed
> to return immediately!). Therefore, it was necessary to call the copy
> constructor of the matrices which had to be used in two different
> kernels.

Does the problem only show up with two queues on two different devices, or is the same true with two queues on one device?

> Even doing that, both the original and the copy matrix live on the same
> device: the one associated with the queue used in clEnqueueCopyBuffer.
> Therefore, once all the copies have been done, it is necessary to
> migrate all the handles to the appropriate device. For now, NVidia is
> not OpenCL 1.2 compatible, so it's not possible to use the new API call
> clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick":
>
>   float* tmp = new float[size];
>   clEnqueueReadBuffer(...);  // read into tmp
>   delete[] tmp;
>
> This transfers the handles to the host, which is enough as long as the
> host does not have to execute a kernel (in that case everything ends up
> being serialized too :( )

Since OpenCL 1.2 'solves' the problem and we have a workaround for OpenCL 1.1, let's just accept the current situation. It might even be the case that clEnqueueCopyBuffer() internally does the same thing: copy data between GPUs via CPU RAM.

One more remark on the workaround code: you probably want to use a std::vector<> in order to avoid issues with deallocation.
The raw pointer can be extracted via &(tmp[0]).

> For now, the tasks are not arranged so as to minimize the number of
> copies performed; doing so should increase performance.

The typical assumption in the benchmarks reported by various institutions is that the data is already set up on the devices accordingly. 313 GFLOPs (general note: this counts an fmad as one operation, so the marketing value would be 626 GFLOPs) is pretty good already.

> I fear I have also caused memory leaks on the device, as
> NVidiaVisualProfiler won't work... I'll have to debug this :p

No, not necessarily. I've encountered similar problems with the installation on the testing machine, yet I could reproduce the issue with a rather simple, clean OpenCL code. If I remember correctly, the issue was related to some static handles and their point of destruction by the compiler. So it's not necessarily your fault ;-)

Best regards,
Karli