From: Philippe T. <phi...@gm...> - 2012-08-05 23:51:11
2012/8/5 Karl Rupp <ru...@iu...>:

>> I have (finally!) been able to make the multi-GPU thing work in
>> parallel on all devices. I achieve for now 313 GFLOPs on a GTX 470 +
>> Tesla C2050 on a 9600x9600 matrix (each GPU performs ~200 GFLOPs
>> individually).
>
> Very nice! :-)
>
>> The OpenCL implementations usually don't like it when the same handle
>> has to be dealt with in two different queues. In my case,
>> clEnqueueNDRangeKernel didn't return until the buffer became available
>> for transfer (which is weird, as this function is supposed to return
>> immediately!). Therefore, it was necessary to call the copy constructor
>> of the matrices which had to be used in two different kernels.
>
> Does the problem only show up with two queues on two different devices,
> or is the same true with two queues on one device?

I don't really know, I have not been able to test two queues on one
device :p Why?

>> Even doing that, both the original and the copy live on the same
>> device: the one associated with the queue used in clEnqueueCopyBuffer.
>> Therefore, once all the copies have been done, it is necessary to
>> migrate all the handles to the appropriate device. For now, NVIDIA is
>> not OpenCL 1.2 compatible, so it's not possible to use the new API call
>> clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick":
>>
>>   float* tmp = new float[size];
>>   clEnqueueReadBuffer(...);  // read the buffer into tmp
>>   delete[] tmp;
>>
>> This transfers the handles to the host, which is enough as long as the
>> host does not have to execute a kernel (in that case everything ends up
>> being serialized too :( )
>
> Since OpenCL 1.2 'solves' the problem and we have a workaround for
> OpenCL 1.1, let's just accept the current situation. It might even be
> the case that clEnqueueCopyBuffer() internally does the same thing:
> copy data between GPUs via CPU RAM.
> One more remark on the 'workaround code': you better use a
> std::vector<> in order to avoid issues with deallocation. The raw
> pointer can be extracted via &(tmp[0]).
>
>> For now, the tasks are not arranged so as to minimize the number of
>> copies performed; doing so should increase performance.
>
> The typical assumption in the benchmarks reported by various
> institutions is that the data is already set up on the devices
> accordingly. 313 GFLOPs (general note: this is counting fmad as one
> operation, so the marketing value would be 626 GFLOPs) is pretty good
> already.

Yay, marketing value! :D Actually, this was kind of just a warm-up. I
realized my implementation was not really doing any dynamic scheduling
and was only good when the two devices had the same computing power...
I'll have to give my threading model another look! I refuse to fall
back on static scheduling; it really does not seem flexible enough to
me (plus AMD advises using a dynamic scheduler!). Some GFLOPs can
still be gained, I think :p It should be possible to reach something
like 350, and better overall performance when the devices are not
equally powerful (which can happen even when the two devices are the
same model, as one of them might already be busy with something else).
I also want to make the scheduler out-of-order and able to resolve the
dependencies on its own by considering the equality of the input/output
handles... kind of like a parallel compiler ;) I definitely don't feel
like writing the dependency tree for every existing operation by hand.
There was a talk at HPC2012 by Marc Baboulin on using a GPU and a CPU
concurrently for several tasks. I am also thinking about associating a
cl_device_type with each task. Maybe, then, it would in the future be
possible to get more benefit from heterogeneous computing on the AMD
platform...
Also, modern GPUs are increasingly able to execute multiple kernels
concurrently (HD 7950, the latest NVIDIA generation), so my design has
to take into account the fact that I might even want to enqueue
multiple kernels before flushing in the future (for now, we do not have
the hardware to test it anyway :D). Still a lot of things to do, but
the first is probably to debug the scheduler and make it more dynamic.
I'm going on holiday this week, so it will be hard for me to work on
that =)

>> I fear I have also caused memory leaks on the device, as the NVIDIA
>> Visual Profiler won't work... I'll have to debug this :p
>
> No, not necessarily. I've encountered similar problems with the
> installation on the testing machine, yet I could reproduce the issue
> with a rather simple clean OpenCL code. If I remember correctly, the
> issue was related to some static handles and their point of
> destruction by the compiler. So, it's not necessarily your fault ;-)

Hmmm, maybe :p I'll try to give it a deeper look once everything works
better.

> Best regards,
> Karli

Best regards,
Philippe