From: Philippe T. <phi...@gm...> - 2012-08-05 23:51:11
2012/8/5 Karl Rupp <ru...@iu...>:

>> I have (finally!) been able to make the multi-GPU thing work in
>> parallel on all devices. I achieve for now 313 GFLOPs on a GTX 470 +
>> Tesla C2050 on a 9600x9600 matrix (each GPU performs ~200 GFLOPs
>> individually).
>
> Very nice! :-)
>
>> The OpenCL implementations usually don't like it when the same handle
>> has to be dealt with in two different queues. In my case,
>> clEnqueueNDRangeKernel didn't return until the buffer became available
>> for transfer (which is weird, as this function is supposed to return
>> immediately!). Therefore, it was necessary to call the copy constructor
>> of the matrices which had to be used in two different kernels.
>
> Does the problem only show up with two queues on two different devices,
> or is the same true with two queues on one device?

I don't really know, I have not been able to test two queues on one
device :p Why?

>> Even doing that, both the original and the copy live on the same
>> device: the one associated with the queue used in clEnqueueCopyBuffer.
>> Therefore, once all the copies have been done, it is necessary to
>> migrate all the handles to the appropriate device. For now, NVIDIA is
>> not OpenCL 1.2 compatible, so it's not possible to use the new API call
>> clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick":
>>
>>   float* tmp = new float[size];
>>   clEnqueueReadBuffer(...);  // read the buffer into tmp
>>   delete[] tmp;
>>
>> This transfers the handles to the host, which is enough as long as the
>> host does not have to execute a kernel (in that case everything ends up
>> being serialized too :( )
>
> Since OpenCL 1.2 'solves' the problem and we have a workaround for
> OpenCL 1.1, let's just accept the current situation. It might even be
> the case that clEnqueueCopyBuffer() internally does the same thing:
> copy data between GPUs via CPU RAM.
> One more remark on the 'workaround code': you better use a
> std::vector<> in order to avoid issues with deallocation. The raw
> pointer can be extracted via &(tmp[0]).
>
>> For now, the tasks are not arranged so as to minimize the number of
>> copies performed; doing so should increase performance.
>
> The typical assumption in the benchmarks reported by various
> institutions is that the data is already set up on the devices
> accordingly. 313 GFLOPs (general note: this is counting fmad as one
> operation, so the marketing value would be 626 GFLOPs) is pretty good
> already.

Yay, marketing value! :D Actually, this was kind of just a warm-up. I
realized my implementation was not really doing any dynamic scheduling
and was only good when the two devices had the same computing power...
I'll have to give my threading model another look! I refuse to fall
back on static scheduling; it really does not seem flexible enough to
me (plus AMD advises using a dynamic scheduler!). Some GFLOPs can
still be gained, I think :p It should be possible to reach something
like 350, and better overall performance when the devices are not
equally powerful (which can happen even when the two devices are the
same model, as one of them might already be busy with something else).
I also want to make the scheduler out-of-order and able to resolve the
dependencies on its own by considering the equality of the input/output
handles... kind of like a parallel compiler ;) I definitely don't feel
like writing the dependency tree for every existing operation by hand.
There was a talk at HPC2012 by Marc Baboulin on using a GPU and a CPU
concurrently for several tasks. I am also thinking about associating a
cl_device_type with each task. Maybe, then, it would in the future be
possible to get more benefit from heterogeneous computing on the AMD
platform...
Also, modern GPUs are increasingly able to execute multiple kernels
concurrently (HD 7950, the latest NVIDIA generation), so my design has
to take into account the fact that I might even want to enqueue
multiple kernels before flushing in the future (for now, we do not have
the hardware to test it anyway :D). Still a lot of things to do, but
the first is probably to debug the scheduler and make it more dynamic.
I'm going on holiday this week, so it will be hard for me to work on
that =)

>> I fear I have also caused memory leaks on the device, as the NVIDIA
>> Visual Profiler won't work... I'll have to debug this :p
>
> No, not necessarily. I've encountered similar problems with the
> installation on the testing machine, yet I could reproduce the issue
> with a rather simple clean OpenCL code. If I remember correctly, the
> issue was related to some static handles and their point of
> destruction by the compiler. So, it's not necessarily your fault ;-)

Hmmm, maybe :p I'll try to give it a deeper look once everything works
better.

> Best regards,
> Karli

Best regards,
Philippe