From: Karl R. <ru...@iu...> - 2012-08-01 08:25:04
|
From: Philippe Tillet <phi...@gm...>
Date: Wed, 1 Aug 2012 00:22:34 +0200
To: Karl Rupp <ru...@iu...>

Hello!

Thank you very much, I have been able to cherry-pick the commit onto my local branch. I can now get very interesting profiling information through the program.

Now I am able to see a very interesting (and depressing) behavior. With memory dependencies, i.e.:

    kernel 1, device 1: C(0,0) = A(0,0) * B(0,0)
    kernel 2, device 2: C(0,1) = A(0,0) * B(0,1)

both the AMD and the NVIDIA SDKs are unable to multicast A(0,0) from the host to the two GPUs. Even if the two kernels are enqueued in parallel, execution is serialized, because the second device has to wait for A(0,0) to become available. This is exactly the behavior I feared. It does not happen with a simple matrix addition, where all the handles are independent.

I'm desperately looking for a low-memory way to multicast a handle. I might give the Khronos forum a try, even though enqueuing the same handle on different queues is left implementation-defined by the standard!

But well, the good news is that the kernels are executing!

Good night and thanks again for the patch :)

2012/7/31 Karl Rupp <ru...@iu...>:

Hello again,

I've just pushed the following changes to the SourceForge repository:

* operator+= and operator-= no longer create temporaries
* A = prod(B,C) does not fail if there is garbage in A

Best regards,
Karli

On 07/29/2012 03:59 PM, Philippe Tillet wrote:

Hello everybody!

I'll inaugurate this mailing list with a little question. I have not seen any kernel for computing the operation A += prod(B,C). Does this mean that the operation is carried out as

    tmp = prod(B,C)
    A += tmp

?

For computing the multi_matrix (the project I'm working on: a matrix composed of multiple handles, to work around the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit and the multi-device issue), I need to do several updates of this kind, in a block layout.
For a 2*2 block layout:

    C(0,0).clear();
    => C(0,0) += prod( A(0,0), B(0,0) )
    => C(0,0) += prod( A(0,1), B(1,0) )
    C(0,1).clear();
    => C(0,1) += prod( A(0,0), B(0,1) )
    => C(0,1) += prod( A(0,1), B(1,1) )
    ...

This "sort-of-rank-1-update approach" is a special case of the SUMMA algorithm (with OpenCL doing the memory transfers in the background, for now at least) and seems to be efficient from a memory point of view. Using another approach would lead to both huge memory consumption and significant memory transfers...

Is there any way of doing this in ViennaCL?

Best regards!
Phil

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
|
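The block update scheme from the original question (clear each C(i,j), then accumulate C(i,j) += prod(A(i,k), B(k,j)) over k) can be sketched on the host in plain C++. This is only an illustration under stated assumptions: Block, add_prod, BlockGrid and block_multiply are hypothetical names invented here, not ViennaCL API, and the sketch does not model OpenCL handles, device placement, or memory transfers. It merely shows that the accumulation can be done in place, without a temporary for each product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major block; stands in for one memory handle.
struct Block {
    std::size_t rows, cols;
    std::vector<double> data;

    Block(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }

    // Overwrite every entry with zero, so stale values cannot leak into a sum.
    void clear() { data.assign(data.size(), 0.0); }
};

// C += A * B, accumulated directly into C (no temporary for the product).
void add_prod(Block& C, const Block& A, const Block& B) {
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t k = 0; k < A.cols; ++k)
            for (std::size_t j = 0; j < B.cols; ++j)
                C(i, j) += A(i, k) * B(k, j);
}

// Hypothetical grid of blocks, indexed as grid[block_row][block_col].
using BlockGrid = std::vector<std::vector<Block>>;

// C = A * B over a block layout: clear each C(i,j), then add one
// block product per inner index k, as in the 2*2 listing above.
void block_multiply(BlockGrid& C, const BlockGrid& A, const BlockGrid& B) {
    assert(C.size() == A.size());
    for (std::size_t i = 0; i < C.size(); ++i) {
        assert(A[i].size() == B.size());
        for (std::size_t j = 0; j < C[i].size(); ++j) {
            C[i][j].clear();
            for (std::size_t k = 0; k < B.size(); ++k)
                add_prod(C[i][j], A[i][k], B[k][j]);
        }
    }
}
```

In this sketch each C(i,j) depends on a whole block row of A, so two output blocks in the same row read the same A(i,k). That shared read is the dependency that serialized the two devices in the profiling run described above.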