[ViennaCL-devel] Fwd: Re: Computing A += prod(B,C)


From: 	Philippe Tillet <phi...@gm...>
Date: 	Wed, 1 Aug 2012 00:22:34 +0200
To: 	Karl Rupp <ru...@iu...>

Hello !

Thank you very much, I was able to cherry-pick the commit onto my
local branch.
I can now get very interesting profiling information from the program, such as:
I am now able to see a very interesting (and depressing) behavior. When
there are memory dependencies, for example:
kernel 1, device 1 : C(0,0) = A(0,0) * B(0,0)
kernel 2, device 2 : C(0,1) = A(0,0) * B(0,1),
neither the AMD nor the NVIDIA SDK is able to multicast A(0,0) from the
host to the two GPUs. Even if the two kernels are enqueued in parallel,
the execution is serialized, because the 2nd device has to wait for
A(0,0) to be available. This is exactly the behavior I feared.
It does not happen with a simple matrix addition, where all the handles
are independent.

I'm desperately looking for a low-memory way to multicast handles. I might
give the Khronos forum a try, even though enqueuing the same handle on
different queues is left implementation-defined by the standard!

But well, the good news is that the kernels are executing!

Good night and thanks again for the patch :)

2012/7/31 Karl Rupp <ru...@iu... <mailto:ru...@iu...>>

     Hello again,

     I've just pushed the following changes to the SourceForge repository:
     * operator+= and operator-= no longer create temporaries
     * A = prod(B,C) does not fail if there is garbage in A

     Best regards,
     Karli

     On 07/29/2012 03:59 PM, Philippe Tillet wrote:

         Hello everybody !

         I'll inaugurate this mailing list with a little question.
         I have not seen any kernel for computing the operation A +=
         prod(B,C).
         Does this mean that this operation is carried out as:

         tmp = prod(B,C)
         A += tmp

         ?
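The two evaluation strategies being asked about can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not ViennaCL code: the `Matrix` struct and the functions `add_prod_with_temporary` and `add_prod_in_place` are hypothetical names, dimensions are assumed to be compatible, and A is assumed not to alias B or C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major matrix. A hypothetical stand-in, not a ViennaCL type.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Variant 1: tmp = prod(B, C); A += tmp.
// Needs a temporary as large as A.
void add_prod_with_temporary(Matrix& A, const Matrix& B, const Matrix& C) {
    Matrix tmp(B.rows, C.cols);
    for (std::size_t i = 0; i < B.rows; ++i)
        for (std::size_t k = 0; k < B.cols; ++k)
            for (std::size_t j = 0; j < C.cols; ++j)
                tmp.at(i, j) += B.at(i, k) * C.at(k, j);
    for (std::size_t idx = 0; idx < A.data.size(); ++idx)
        A.data[idx] += tmp.data[idx];
}

// Variant 2: accumulate every product term directly into A.
// No temporary is allocated; assumes A does not alias B or C.
void add_prod_in_place(Matrix& A, const Matrix& B, const Matrix& C) {
    for (std::size_t i = 0; i < B.rows; ++i)
        for (std::size_t k = 0; k < B.cols; ++k)
            for (std::size_t j = 0; j < C.cols; ++j)
                A.at(i, j) += B.at(i, k) * C.at(k, j);
}
```

Both variants give the same result; the second one avoids the extra allocation, which matters when each operand is already close to the per-buffer memory limit.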

         For computing the multi_matrix (the project I'm working on: a matrix
         composed of multiple handles, to work around the
         CL_DEVICE_MAX_MEM_ALLOC_SIZE limit and the multi-device issue), I need
         to do several updates of this kind, in a block layout. For a 2*2 block
         layout:

         C(0,0).clear();
         =>
         C(0,0) += prod( A(0,0), B(0,0) )
         =>
         C(0,0) += prod( A(0,1), B(1,0) )

         C(0,1).clear();
         =>
         C(0,1) += prod( A(0,0), B(0,1) )
         =>
         C(0,1) += prod( A(0,1), B(1,1) )

         ...
         ...

         This "sort-of-rank-1-update approach" is a special case of the SUMMA
         algorithm (with OpenCL doing the memory transfers in the background,
         for now at least) and seems to be efficient from a memory point of
         view. Using another approach would lead to both huge memory
         consumption and significant memory transfers...
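The block update scheme listed above can be written as a small host-side C++ sketch. It only illustrates the loop structure (clear each C(i,j), then accumulate the products over k); `Block`, `BlockGrid`, `add_prod` and `block_prod` are hypothetical names rather than ViennaCL or multi_matrix API, and all blocks are assumed to be square and of equal size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One square n x n block in row-major order (hypothetical helper type).
struct Block {
    std::size_t n;
    std::vector<double> v;
    explicit Block(std::size_t n_) : n(n_), v(n_ * n_, 0.0) {}
    void clear() { v.assign(v.size(), 0.0); }
};

// C += A * B for a single block, accumulating in place.
void add_prod(Block& C, const Block& A, const Block& B) {
    const std::size_t n = C.n;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C.v[i * n + j] += A.v[i * n + k] * B.v[k * n + j];
}

// A p x p grid of blocks, indexed as grid[i][j].
using BlockGrid = std::vector<std::vector<Block>>;

// For every output block: C(i,j).clear(), then C(i,j) += prod(A(i,k), B(k,j))
// for each k. Only one output block is written at a time.
void block_prod(BlockGrid& C, const BlockGrid& A, const BlockGrid& B) {
    const std::size_t p = C.size();
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < p; ++j) {
            C[i][j].clear();
            for (std::size_t k = 0; k < p; ++k)
                add_prod(C[i][j], A[i][k], B[k][j]);
        }
}
```

In this form each update touches a single output block and two input blocks, which is why no full-size temporary is needed at any point.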

         Is there any way of doing so in ViennaCL ?

         Best regards !
         Phil

