From: Philippe T. <phi...@gm...> - 2012-08-02 02:27:11
Hi again,

I have been able to observe some very funny behavior. I am using one host thread per queue, and, depending on the race condition (i.e., which device takes which handle), the operation takes longer or shorter to execute! The good news is that, in some cases, I got 0.96 seconds for a 6400*6400 float matrix, which is about 6400^3/0.96 ≈ 273 GFLOP/s, slightly more than what a single GPU can do.

I have just written an API for creating copies upon dependencies, but for now it is just not working... because I still have not taken care of data locality!

Going for one context per device still seems to be a bad idea, as no function of the OpenCL API will work across contexts. For example, the copy constructor of matrix() will have undefined behavior whenever the two matrices live on different devices, because clEnqueueCopyBuffer assumes a common context! And I'm sure there are a lot of similar cases, unfortunately. Plus, the OpenCL standard tends to encourage the single-context, multi-device approach, with the introduction of new functions such as clEnqueueMigrateMemObjects (or whatever the name is :D).

Anyway, I hope I'll soon be able to get out of this *&é'(_èç ! Good night everybody :)

2012/8/1 Karl Rupp <ru...@iu...>
> Hi Philippe,
>
> thanks for the investigations.
>
> > kernel 1, device 1 : C(0,0) = A(0,0) * B(0,0)
> > kernel 2, device 2 : C(0,1) = A(0,0) * B(0,1)
> >
> > Both the AMD and the NVIDIA SDKs are unable to multicast A(0,0) from the
> > host to the two GPUs. Even if the two kernels are enqueued in parallel,
> > the execution is serialized, because the second device has to wait for
> > A(0,0) to be available. This is exactly the behavior I feared.
> > It does not happen with a simple matrix addition, where all the handles
> > are independent.
>
> Okay, I see, so the const-qualifiers for the kernel handles are ignored
> (or not abused for a more efficient implementation).
> Thus, it seems like we have to use separate memory handles in such cases,
> and we had better attach some meta-information ('current device') to each
> memory handle.
>
> > I'm desperately looking for a low-memory handle multicast. I might
> > give the Khronos forum a try, even though enqueuing the same handle on
> > different queues is left implementation-defined by the standard!
>
> Oh dear, 'implementation-defined' is nothing I want to see at this point
> :-( Seems like we should perhaps reconsider using one context per device
> and benchmark memory transfers for the two options (i.e., one context for
> all devices vs. one context per device).
>
> > But well, the good news is that the kernels are executing!
>
> Yep, some good news :-)
>
> Best regards,
> Karli
>
> > 2012/7/31 Karl Rupp <ru...@iu...>
> >
> > Hello again,
> >
> > I've just pushed the following changes to the sourceforge repository:
> > * operator+= and operator-= no longer create temporaries
> > * A = prod(B,C) no longer fails if there is garbage in A
> >
> > Best regards,
> > Karli
> >
> > On 07/29/2012 03:59 PM, Philippe Tillet wrote:
> >
> > Hello everybody!
> >
> > I'll inaugurate this mailing list with a little question.
> > I have not seen any kernel for computing the operation A += prod(B,C).
> > Does this mean that this operation is done as
> >
> > tmp = prod(B,C)
> > A += tmp
> >
> > ?
> >
> > For computing the multi_matrix (the project I'm working on: a matrix
> > composed of multiple handles, to work around the maximum allocation
> > size (CL_DEVICE_MAX_MEM_ALLOC_SIZE) and the multi-device issue), I
> > need to do several updates of this kind, in a block layout. For a 2*2
> > block layout:
> >
> > C(0,0).clear();
> > C(0,0) += prod( A(0,0), B(0,0) )
> > C(0,0) += prod( A(0,1), B(1,0) )
> >
> > C(0,1).clear();
> > C(0,1) += prod( A(0,0), B(0,1) )
> > C(0,1) += prod( A(0,1), B(1,1) )
> >
> > ...
> > This "sort-of-rank-1-update" approach is a special case of the SUMMA
> > algorithm (with OpenCL doing the memory transfers in the background,
> > for now at least) and seems to be efficient from a memory point of
> > view. Any other approach would lead to both a huge memory consumption
> > and significant memory transfers...
> >
> > Is there any way of doing this in ViennaCL?
> >
> > Best regards!
> > Phil
> >
> > _______________________________________________
> > ViennaCL-devel mailing list
> > ViennaCL-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/viennacl-devel