From: Erik S. <esc...@pe...> - 2012-01-19 18:40:13
2012/1/19 Pekka Jääskeläinen <pek...@tu...>:
> Hi,
>
> One of the book examples uses the async_work_group_copy kernel
> function.
>
> In the longer term, these functions are important to implement
> efficiently. They allow initiating block transfers between the
> global and local memories from the kernel side and overlap
> those transfers with kernel computation.
>
> This set of functions requires knowledge of the *platform*
> implementation (not only the device) as only the "platform"
> knows how to implement e.g. the DMA transfers from the
> device side.
>
> However, even the minimal default implementation (that is
> unaware of DMA transfers etc.) of these functions (a simple
> blocking for-loop) needs special treatment like the barrier()
> does. The function call is per WG, so all WIs in the WG must
> encounter it with the same arguments, similarly like with the
> barrier().
>
> I wonder how this should be implemented. Treat
> these functions exactly like a barrier() in the pocl
> passes? One difficulty is that async_work_group_copy returns
> an event which will be blocked on with wait_group_events().
> This event variable is a private variable but in a sense
> a local variable as only a single actual WI needs it so
> replicating it just consumes space (but no other harm
> that I can see).

Why can't these functions just be implemented like any other run-time
functions? If (as you suggest below) only one work item initiates the
DMA, then wrapping the function implementation in an if statement seems
the best solution.

The event type can then be an int counting the number of times this
work item has called async_work_group_copy. The outstanding DMA
requests are kept in a per-workgroup list, ordered by this int. Using
such an int instead of a pointer allows the non-DMA-initiating work
items to continue without waiting, and they will still find the
respective DMA request when wait_group_events is called.
> One way to implement this at the higher level would be to
> have a mechanism as in OpenMP where you can define
> code to be executed by only a single WI in the WG.
> https://computing.llnl.gov/tutorials/openMP/#SINGLE
>
> Some magic variable that holds an id of a single thread
> could do.
>
> event_t async_work_group_copy ( ... ) {
>
>   if (__single())
>     {
>       // perform a block copy by initiating a DMA transfer in
>       // the background
>       return event_id; /* implementation "handle" */
>     }
>   return 0; /* doesn't matter what we return for the others */
> }

What about this implementation:

  bool __single() {
    return get_local_id(0)==0 && get_local_id(1)==0 && get_local_id(2)==0;
  }

(Technically, that's __master(), not __single(), but it would work
here.) Otherwise you'll need a lock, and keep track of which threads
have entered/passed this construct.

OpenMP also adds a barrier at the end of a single construct by default,
so you'd want to have that here as well, and then move the return
statement to after the barrier.

> On the other hand, in some implementations we might actually
> want all the work items/vector lanes/threads to participate in
> the memory transfer. For example, when block DMA is not available
> but block transfers using vector loads/stores are. In that case the data
> transport could be divided across the WIs/lanes somehow. Anyways,
> let's for now assume this function is executed by only one WI of the
> WG and concentrate on implementing that.

One simple implementation would call memcpy or its equivalent on work
item 0 when async_work_group_copy is called, would do nothing in the
other work items, and would have a barrier in wait_group_events. Of
course, that's not asynchronous, but no events are needed.

-erik

-- 
Erik Schnetter <esc...@pe...>  http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...