Re: [Pocl-devel] async_work_group_copy

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On 01/19/2012 08:39 PM, Erik Schnetter wrote:
> Why can't these function not just be implemented as any other run-time
> functions?

It should be. We need to be able to override it also on *platform* basis
(for which there is not yet an abstraction layer in pocl), not only
in the kernel library (which are per instruction set architecture now).
The default generic one can just use a "CPU copy" for now.

> If (as you suggest below) only one work item initiates the DMA, then
> wrapping the function implementation in an if statement (as you
> suggest below) seems the best solution. The event type can then be an
> int counting the number of times this work item has called
> async_work_group_copy. The outstanding DMA requests are kept in a
> per-workgroup list, ordered by this int.

Yes. Something like that I had in mind. Actual implementation
is going to poke some I/O registers to initiate the DMA and
possibly implement an interrupt handler to get notified when
the transfer is ready (or via polling some I/O reg).

During that process there will be a "handle" of some kind involved
so I thought we can just return that as the event value. In the
default simple implementation we can just block, thus it doesn't
matter what we return as the event because the wait_group_events
will be a dummy.

> Using such an int instead of a pointer allows the non-DMA-initiating
> work items to continue without waiting, and they will still find the
> respective DMA request when wait_group_events is called.

Remember that WIs are statically scheduled inside WG in pocl. They
are not mapped to threads with independent control. However, the
transfers initiated by multiple WGs executing in the devices is a
different story I think we do not need to yet go into.

> What about this implementation:
>
> bool __single() { return get_local_id(0)==0&&  get_local_id(1)==0&&
> get_local_id(2)==0; }

Looks good to me.

> (Technically, that's __master(), not __single(), but it would work
> here.) Otherwise you'll need a lock, and keep track which threads have
> entered/passed this construct. OpenMP also adds a barrier at the end
> of single construct by default, so you'd want to have that here as
> well, and then move the return statement to after the barrier.

I didn't get this part. Remember that the code from all the WIs is
there statically. The above code should work as is for this case.

> One simple implementation would call memcpy or its equivalent on work
> item 0 when async_work_group_copy is called, would do nothing in the
> other work items, and would have a barrier in wait_group_events. Of
> course, that's not asynchronous, but no events are needed.

Exactly. The simplest implementation should be quite trivial. It
just needs the __single() function you proposed after which a
simple for-copy-loop should do.

For the actual async block transferring one we need to wait when
we have a device/platform that can do such.

-- 
--Pekka