Re: [Pocl-devel] async_work_group_copy

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

2012/1/19 Pekka Jääskeläinen <pek...@tu...>:
> On 01/19/2012 08:39 PM, Erik Schnetter wrote:
>>
>> Why can't these function not just be implemented as any other run-time
>> functions?
>
>
> It should be. We need to be able to override it also on *platform* basis
> (for which there is not yet an abstraction layer in pocl), not only
> in the kernel library (which are per instruction set architecture now).
> The default generic one can just use a "CPU copy" for now.
>
>
>> If (as you suggest below) only one work item initiates the DMA, then
>> wrapping the function implementation in an if statement (as you
>> suggest below) seems the best solution. The event type can then be an
>> int counting the number of times this work item has called
>> async_work_group_copy. The outstanding DMA requests are kept in a
>> per-workgroup list, ordered by this int.
>
>
> Yes. Something like that I had in mind. Actual implementation
> is going to poke some I/O registers to initiate the DMA and
> possibly implement an interrupt handler to get notified when
> the transfer is ready (or via polling some I/O reg).
>
> During that process there will be a "handle" of some kind involved
> so I thought we can just return that as the event value. In the
> default simple implementation we can just block, thus it doesn't
> matter what we return as the event because the wait_group_events
> will be a dummy.
>
>
>> Using such an int instead of a pointer allows the non-DMA-initiating
>> work items to continue without waiting, and they will still find the
>> respective DMA request when wait_group_events is called.
>
>
> Remember that WIs are statically scheduled inside WG in pocl. They
> are not mapped to threads with independent control. However, the
> transfers initiated by multiple WGs executing in the devices is a
> different story I think we do not need to yet go into.
>
>
>> What about this implementation:
>>
>> bool __single() { return get_local_id(0)==0&&  get_local_id(1)==0&&
>> get_local_id(2)==0; }
>
>
> Looks good to me.
>
>
>> (Technically, that's __master(), not __single(), but it would work
>> here.) Otherwise you'll need a lock, and keep track which threads have
>> entered/passed this construct. OpenMP also adds a barrier at the end
>> of single construct by default, so you'd want to have that here as
>> well, and then move the return statement to after the barrier.
>
>
> I didn't get this part. Remember that the code from all the WIs is
> there statically. The above code should work as is for this case.

I was thinking of a case where there is an if statement, and control
flow between different work-items diverges. A routine may have
different call sites to async_work_group_copy (all with identical
arguments, but called from different lines in the source code), and
one would have to make these match up.

Example (e.g. work item size 2):

if (get_local_id(0)==0) {
  // do something
  async_work_group_copy();
  // do something else
  wait_group_events();
  // more stuff
} else {
  // do something
  async_work_group_copy();
  // do something else
  wait_group_events();
  // more stuff
}

It is then not immediately clear in which order the individual work
items are executing the async_work_group_copy(). In my reading (which
may be flawed), this code should be legal -- each work item needs to
call wait_group_events() (or a barrier) with the same arguments, but
it doesn't need to be the same call site.

In this case, a barrier is necessary. (Alternatively, pocl could
handle wait_group_events in the same case as a call to a barrier while
duplicating the code -- you mentioned this earlier.) pocl's handling
of barriers should then already handle this correctly. (Or, if it
doesn't, there's either a bug, or a misunderstanding of OpenCL
semantics on my side.)

-erik

-- 
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...