From: Erik S. <sch...@gm...> - 2012-03-06 16:11:57
|
-- Erik Schnetter <sch...@gm...> http://www.perimeterinstitute.ca/personal/eschnetter/ |
From: Erik S. <esc...@pe...> - 2012-03-05 15:01:14
|
Pekka No, I did not code anything in this respect. -erik 2012/3/5 Pekka Jääskeläinen <pek...@tu...>: > On 12/19/2011 04:17 PM, Erik Schnetter wrote: >> >> The previous discussion was much about file formats. Does this mean that >> enqueuing a cached kernel would still require a dlopen? I was more >> hoping for caching the kernels in memory, so that enqueuing a kernel is >> really as cheap as an indirect function call. > > > Did you do some coding for this? > > I'll implement this as the ViennaCL has test cases which compile (enqueue) > the same kernel multiple times (blas3) and the kernels seem to be quite > slow to compile. Otherwise it's not sensible to add the ViennaCL tests > to the 'make check' suite. > > -- > Pekka -- Erik Schnetter <esc...@pe...> http://www.perimeterinstitute.ca/personal/eschnetter/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-03-05 09:14:27
|
On 12/19/2011 04:17 PM, Erik Schnetter wrote: > The previous discussion was much about file formats. Does this mean that > enqueuing a cached kernel would still require a dlopen? I was more > hoping for caching the kernels in memory, so that enqueuing a kernel is > really as cheap as an indirect function call. Did you do some coding for this? I'll implement this, as ViennaCL has test cases which compile (enqueue) the same kernel multiple times (blas3), and the kernels seem to be quite slow to compile. Otherwise it's not sensible to add the ViennaCL tests to the 'make check' suite. -- Pekka |
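[Editor's note] The exchange above is about caching compiled kernels in memory so that a repeated enqueue is as cheap as an indirect function call. A minimal sketch of that idea in plain C follows; all names (`kernel_fn`, `cache_lookup`, `cache_insert`, `dummy_kernel`) are illustrative assumptions, not pocl's actual API, and the cache is keyed naively on the source text:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of an in-memory compiled-kernel cache: entries are
 * kept in a linked list keyed by kernel source text, so a repeated enqueue
 * finds the compiled entry point instead of recompiling. */
typedef void (*kernel_fn)(void *args);

/* stand-in for a compiled kernel entry point */
static void dummy_kernel(void *args) { (void)args; }

struct cache_entry {
    char *source;               /* key: the kernel source text */
    kernel_fn fn;               /* value: compiled entry point */
    struct cache_entry *next;
};

static struct cache_entry *cache_head = NULL;

/* Returns the cached entry point, or NULL on a miss (caller compiles). */
kernel_fn cache_lookup(const char *source) {
    for (struct cache_entry *e = cache_head; e != NULL; e = e->next)
        if (strcmp(e->source, source) == 0)
            return e->fn;       /* cache hit: no recompilation needed */
    return NULL;
}

void cache_insert(const char *source, kernel_fn fn) {
    struct cache_entry *e = malloc(sizeof *e);
    e->source = malloc(strlen(source) + 1);
    strcpy(e->source, source);
    e->fn = fn;
    e->next = cache_head;
    cache_head = e;
}
```

A real implementation would hash the source and also key on build options, but the lookup-before-compile shape is the point being discussed.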
From: Pekka J. <pek...@tu...> - 2012-02-08 09:53:55
|
On 02/04/2012 12:39 PM, Pekka Jääskeläinen wrote: > If it is, I think the easiest fix is to propose a new metadata Actually, the easiest hackish workaround would be to just assume that all llvm global buffers inside the module with a "non-C" name (clang seems to use k.b as the name, where k is the name of the kernel and b the name of the buffer) which are referred to from kernels are automatic local buffers. It's not robust enough but might do until a better fix (e.g. the metadata) is in place. It seems there's some kind of wider debate going on in LLVMdev on how to properly implement the OpenCL (or any language-specific) metadata handling. -- Pekka |
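[Editor's note] The name-based heuristic proposed above amounts to a one-line check; a sketch follows. The function name is hypothetical, and as the email itself stresses, the heuristic is not robust — it merely exploits the fact that a C identifier cannot contain a '.':

```c
#include <string.h>
#include <stdbool.h>

/* Sketch of the proposed workaround: an LLVM global whose name contains a
 * '.' (e.g. "k.b" for automatic local buffer b in kernel k) cannot have come
 * from a C identifier, so it is assumed to be an automatic local buffer.
 * Illustrative only; a metadata-based scheme would replace this. */
bool looks_like_automatic_local(const char *global_name) {
    return strchr(global_name, '.') != NULL;
}
```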
From: Pekka J. <pek...@tu...> - 2012-02-04 10:40:08
|
Carlos, Please confirm this is the problem: On 02/02/2012 02:39 PM, Pekka Jääskeläinen wrote: > Is the problem that for targets which map all the address spaces > to zero you cannot differentiate what is a local buffer and what > is a global one, thus unable to know to which one needs to allocate > local space before launching the kernel? If it is, I think the easiest fix is to propose a new metadata for the automatic local buffers to Clang which we can then use to allocate the space in the launcher (and it should work also for targets with locals being mapped to AS0). I can do this if I just get your confirmation. This is a nasty nuisance for non-TCE targets I'd like to get done with. -- --Pekka |
From: Pekka J. <pek...@tu...> - 2012-02-02 12:39:38
|
Hi, I don't understand the problem. Please elaborate as I do not remember the (live) discussion we had before you added the "sed hack". Here is the output (llvm/Clang trunk from yesterday) from compiling the following kernel:

kernel void k (local int *a) {
  local int b[100];
  local int *p;
  p = b;
  p = a;
}

with

clang kernel_with_locals.cl -emit-llvm -c -O0 -ccc-host-triple tce-tut-llvm -o - | llvm-dis -o -

; ModuleID = '<stdin>'
target datalayout = "E-p:32:32:32-i1:8:8-i8:8:32-i16:16:32-i32:32:32-i64:32:32-f32:32:32-f64:32:32-v64:32:32-v128:32:32-a0:0:32-n32"
target triple = "tce-tut-llvm"

@k.b = internal addrspace(4) global [100 x i32] zeroinitializer, align 4

define void @k(i32 addrspace(4)* %a) nounwind noinline {
entry:
  %a.addr = alloca i32 addrspace(4)*, align 4
  %p = alloca i32 addrspace(4)*, align 4
  store i32 addrspace(4)* %a, i32 addrspace(4)** %a.addr, align 4
  store i32 addrspace(4)* getelementptr inbounds ([100 x i32] addrspace(4)* @k.b, i32 0, i32 0), i32 addrspace(4)** %p, align 4
  %0 = load i32 addrspace(4)** %a.addr, align 4
  store i32 addrspace(4)* %0, i32 addrspace(4)** %p, align 4
  ret void
}

!opencl.kernels = !{!0}
!0 = metadata !{void (i32 addrspace(4)*)* @k}

-----------

So, the automatic local is a "llvm global" with the correct addrspace set. Also the argument pointer has the correct addrspace. p itself is allocated in the stack (a private variable). On 02/02/2012 02:13 PM, Carlos Sánchez de La Lama wrote: > should "p" have the address space qualifier? if it has, then "p = b" is > illegal. If it has not, then "p = a" is illegal. p itself is a private variable pointing to a local variable so I think both of the cases should be legal. Is the problem that for targets which map all the address spaces to zero you cannot differentiate what is a local buffer and what is a global one, thus unable to know to which one needs to allocate local space before launching the kernel? -- Pekka |
From: Carlos S. de La L. <car...@ur...> - 2012-02-02 12:13:11
|
Hi guys, local pointer handling is now done with a preprocessor + sed hack to work around clang limitations. The background here is that clang does not provide information in LLVM IR about whether a pointer is local or not, which is needed by pocl (and, although this requirement might be removable, this information is needed on architectures with a real separate address space for local memory). The hack adds an address space qualifier on local pointers in kernel arguments, which makes them recognizable in IR, and keeps normal local variable declarations untouched (which is required in that case). Automatic local variables are handled by clang by creating "global" equivalents, which the pocl passes then transform as required. clang crashes if these automatic globals have an address space requirement. The problem then arises when a kernel declares a local variable which is a pointer to a local buffer. Example:

kernel void k (local int *a) {
  local int b[100];
  local int *p;
  ...
}

Should "p" have the address space qualifier? If it has, then "p = b" is illegal. If it has not, then "p = a" is illegal. I guess this means the address space qualifier on kernel pointer arguments has to be dropped (sigh), which means medium-deep changes in pocl. BTW, Pekka, do you know how that problematic pointer assignment case works when using the TCE backend? (which AFAIR assigns address spaces to locals) BR Carlos |
From: Pekka J. <pek...@tu...> - 2012-01-24 16:45:13
|
FYI, I implemented the minimal version of clCreateProgramWithBinary with the binary format being directly the sequential LLVM bitcode of the kernels. The book example that tests the functionality works now. As there's no caching of the final target binaries in the binary format yet, it doesn't speed up the kernel compilation much at all. For that we need to implement the discussed archive binary format. On 01/17/2012 02:43 PM, Pekka Jääskeläinen wrote: > Not strictly needed it seems. CL_PROGRAM_SOURCE of clGetProgramInfo > is optional in case the Program was constructed from a binary. Thus, > the original BC should suffice. -- Pekka |
From: Erik S. <esc...@pe...> - 2012-01-22 15:10:34
|
2012/1/22 Pekka Jääskeläinen <pek...@tu...>: > Hi Erik, > > > On 01/20/2012 07:05 PM, Erik Schnetter wrote: >> >> 3-element vectors can't be accessed directly in memory, because they >> need to be aligned the same was as 4-vectors. One needs to use >> vload/vstore for this, which doesn't require alignment. >> >> I assume that 3-element vectors would be stored packed in global >> memory. I don't know whether this is what async_copy actually >> expects... However, un-packed storage can be handled by passing a >> respective 4-element vector as gentype. > > > I'm not sure if I got this. The current implementation of > the async copy is just a for loop that copies elements of > the arrays. > > The elements in the copy loop are handled as elements of > the actual gentype, e.g. float3. As far as I understood > this means the compiler should take care of the correct alignment of > the vectors in memory because it's stated with the alignment > attribute (to be the same as float4) both in the host side and the > kernel side. > > Did you check from the generated assembly that the copy loop does > not produce a correct alignment when copying buffers with 3-element > vectors? > > "The vload3 and vstore3 built-in functions can be used to read and > write, respectively, 3-component vector data types from an array of > packed scalar data type." > > From cl_platform.h: > > /* cl_float3 is identical in size, alignment and behavior to cl_float4. See > section 6.1.5. */ > typedef cl_float4 cl_float3; > > > So, my understanding of this is that it should work because both the > host and the device code are in the same understanding of the alignment > of the 3-element vectors. I.e. it's stored 4-aligned everywhere if > you handle buffers that contain 3-element vectors and thus copy loop > should be generated to code that adheres to this. 
> > If the input was packed (basically a float buffer) and one would like > to copy it to a buffer of 3-wide vectors, then vload3 and vstore3 should > be used. > > Please correct me if I understood this wrongly. Yes, this is correct. I was assuming that packed buffers were more common than unpacked ones (otherwise, why would one use float3?), and that async_copy would need to handle this. I now realise that there is no basis for this assumption of mine. Indeed there is a footnote in the standard stating that float3 should be copied in the same way as float4. Your implementation is correct. -erik -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-22 12:19:55
|
Hi Erik, On 01/20/2012 07:05 PM, Erik Schnetter wrote: > 3-element vectors can't be accessed directly in memory, because they > need to be aligned the same was as 4-vectors. One needs to use > vload/vstore for this, which doesn't require alignment. > > I assume that 3-element vectors would be stored packed in global > memory. I don't know whether this is what async_copy actually > expects... However, un-packed storage can be handled by passing a > respective 4-element vector as gentype. I'm not sure if I got this. The current implementation of the async copy is just a for loop that copies elements of the arrays. The elements in the copy loop are handled as elements of the actual gentype, e.g. float3. As far as I understood this means the compiler should take care of the correct alignment of the vectors in memory because it's stated with the alignment attribute (to be the same as float4) both in the host side and the kernel side. Did you check from the generated assembly that the copy loop does not produce a correct alignment when copying buffers with 3-element vectors? "The vload3 and vstore3 built-in functions can be used to read and write, respectively, 3-component vector data types from an array of packed scalar data type." From cl_platform.h: /* cl_float3 is identical in size, alignment and behavior to cl_float4. See section 6.1.5. */ typedef cl_float4 cl_float3; So, my understanding of this is that it should work because both the host and the device code are in the same understanding of the alignment of the 3-element vectors. I.e. it's stored 4-aligned everywhere if you handle buffers that contain 3-element vectors and thus copy loop should be generated to code that adheres to this. If the input was packed (basically a float buffer) and one would like to copy it to a buffer of 3-wide vectors, then vload3 and vstore3 should be used. Please correct me if I understood this wrongly. -- --Pekka |
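[Editor's note] The cl_platform.h convention quoted above (cl_float3 identical in size and alignment to cl_float4) can be mirrored with the GCC/Clang vector_size extension, which is how the "4-aligned everywhere" claim can be checked on the host side. The type names below are stand-ins for the OpenCL ones, not the real cl_float3/cl_float4 definitions:

```c
/* Mirror of the cl_platform.h convention discussed in the thread: a
 * 3-element float vector gets the size and alignment of a 4-element one,
 * so buffers of float3 are laid out 16-byte aligned, not packed.
 * Uses the GCC/Clang vector_size extension (requires C11 for the
 * static assertion). */
typedef float float4_t __attribute__((vector_size(16)));
typedef float4_t float3_t;   /* identical in size, alignment and behavior */

_Static_assert(sizeof(float3_t) == sizeof(float4_t),
               "float3 must be stored like float4");
```

Packed 3-float data, by contrast, is what vload3/vstore3 are for, as the quoted spec text says.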
From: Erik S. <esc...@pe...> - 2012-01-20 17:06:00
|
Pekka 3-element vectors can't be accessed directly in memory, because they need to be aligned the same way as 4-vectors. One needs to use vload/vstore for this, which doesn't require alignment. I assume that 3-element vectors would be stored packed in global memory. I don't know whether this is what async_copy actually expects... However, un-packed storage can be handled by passing a respective 4-element vector as gentype. -erik 2012/1/20 Pekka Jääskeläinen <pek...@tu...>: > OK, > > I implemented the non-strided versions in r159 using a trivial for-loop. > I didn't test them yet (the spmv book example case that uses the functions > fails with the bug regarding the __local pointers). Feel free to > modify or comment. > > The easiest optimization for the implementation is to define > versions that use vector loads and stores for the vector gentypes > for the architectures with SIMD loads/stores (e.g. x86_64 with avx/sse). > > Have a nice weekend! > > > On 01/20/2012 05:14 PM, Erik Schnetter wrote: >> >> Pekka >> >> I thought about using include files to instantiate macros, but so far, >> I've kept things to using macros only. Instead, there is a different >> macro for each "kind" of prototype. We can change this to using >> #include instead (which would probably reduce the amount of code), but >> this would also make it more complex to instantiate macros -- this >> would then require a #define, and #include, and an #undef for each >> function. >> >> I would instead #define a new macro just for async_work_group_copy in >> _kernel.h. There would probably be another specific macro in >> templates.h to help instantiating the function definitions. This would >> be similar to the vload/vstore functions. >> >> This is similar to what you suggest except it doesn't use an #include >> file. There is no particular reason not to use an #include file, >> except that we currently don't. If you think it significantly >> simplifies the code, then do it. 
>> >> -erik >> >> 2012/1/20 Pekka Jääskeläinen<pek...@tu...>: >>> >>> Erik, >>> >>> The function protototype for the async copy includes a "gentype" >>> in a position not supported by the current "generator macros" of >>> yours. >>> >>> event_t async_work_group_copy ( >>> __local gentype *dst, >>> const __global gentype *src, >>> size_t num_gentypes, >>> event_t event); >>> >>> What do you think is best way to generate the declarations and >>> definitions for such functions? >>> >>> Something like: >>> >>> #define __FUNC_PROTO(gentype) \ >>> __attribute__ ((overloadable)) \ >>> event_t async_work_group_copy ( \ >>> __local gentype *dst, \ >>> const __global gentype *src, \ >>> size_t num_gentypes, \ >>> event_t event) \ >>> >>> #include "gentype_func_decl.inc" >>> >>> Then that .inc would have the macro instantiated >>> with all the different value types for gentype. E.g.: >>> >>> __FUNC_PROTO(float); >>> __FUNC_PROTO(float2); >>> __FUNC_PROTO(float4); >>> ... >>> >>> Similarly for the definitions. Here I think we >>> can assume both of the gentypes in the function >>> are always the same so we do not have to generate >>> all combinations. >>> >>> What do you think? >>> >>> -- >>> Pekka >>> >>> >>> ------------------------------------------------------------------------------ >>> Keep Your Developer Skills Current with LearnDevNow! >>> The most comprehensive online learning library for Microsoft developers >>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, >>> Metro Style Apps, more. Free future releases when you subscribe now! >>> http://p.sf.net/sfu/learndevnow-d2d >>> _______________________________________________ >>> Pocl-devel mailing list >>> Poc...@li... >>> https://lists.sourceforge.net/lists/listinfo/pocl-devel >> >> >> >> > > > -- > Pekka -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-20 16:41:42
|
OK, I implemented the non-strided versions in r159 using a trivial for-loop. I didn't test them yet (the spmv book example case that uses the functions fails with the bug regarding the __local pointers). Feel free to modify or comment. The easiest optimization for the implementation is to define versions that use vector loads and stores for the vector gentypes for the architectures with SIMD loads/stores (e.g. x86_64 with avx/sse). Have a nice weekend! On 01/20/2012 05:14 PM, Erik Schnetter wrote: > Pekka > > I thought about using include files to instantiate macros, but so far, > I've kept things to using macros only. Instead, there is a different > macro for each "kind" of prototype. We can change this to using > #include instead (which would probably reduce the amount of code), but > this would also make it more complex to instantiate macros -- this > would then require a #define, and #include, and an #undef for each > function. > > I would instead #define a new macro just for async_work_group_copy in > _kernel.h. There would probably be another specific macro in > templates.h to help instantiating the function definitions. This would > be similar to the vload/vstore functions. > > This is similar to what you suggest except it doesn't use an #include > file. There is no particular reason not to use an #include file, > except that we currently don't. If you think it significantly > simplifies the code, then do it. > > -erik > > 2012/1/20 Pekka Jääskeläinen<pek...@tu...>: >> Erik, >> >> The function protototype for the async copy includes a "gentype" >> in a position not supported by the current "generator macros" of >> yours. >> >> event_t async_work_group_copy ( >> __local gentype *dst, >> const __global gentype *src, >> size_t num_gentypes, >> event_t event); >> >> What do you think is best way to generate the declarations and >> definitions for such functions? 
>> >> Something like: >> >> #define __FUNC_PROTO(gentype) \ >> __attribute__ ((overloadable)) \ >> event_t async_work_group_copy ( \ >> __local gentype *dst, \ >> const __global gentype *src, \ >> size_t num_gentypes, \ >> event_t event) \ >> >> #include "gentype_func_decl.inc" >> >> Then that .inc would have the macro instantiated >> with all the different value types for gentype. E.g.: >> >> __FUNC_PROTO(float); >> __FUNC_PROTO(float2); >> __FUNC_PROTO(float4); >> ... >> >> Similarly for the definitions. Here I think we >> can assume both of the gentypes in the function >> are always the same so we do not have to generate >> all combinations. >> >> What do you think? >> >> -- >> Pekka >> >> ------------------------------------------------------------------------------ >> Keep Your Developer Skills Current with LearnDevNow! >> The most comprehensive online learning library for Microsoft developers >> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, >> Metro Style Apps, more. Free future releases when you subscribe now! >> http://p.sf.net/sfu/learndevnow-d2d >> _______________________________________________ >> Pocl-devel mailing list >> Poc...@li... >> https://lists.sourceforge.net/lists/listinfo/pocl-devel > > > -- Pekka |
From: Erik S. <esc...@pe...> - 2012-01-20 15:14:17
|
Pekka I thought about using include files to instantiate macros, but so far, I've kept things to using macros only. Instead, there is a different macro for each "kind" of prototype. We can change this to using #include instead (which would probably reduce the amount of code), but this would also make it more complex to instantiate macros -- this would then require a #define, and #include, and an #undef for each function. I would instead #define a new macro just for async_work_group_copy in _kernel.h. There would probably be another specific macro in templates.h to help instantiating the function definitions. This would be similar to the vload/vstore functions. This is similar to what you suggest except it doesn't use an #include file. There is no particular reason not to use an #include file, except that we currently don't. If you think it significantly simplifies the code, then do it. -erik 2012/1/20 Pekka Jääskeläinen <pek...@tu...>: > Erik, > > The function protototype for the async copy includes a "gentype" > in a position not supported by the current "generator macros" of > yours. > > event_t async_work_group_copy ( > __local gentype *dst, > const __global gentype *src, > size_t num_gentypes, > event_t event); > > What do you think is best way to generate the declarations and > definitions for such functions? > > Something like: > > #define __FUNC_PROTO(gentype) \ > __attribute__ ((overloadable)) \ > event_t async_work_group_copy ( \ > __local gentype *dst, \ > const __global gentype *src, \ > size_t num_gentypes, \ > event_t event) \ > > #include "gentype_func_decl.inc" > > Then that .inc would have the macro instantiated > with all the different value types for gentype. E.g.: > > __FUNC_PROTO(float); > __FUNC_PROTO(float2); > __FUNC_PROTO(float4); > ... > > Similarly for the definitions. Here I think we > can assume both of the gentypes in the function > are always the same so we do not have to generate > all combinations. > > What do you think? 
> > -- > Pekka > > ------------------------------------------------------------------------------ > Keep Your Developer Skills Current with LearnDevNow! > The most comprehensive online learning library for Microsoft developers > is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, > Metro Style Apps, more. Free future releases when you subscribe now! > http://p.sf.net/sfu/learndevnow-d2d > _______________________________________________ > Pocl-devel mailing list > Poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-20 09:17:54
|
Erik, The function prototype for the async copy includes a "gentype" in a position not supported by the current "generator macros" of yours.

event_t async_work_group_copy (
    __local gentype *dst,
    const __global gentype *src,
    size_t num_gentypes,
    event_t event);

What do you think is the best way to generate the declarations and definitions for such functions? Something like:

#define __FUNC_PROTO(gentype) \
  __attribute__ ((overloadable)) \
  event_t async_work_group_copy ( \
    __local gentype *dst, \
    const __global gentype *src, \
    size_t num_gentypes, \
    event_t event) \

#include "gentype_func_decl.inc"

Then that .inc would have the macro instantiated with all the different value types for gentype. E.g.:

__FUNC_PROTO(float);
__FUNC_PROTO(float2);
__FUNC_PROTO(float4);
...

Similarly for the definitions. Here I think we can assume both of the gentypes in the function are always the same so we do not have to generate all combinations. What do you think? -- Pekka |
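[Editor's note] The macro-only instantiation style Erik prefers (one macro expanded per gentype, no .inc file) can be sketched in plain C. C has no overloading, so each instance below gets a type-suffixed name; the names and the simple copy-loop body are illustrative assumptions, not pocl's actual generated code:

```c
#include <stddef.h>

/* Sketch of the macro-instantiation pattern discussed above: one macro
 * expands to a blocking copy-loop definition per element type, in the
 * spirit of the trivial async_work_group_copy implementation. */
#define DEFINE_WG_COPY(gentype)                                  \
    void wg_copy_##gentype(gentype *dst, const gentype *src,     \
                           size_t num_gentypes)                  \
    {                                                            \
        for (size_t i = 0; i < num_gentypes; ++i)                \
            dst[i] = src[i];   /* element-wise blocking copy */  \
    }

/* Instantiate for each value type (float2, float4, ... would follow). */
DEFINE_WG_COPY(float)
DEFINE_WG_COPY(int)
```

In the real kernel library the OpenCL `overloadable` attribute lets every instance share one name, which is what the `__FUNC_PROTO` proposal relies on.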
From: Pekka J. <pek...@tu...> - 2012-01-19 21:28:57
|
On 01/19/2012 10:28 PM, Erik Schnetter wrote: > It is then not immediately clear in which order the individual work > items are executing the async_work_group_copy(). In my reading (which > may be flawed), this code should be legal -- each work item needs to > call wait_group_events() (or a barrier) with the same arguments, but > it doesn't need to be the same call site. OK, got it. I think it's quite clear the WI control semantics is the same as with the barrier(). If it's not, it leads to tricky cases without apparent benefit. In the other interpretation, how would one in practice differentiate between two separate async copies (even if they were to the same address as one can modify the memory in between the calls)? Or the other way around, how would one prove the copy calls actually should be merged? Without figuring it out somehow, your example should be interpreted as an undefined program as all WIs should execute both of the async copy calls, which is impossible. The specs should be more clear with this but I'm quite sure the "control semantics" are the same as with barriers. Let's trust on that until shown otherwise. -- --Pekka |
From: Erik S. <esc...@pe...> - 2012-01-19 20:28:53
|
2012/1/19 Pekka Jääskeläinen <pek...@tu...>: > On 01/19/2012 08:39 PM, Erik Schnetter wrote: >> >> Why can't these function not just be implemented as any other run-time >> functions? > > > It should be. We need to be able to override it also on *platform* basis > (for which there is not yet an abstraction layer in pocl), not only > in the kernel library (which are per instruction set architecture now). > The default generic one can just use a "CPU copy" for now. > > >> If (as you suggest below) only one work item initiates the DMA, then >> wrapping the function implementation in an if statement (as you >> suggest below) seems the best solution. The event type can then be an >> int counting the number of times this work item has called >> async_work_group_copy. The outstanding DMA requests are kept in a >> per-workgroup list, ordered by this int. > > > Yes. Something like that I had in mind. Actual implementation > is going to poke some I/O registers to initiate the DMA and > possibly implement an interrupt handler to get notified when > the transfer is ready (or via polling some I/O reg). > > During that process there will be a "handle" of some kind involved > so I thought we can just return that as the event value. In the > default simple implementation we can just block, thus it doesn't > matter what we return as the event because the wait_group_events > will be a dummy. > > >> Using such an int instead of a pointer allows the non-DMA-initiating >> work items to continue without waiting, and they will still find the >> respective DMA request when wait_group_events is called. > > > Remember that WIs are statically scheduled inside WG in pocl. They > are not mapped to threads with independent control. However, the > transfers initiated by multiple WGs executing in the devices is a > different story I think we do not need to yet go into. 
> > >> What about this implementation: >> >> bool __single() { return get_local_id(0)==0&& get_local_id(1)==0&& >> get_local_id(2)==0; } > > Looks good to me. > > >> (Technically, that's __master(), not __single(), but it would work >> here.) Otherwise you'll need a lock, and keep track which threads have >> entered/passed this construct. OpenMP also adds a barrier at the end >> of single construct by default, so you'd want to have that here as >> well, and then move the return statement to after the barrier. > > > I didn't get this part. Remember that the code from all the WIs is > there statically. The above code should work as is for this case. I was thinking of a case where there is an if statement, and control flow between different work items diverges. A routine may have different call sites to async_work_group_copy (all with identical arguments, but called from different lines in the source code), and one would have to make these match up. Example (e.g. work item size 2):

if (get_local_id(0)==0) {
  // do something
  async_work_group_copy();
  // do something else
  wait_group_events();
  // more stuff
} else {
  // do something
  async_work_group_copy();
  // do something else
  wait_group_events();
  // more stuff
}

It is then not immediately clear in which order the individual work items are executing the async_work_group_copy(). In my reading (which may be flawed), this code should be legal -- each work item needs to call wait_group_events() (or a barrier) with the same arguments, but it doesn't need to be the same call site. In this case, a barrier is necessary. (Alternatively, pocl could handle wait_group_events the same way as a call to a barrier while duplicating the code -- you mentioned this earlier.) pocl's handling of barriers should then already handle this correctly. (Or, if it doesn't, there's either a bug, or a misunderstanding of OpenCL semantics on my side.) 
-erik -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-19 19:25:10
|
On 01/19/2012 08:39 PM, Erik Schnetter wrote: > Why can't these function not just be implemented as any other run-time > functions? It should be. We need to be able to override it also on *platform* basis (for which there is not yet an abstraction layer in pocl), not only in the kernel library (which are per instruction set architecture now). The default generic one can just use a "CPU copy" for now. > If (as you suggest below) only one work item initiates the DMA, then > wrapping the function implementation in an if statement (as you > suggest below) seems the best solution. The event type can then be an > int counting the number of times this work item has called > async_work_group_copy. The outstanding DMA requests are kept in a > per-workgroup list, ordered by this int. Yes. Something like that I had in mind. Actual implementation is going to poke some I/O registers to initiate the DMA and possibly implement an interrupt handler to get notified when the transfer is ready (or via polling some I/O reg). During that process there will be a "handle" of some kind involved so I thought we can just return that as the event value. In the default simple implementation we can just block, thus it doesn't matter what we return as the event because the wait_group_events will be a dummy. > Using such an int instead of a pointer allows the non-DMA-initiating > work items to continue without waiting, and they will still find the > respective DMA request when wait_group_events is called. Remember that WIs are statically scheduled inside WG in pocl. They are not mapped to threads with independent control. However, the transfers initiated by multiple WGs executing in the devices is a different story I think we do not need to yet go into. > What about this implementation: > > bool __single() { return get_local_id(0)==0&& get_local_id(1)==0&& > get_local_id(2)==0; } Looks good to me. > (Technically, that's __master(), not __single(), but it would work > here.) 
Otherwise you'll need a lock, and keep track which threads have > entered/passed this construct. OpenMP also adds a barrier at the end > of single construct by default, so you'd want to have that here as > well, and then move the return statement to after the barrier. I didn't get this part. Remember that the code from all the WIs is there statically. The above code should work as is for this case. > One simple implementation would call memcpy or its equivalent on work > item 0 when async_work_group_copy is called, would do nothing in the > other work items, and would have a barrier in wait_group_events. Of > course, that's not asynchronous, but no events are needed. Exactly. The simplest implementation should be quite trivial. It just needs the __single() function you proposed after which a simple for-copy-loop should do. For the actual async block transferring one we need to wait when we have a device/platform that can do such. -- --Pekka |
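[Editor's note] The "simplest implementation" the two messages above converge on — only one work item performs a blocking copy loop, the event is a dummy, and wait_group_events is a no-op — can be sketched in plain C. Work-item ids are passed explicitly here to keep the sketch self-contained; in OpenCL C they would come from get_local_id(), and all function names are illustrative:

```c
/* Sketch of the minimal blocking async_work_group_copy discussed above.
 * The copy is guarded so only work item (0,0,0) in the work-group performs
 * it; the returned event is a dummy because the copy has already completed
 * by the time wait_group_events() runs, which is therefore a no-op. */
typedef int event_t;   /* dummy event: nothing to wait on */

/* the __single() test from the thread */
static int is_single(int lid0, int lid1, int lid2) {
    return lid0 == 0 && lid1 == 0 && lid2 == 0;
}

event_t wg_copy_blocking(int *dst, const int *src, int n,
                         int lid0, int lid1, int lid2) {
    if (is_single(lid0, lid1, lid2))
        for (int i = 0; i < n; ++i)   /* blocking copy, work item 0 only */
            dst[i] = src[i];
    return 0;                         /* dummy event */
}

void wait_group_events_blocking(int num_events, event_t *events) {
    (void)num_events;
    (void)events;   /* copy already completed synchronously: nothing to do */
}
```

A DMA-capable platform would instead return a transfer handle as the event and poll or take an interrupt in wait_group_events, as Pekka describes.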
From: Erik S. <esc...@pe...> - 2012-01-19 18:40:13
2012/1/19 Pekka Jääskeläinen <pek...@tu...>:
> Hi,
>
> One of the book examples uses the async_work_group_copy kernel
> function.
>
> In the longer term, these functions are important to implement
> efficiently. They allow initiating block transfers between the
> global and local memories from the kernel side and overlapping
> those transfers with kernel computation.
>
> This set of functions requires knowledge of the *platform*
> implementation (not only the device) as only the "platform"
> knows how to implement e.g. the DMA transfers from the
> device side.
>
> However, even the minimal default implementation (that is
> unaware of DMA transfers etc.) of these functions (a simple
> blocking for-loop) needs special treatment like the barrier()
> does. The function call is per WG, so all WIs in the WG must
> encounter it with the same arguments, similarly to the
> barrier().
>
> I wonder how this should be implemented. Treat
> these functions exactly like a barrier() in the pocl
> passes? One difficulty is that async_work_group_copy returns
> an event which will be blocked on with wait_group_events().
> This event variable is a private variable but in a sense
> a local variable, as only a single actual WI needs it, so
> replicating it just consumes space (but no other harm
> that I can see).

Why can't these functions just be implemented as any other run-time
functions?

If (as you suggest below) only one work item initiates the DMA, then
wrapping the function implementation in an if statement (as you
suggest below) seems the best solution. The event type can then be an
int counting the number of times this work item has called
async_work_group_copy. The outstanding DMA requests are kept in a
per-workgroup list, ordered by this int. Using such an int instead of
a pointer allows the non-DMA-initiating work items to continue without
waiting, and they will still find the respective DMA request when
wait_group_events is called.

> One way to implement this at the higher level would be to
> have a mechanism as in OpenMP where you can define
> code to be executed by only a single WI in the WG:
> https://computing.llnl.gov/tutorials/openMP/#SINGLE
>
> Some magic variable that holds the id of a single thread
> could do.
>
> event_t async_work_group_copy ( ... ) {
>
>   if (__single())
>   {
>     // perform a block copy by initiating a DMA transfer in
>     // the background
>     return event_id; /* implementation "handle" */
>   }
>   return 0; /* doesn't matter what we return for the others */
> }

What about this implementation:

bool __single() { return get_local_id(0)==0 && get_local_id(1)==0 &&
get_local_id(2)==0; }

(Technically, that's __master(), not __single(), but it would work
here.) Otherwise you'll need a lock, and keep track of which threads
have entered/passed this construct. OpenMP also adds a barrier at the
end of the single construct by default, so you'd want to have that
here as well, and then move the return statement to after the barrier.

> On the other hand, in some implementations we might actually
> want all the work items/vector lanes/threads to participate in
> the memory transfer. For example, when block DMA is not available
> but block transfers using vector loads/stores are. In that case the
> data transport could be divided across the WIs/lanes somehow.
> Anyways, let's for now assume this function is executed by only one
> WI of the WG and concentrate on implementing that.

One simple implementation would call memcpy or its equivalent on work
item 0 when async_work_group_copy is called, would do nothing in the
other work items, and would have a barrier in wait_group_events. Of
course, that's not asynchronous, but no events are needed.

-erik

-- 
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Pekka J. <pek...@tu...> - 2012-01-19 18:08:34
Hi,

One of the book examples uses the async_work_group_copy kernel
function.

In the longer term, these functions are important to implement
efficiently. They allow initiating block transfers between the
global and local memories from the kernel side and overlapping
those transfers with kernel computation.

This set of functions requires knowledge of the *platform*
implementation (not only the device) as only the "platform"
knows how to implement e.g. the DMA transfers from the
device side.

However, even the minimal default implementation (that is
unaware of DMA transfers etc.) of these functions (a simple
blocking for-loop) needs special treatment like the barrier()
does. The function call is per WG, so all WIs in the WG must
encounter it with the same arguments, similarly to the
barrier().

I wonder how this should be implemented. Treat
these functions exactly like a barrier() in the pocl
passes? One difficulty is that async_work_group_copy returns
an event which will be blocked on with wait_group_events().
This event variable is a private variable but in a sense
a local variable, as only a single actual WI needs it, so
replicating it just consumes space (but no other harm
that I can see).

One way to implement this at the higher level would be to
have a mechanism as in OpenMP where you can define
code to be executed by only a single WI in the WG:
https://computing.llnl.gov/tutorials/openMP/#SINGLE

Some magic variable that holds the id of a single thread
could do.

event_t async_work_group_copy ( ... ) {

  if (__single())
  {
    // perform a block copy by initiating a DMA transfer in
    // the background
    return event_id; /* implementation "handle" */
  }
  return 0; /* doesn't matter what we return for the others */
}

On the other hand, in some implementations we might actually
want all the work items/vector lanes/threads to participate in
the memory transfer. For example, when block DMA is not available
but block transfers using vector loads/stores are. In that case the
data transport could be divided across the WIs/lanes somehow. Anyways,
let's for now assume this function is executed by only one WI of the
WG and concentrate on implementing that.

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-17 12:43:28
On 01/17/2012 02:35 PM, Carlos Sánchez de La Lama wrote:
>> src/kernel_name/source.cl
>> -- source codes of all the kernels
>
> Is it really needed? I think the kernel BC would suffice.

Not strictly needed, it seems. CL_PROGRAM_SOURCE of clGetProgramInfo
is optional in case the Program was constructed from a binary. Thus,
the original BC should suffice.

>> bc/kernel_name/32_1_1.bc
>> -- the LLVM bitcode of the kernel as a replicated WG function
>>    with 32x1x1 WG dimensions
>
> Same with this, what's the use of the replicated BC? If we store the
> bin (ELF or whatever) for the target, this is unneeded AFAIU.

True.

>> The simplest implementation of this would only have to include the
>> src/kernel_name structure, thus no caching of compiler results at
>> all. It would be enough just to implement the API and the basis for
>> the file format.
>
> Anyways, I would definitely wait until after branching. This looks
> like a potential bug-bringer (and I am branching tomorrow).

I probably won't have time to start on this in the coming days anyway.
Whether the branch should be created while there are known critical
bugs is another question. IMHO we should delay branching until we have
taken care of those to avoid needless merges.

>> Error checking could be improved via a metadata file that describes
>> the device the target-specific files are for, etc.
>
> Let's keep it simple. If you add metadata at this stage one would
> need to edit the metadata if one wants to "tweak" the binary for any
> reason, which is likely at this stage of development. Time lost on
> metadata defining, implementation and experimenting. I would only add
> that once the implementation is settled (at least partially).

I agree.

-- 
Pekka
From: Carlos S. de La L. <car...@ur...> - 2012-01-17 12:32:18
Hi,

> How to actually implement this quickly is that we could create
> an archive file (e.g. a tar) of the needed files from the temporary
> directories created by the kernel code generation. Contents of this
> file would be returned for each device in
> clGetProgramInfo(CL_PROGRAM_BINARIES ..)

Up for that, and I would actually use some standard format (tar looks
ok) until we give this more thought. Moving to a complex and more
optimal format can be done later, while using something standard like
tar now is much better for debugging.

You might notice that the kernel structure has a "binary" field...
this was supposed to be pointing to the binary itself (when I was
still on the idea of sharing the binary through all the devices). Thus
some rework would be needed first to keep this sane.

> src/kernel_name/source.cl
> -- source codes of all the kernels

Is it really needed? I think the kernel BC would suffice.

> bc/kernel_name/32_1_1.bc
> -- the LLVM bitcode of the kernel as a replicated WG function
>    with 32x1x1 WG dimensions

Same with this, what's the use of the replicated BC? If we store the
bin (ELF or whatever) for the target, this is unneeded AFAIU.

> The simplest implementation of this would only have to include the
> src/kernel_name structure, thus no caching of compiler results at
> all. It would be enough just to implement the API and the basis for
> the file format.

Anyways, I would definitely wait until after branching. This looks
like a potential bug-bringer (and I am branching tomorrow).

> Error checking could be improved via a metadata file that describes
> the device the target-specific files are for, etc.

Let's keep it simple. If you add metadata at this stage one would need
to edit the metadata if one wants to "tweak" the binary for any
reason, which is likely at this stage of development. Time lost on
metadata defining, implementation and experimenting. I would only add
that once the implementation is settled (at least partially).

BR
Carlos
From: Pekka J. <pek...@tu...> - 2012-01-17 12:27:23
BTW, this affects the "cleanup of temp directories" discussed
yesterday, and the build process somewhat.

How this could work is that the device drivers (which do the
compilation) are handed a target file (or directory) for the final
product(s), and the directories are managed from the higher-level API
implementation of the Program. The Program object can then keep track
of the locations of the "artifact files" so we can create the tarball
out of them (and also extract them to the correct locations in the
reverse case).

We have to keep those (final) files on disk until the Program has been
destroyed, as we do not know if the program wants them dumped via
clGetProgramInfo.

On 01/17/2012 01:53 PM, Pekka Jääskeläinen wrote:
>
> How to actually implement this quickly is that we could create
> an archive file (e.g. a tar) of the needed files from the temporary
> directories created by the kernel code generation.

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-17 11:53:29
Hello,

For implementing clCreateProgramWithBinary we need to define the
program binary format that was discussed in a thread titled "Workgroup
functions caching". The final conclusion in that discussion:

I wrote:
>> The alternative could be to define a simplistic custom wrapper
>> format that wraps in the final binaries and the bc. It would not
>> care about the format of the final binaries (thus the OpenCL binary
>> container *contents* would be platform-specific, but the container
>> itself not) and would store enough metadata for choosing the correct
>> final binary based on the dimensions.

On 12/19/2011 02:26 PM, Carlos Sánchez de La Lama wrote:
> That was my first thought. I think it is the simplest way.
> Any format capable of storing several files/buffers together will do.
> I said ELF cause BIF is defined as ELF, but any format will do. I
> suspect AMD chose ELF cause they had to put the ELF libraries there
> anyways (the binary inside the .text is again ELF in their case). But
> I would go for any sensible format (an existing one, no need to
> reinvent the wheel with a custom format).

How to actually implement this quickly is that we could create an
archive file (e.g. a tar) of the needed files from the temporary
directories created by the kernel code generation. Contents of this
file would be returned for each device in
clGetProgramInfo(CL_PROGRAM_BINARIES ..).

The archive file would contain a directory structure like:

src/kernel_name/source.cl
  -- source codes of all the kernels

bc/kernel_name/original.bc
  -- the LLVM bitcode of the kernel (unreplicated version)

bc/kernel_name/32_1_1.bc
  -- the LLVM bitcode of the kernel as a replicated WG function
     with 32x1x1 WG dimensions

obj/kernel_name/32_1_1.bin
  -- the final executable binary (e.g. ELF) for the 'kernel_name'
     kernel, for WG dimensions 32x1x1

The simplest implementation of this would only have to include the
src/kernel_name structure, thus no caching of compiler results at all.
It would be enough just to implement the API and the basis for the
file format. Support for the rest of the files could be added
gradually to actually speed up the compilation process, if available
in the archive.

Error checking could be improved via a metadata file that describes
the device the target-specific files are for, etc.

What do you think?

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-17 11:24:05
Hi,

In revision 144 of trunk I committed reference counting for several
OpenCL object types. In pocl_cl.h there is now a set of macros for
generic implementations of operations that are shared across the main
OpenCL object types.

Please check it out, especially the pocl_cl.h part:
http://bazaar.launchpad.net/~pocl/pocl/trunk/revision/144

Reference counting for events is not yet there, and it's the last one
of the retain function calls the book examples need.

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-16 16:32:17
Hi,

I didn't yet add it to 'make check' (as most of the examples fail to
compile), but I added a Makefile for compiling the examples of
http://code.google.com/p/opencl-book-samples/ against a pocl *build
tree*. This should make it easier to test the build when adding new
APIs (no need to install pocl).

I listed the missing APIs for this example suite in the TODO file:

Known missing OpenCL 1.2 features
---------------------------------

Missing APIs used by the tested OpenCL example suites are entered
here. (*) == used by the opencl-book-samples.

4. THE OPENCL PLATFORM LAYER
   * 4.1 Querying platform info (properly)
   * 4.3 Partitioning a device
   * 4.4 Contexts
     * clRetainContext (*)

5. THE OPENCL RUNTIME
   * 5.1 Command queues
     * clRetainCommandQueue (*)
   * 5.2.1 Creating buffer objects
     * clCreateSubBuffer (*)
   * 5.2.4 Mapping buffer objects
     * clEnqueueMapBuffer (*)
   * 5.3 Image objects
     * clCreateImage2D (deprecated in OpenCL 1.2) (*)
   * 5.3.3 Reading, Writing and Copying Image Objects
     * clEnqueueReadImage (*)
   * 5.4 Querying, Unmapping, Migrating, ... Mem objects
   * 5.4.1 Retaining and Releasing Memory Objects
     * clRetainMemObject (*)
   * 5.4.2 Unmapping Mapped Memory Objects
     * clEnqueueUnmapMemObject (*)
   * 5.5 Sampler objects
   * 5.5.1 Creating Sampler Objects
     * clCreateSampler (*)
   * 5.6.1 Creating Program Objects
     * clCreateProgramWithBinary (*)
     * clRetainProgram (*)
   * 5.7.1 Creating Kernel Objects
     * clRetainKernel (*)
   * 5.9 Event objects
     * clWaitForEvents (*)
     * clReleaseEvent (*)
   * 5.10 Markers, Barriers and Waiting for Events
     * clEnqueueMarker (deprecated in OpenCL 1.2) (*)
   * 5.12 Profiling
   * Reference counting for all the OpenCL objects

6. THE OPENCL C PROGRAMMING LANGUAGE
   * 6.12.10 Async Copies
   * 6.12.11 Atomic functions

OpenCL 1.2 Extensions
   * 9.7 Sharing Memory Objects with OpenGL / OpenGL ES Buffer,
     Texture and Renderbuffer Objects
   * 9.7.6 Sharing memory objects that map to GL objects between GL
     and CL contexts
     * clEnqueueAcquireGLObjects (*)

There are not that many APIs missing. Event and command queue APIs are
being implemented by Kalle. Plenty of retain/release APIs are missing
as there's no reference counting implementation yet. I might pick up
on it as I already proposed a way forward with this (the macros) some
time ago here in the mailing list.

After those are in place, this set of examples is missing mainly the
image APIs, clCreateProgramWithBinary (whose implementation was
already discussed briefly), sub-buffers, and the GL/CL interoperation,
which is an extension and not high priority yet.

Sub-buffers could exploit the bufalloc code. The original buffer would
contain a mem_region from which the sub-buffers are allocated. An
additional API to bufalloc is needed where both the start and the size
of the chunk are given. I can look into this one as well.

On 01/12/2012 08:44 PM, Pekka Jääskeläinen wrote:
> It would be nice to get a large set of examples easily executable
> from 'make check' so we get constant testing of a larger test suite
> to avoid regressions during development and to get confidence in the
> implementation.

-- 
Pekka