From: Erik S. <sch...@gm...> - 2012-03-06 16:11:57
|
-- Erik Schnetter <sch...@gm...> http://www.perimeterinstitute.ca/personal/eschnetter/ |
From: Erik S. <esc...@pe...> - 2012-03-05 15:01:14
|
Pekka No, I did not code anything in this respect. -erik 2012/3/5 Pekka Jääskeläinen <pek...@tu...>: > On 12/19/2011 04:17 PM, Erik Schnetter wrote: >> >> The previous discussion was much about file formats. Does this mean that >> enqueuing a cached kernel would still require a dlopen? I was more >> hoping for caching the kernels in memory, so that enqueuing a kernel is >> really as cheap as an indirect function call. > > > Did you do some coding for this? > > I'll implement this as the ViennaCL has test cases which compile (enqueue) > the same kernel multiple times (blas3) and the kernels seem to be quite > slow to compile. Otherwise it's not sensible to add the ViennaCL tests > to the 'make check' suite. > > -- > Pekka -- Erik Schnetter <esc...@pe...> http://www.perimeterinstitute.ca/personal/eschnetter/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-03-05 09:14:27
|
On 12/19/2011 04:17 PM, Erik Schnetter wrote: > The previous discussion was much about file formats. Does this mean that > enqueuing a cached kernel would still require a dlopen? I was more > hoping for caching the kernels in memory, so that enqueuing a kernel is > really as cheap as an indirect function call. Did you do some coding for this? I'll implement this, as ViennaCL has test cases which compile (enqueue) the same kernel multiple times (blas3), and the kernels seem to be quite slow to compile. Otherwise it's not sensible to add the ViennaCL tests to the 'make check' suite. -- Pekka |
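[Editor's note] The exchange above is about caching compiled kernels in memory so that a repeated enqueue is as cheap as an indirect function call. A minimal sketch of that idea in plain C follows; all names (`kernel_fn`, `cache_lookup`, `cache_insert`, `dummy_kernel`) are illustrative assumptions, not pocl's actual API, and the cache is keyed naively on the source text:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of an in-memory compiled-kernel cache: entries are
 * kept in a linked list keyed by kernel source text, so a repeated enqueue
 * finds the compiled entry point instead of recompiling. */
typedef void (*kernel_fn)(void *args);

/* stand-in for a compiled kernel entry point */
static void dummy_kernel(void *args) { (void)args; }

struct cache_entry {
    char *source;               /* key: the kernel source text */
    kernel_fn fn;               /* value: compiled entry point */
    struct cache_entry *next;
};

static struct cache_entry *cache_head = NULL;

/* Returns the cached entry point, or NULL on a miss (caller compiles). */
kernel_fn cache_lookup(const char *source) {
    for (struct cache_entry *e = cache_head; e != NULL; e = e->next)
        if (strcmp(e->source, source) == 0)
            return e->fn;       /* cache hit: no recompilation needed */
    return NULL;
}

void cache_insert(const char *source, kernel_fn fn) {
    struct cache_entry *e = malloc(sizeof *e);
    e->source = malloc(strlen(source) + 1);
    strcpy(e->source, source);
    e->fn = fn;
    e->next = cache_head;
    cache_head = e;
}
```

A real implementation would hash the source and also key on build options, but the lookup-before-compile shape is the point being discussed.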
From: Pekka J. <pek...@tu...> - 2012-02-08 09:53:55
|
On 02/04/2012 12:39 PM, Pekka Jääskeläinen wrote: > If it is, I think the easiest fix is to propose a new metadata Actually, the easiest hackish workaround would be to just assume that all llvm global buffers inside the module with a "non-C" name (clang seems to use k.b as the name, where k is the name of the kernel and b the name of the buffer) which are referred to from kernels are automatic local buffers. It's not robust enough but might do until a better fix (e.g. the metadata) is in place. It seems there's some kind of wider debate going on in LLVMdev on how to properly implement the OpenCL (or any language-specific) metadata handling. -- Pekka |
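[Editor's note] The name-based heuristic proposed above amounts to a one-line check; a sketch follows. The function name is hypothetical, and as the email itself stresses, the heuristic is not robust — it merely exploits the fact that a C identifier cannot contain a '.':

```c
#include <string.h>
#include <stdbool.h>

/* Sketch of the proposed workaround: an LLVM global whose name contains a
 * '.' (e.g. "k.b" for automatic local buffer b in kernel k) cannot have come
 * from a C identifier, so it is assumed to be an automatic local buffer.
 * Illustrative only; a metadata-based scheme would replace this. */
bool looks_like_automatic_local(const char *global_name) {
    return strchr(global_name, '.') != NULL;
}
```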
From: Pekka J. <pek...@tu...> - 2012-02-04 10:40:08
|
Carlos, Please confirm this is the problem: On 02/02/2012 02:39 PM, Pekka Jääskeläinen wrote: > Is the problem that for targets which map all the address spaces > to zero you cannot differentiate what is a local buffer and what > is a global one, thus unable to know to which one needs to allocate > local space before launching the kernel? If it is, I think the easiest fix is to propose a new metadata for the automatic local buffers to Clang which we can then use to allocate the space in the launcher (and it should work also for targets with locals being mapped to AS0). I can do this if I just get your confirmation. This is a nasty nuisance for non-TCE targets I'd like to get done with. -- --Pekka |
From: Pekka J. <pek...@tu...> - 2012-02-02 12:39:38
|
Hi, I don't understand the problem. Please elaborate as I do not remember the (live) discussion we had before you added the "sed hack". Here is the output (llvm/Clang trunk from yesterday) from compiling the following kernel:

kernel void k (local int *a) {
  local int b[100];
  local int *p;
  p = b;
  p = a;
}

with

clang kernel_with_locals.cl -emit-llvm -c -O0 -ccc-host-triple tce-tut-llvm -o - | llvm-dis -o -

; ModuleID = '<stdin>'
target datalayout = "E-p:32:32:32-i1:8:8-i8:8:32-i16:16:32-i32:32:32-i64:32:32-f32:32:32-f64:32:32-v64:32:32-v128:32:32-a0:0:32-n32"
target triple = "tce-tut-llvm"

@k.b = internal addrspace(4) global [100 x i32] zeroinitializer, align 4

define void @k(i32 addrspace(4)* %a) nounwind noinline {
entry:
  %a.addr = alloca i32 addrspace(4)*, align 4
  %p = alloca i32 addrspace(4)*, align 4
  store i32 addrspace(4)* %a, i32 addrspace(4)** %a.addr, align 4
  store i32 addrspace(4)* getelementptr inbounds ([100 x i32] addrspace(4)* @k.b, i32 0, i32 0), i32 addrspace(4)** %p, align 4
  %0 = load i32 addrspace(4)** %a.addr, align 4
  store i32 addrspace(4)* %0, i32 addrspace(4)** %p, align 4
  ret void
}

!opencl.kernels = !{!0}
!0 = metadata !{void (i32 addrspace(4)*)* @k}

-----------

So, the automatic local is a "llvm global" with the correct addrspace set. Also the argument pointer has the correct addrspace. p itself is allocated in the stack (a private variable). On 02/02/2012 02:13 PM, Carlos Sánchez de La Lama wrote: > should "p" have the address space qualifier? if it has, then "p = b" is > illegal. If it has not, then "p = a" is illegal. p itself is a private variable pointing to a local variable so I think both of the cases should be legal. Is the problem that for targets which map all the address spaces to zero you cannot differentiate what is a local buffer and what is a global one, thus unable to know to which one needs to allocate local space before launching the kernel? -- Pekka |
From: Carlos S. de La L. <car...@ur...> - 2012-02-02 12:13:11
|
Hi guys, local pointer handling is now done with a preprocessor + sed hack to work around clang limitations. The background here is that clang does not provide information in LLVM IR about whether a pointer is local or not, which is needed by pocl (and, although this requirement might be removable, this information is needed on architectures with a real separate address space for local memory). The hack adds an address space qualifier on local pointers in kernel arguments, which makes them recognizable in IR, and keeps normal local variable declarations untouched (which is required in that case). Automatic local variables are handled by clang by creating "global" equivalents, which the pocl passes then transform as required. clang crashes if these automatic globals have an address space requirement. The problem then arises when a kernel declares a local variable which is a pointer to a local buffer. Example:

kernel void k (local int *a) {
  local int b[100];
  local int *p;
  ...
}

Should "p" have the address space qualifier? If it has, then "p = b" is illegal. If it has not, then "p = a" is illegal. I guess this means the address space qualifier on kernel pointer arguments has to be dropped (sigh), which means medium-deep changes in pocl. BTW, Pekka, do you know how that problematic pointer assignment case works when using the TCE backend? (which AFAIR assigns address spaces to locals) BR Carlos |
From: Pekka J. <pek...@tu...> - 2012-01-24 16:45:13
|
FYI, I implemented the minimal version of clCreateProgramWithBinary with the binary format being directly the sequential LLVM bitcode of the kernels. The book example that tests the functionality works now. As there's no caching of the final target binaries in the binary format yet, it doesn't speed up the kernel compilation much at all. For that we need to implement the discussed archive binary format. On 01/17/2012 02:43 PM, Pekka Jääskeläinen wrote: > Not strictly needed it seems. CL_PROGRAM_SOURCE of clGetProgramInfo > is optional in case the Program was constructed from a binary. Thus, > the original BC should suffice. -- Pekka |
From: Erik S. <esc...@pe...> - 2012-01-22 15:10:34
|
2012/1/22 Pekka Jääskeläinen <pek...@tu...>: > Hi Erik, > > > On 01/20/2012 07:05 PM, Erik Schnetter wrote: >> >> 3-element vectors can't be accessed directly in memory, because they >> need to be aligned the same was as 4-vectors. One needs to use >> vload/vstore for this, which doesn't require alignment. >> >> I assume that 3-element vectors would be stored packed in global >> memory. I don't know whether this is what async_copy actually >> expects... However, un-packed storage can be handled by passing a >> respective 4-element vector as gentype. > > > I'm not sure if I got this. The current implementation of > the async copy is just a for loop that copies elements of > the arrays. > > The elements in the copy loop are handled as elements of > the actual gentype, e.g. float3. As far as I understood > this means the compiler should take care of the correct alignment of > the vectors in memory because it's stated with the alignment > attribute (to be the same as float4) both in the host side and the > kernel side. > > Did you check from the generated assembly that the copy loop does > not produce a correct alignment when copying buffers with 3-element > vectors? > > "The vload3 and vstore3 built-in functions can be used to read and > write, respectively, 3-component vector data types from an array of > packed scalar data type." > > From cl_platform.h: > > /* cl_float3 is identical in size, alignment and behavior to cl_float4. See > section 6.1.5. */ > typedef cl_float4 cl_float3; > > > So, my understanding of this is that it should work because both the > host and the device code are in the same understanding of the alignment > of the 3-element vectors. I.e. it's stored 4-aligned everywhere if > you handle buffers that contain 3-element vectors and thus copy loop > should be generated to code that adheres to this. 
> > If the input was packed (basically a float buffer) and one would like > to copy it to a buffer of 3-wide vectors, then vload3 and vstore3 should > be used. > > Please correct me if I understood this wrongly. Yes, this is correct. I was assuming that packed buffers were more common than unpacked ones (otherwise, why would one use float3?), and that async_copy would need to handle this. I now realise that there is no basis for this assumption of mine. Indeed there is a footnote in the standard stating that float3 should be copied in the same way as float4. Your implementation is correct. -erik -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-22 12:19:55
|
Hi Erik, On 01/20/2012 07:05 PM, Erik Schnetter wrote: > 3-element vectors can't be accessed directly in memory, because they > need to be aligned the same was as 4-vectors. One needs to use > vload/vstore for this, which doesn't require alignment. > > I assume that 3-element vectors would be stored packed in global > memory. I don't know whether this is what async_copy actually > expects... However, un-packed storage can be handled by passing a > respective 4-element vector as gentype. I'm not sure if I got this. The current implementation of the async copy is just a for loop that copies elements of the arrays. The elements in the copy loop are handled as elements of the actual gentype, e.g. float3. As far as I understood this means the compiler should take care of the correct alignment of the vectors in memory because it's stated with the alignment attribute (to be the same as float4) both in the host side and the kernel side. Did you check from the generated assembly that the copy loop does not produce a correct alignment when copying buffers with 3-element vectors? "The vload3 and vstore3 built-in functions can be used to read and write, respectively, 3-component vector data types from an array of packed scalar data type." From cl_platform.h: /* cl_float3 is identical in size, alignment and behavior to cl_float4. See section 6.1.5. */ typedef cl_float4 cl_float3; So, my understanding of this is that it should work because both the host and the device code are in the same understanding of the alignment of the 3-element vectors. I.e. it's stored 4-aligned everywhere if you handle buffers that contain 3-element vectors and thus copy loop should be generated to code that adheres to this. If the input was packed (basically a float buffer) and one would like to copy it to a buffer of 3-wide vectors, then vload3 and vstore3 should be used. Please correct me if I understood this wrongly. -- --Pekka |
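[Editor's note] The cl_platform.h convention quoted above (cl_float3 identical in size and alignment to cl_float4) can be mirrored with the GCC/Clang vector_size extension, which is how the "4-aligned everywhere" claim can be checked on the host side. The type names below are stand-ins for the OpenCL ones, not the real cl_float3/cl_float4 definitions:

```c
/* Mirror of the cl_platform.h convention discussed in the thread: a
 * 3-element float vector gets the size and alignment of a 4-element one,
 * so buffers of float3 are laid out 16-byte aligned, not packed.
 * Uses the GCC/Clang vector_size extension (requires C11 for the
 * static assertion). */
typedef float float4_t __attribute__((vector_size(16)));
typedef float4_t float3_t;   /* identical in size, alignment and behavior */

_Static_assert(sizeof(float3_t) == sizeof(float4_t),
               "float3 must be stored like float4");
```

Packed 3-float data, by contrast, is what vload3/vstore3 are for, as the quoted spec text says.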
From: Erik S. <esc...@pe...> - 2012-01-20 17:06:00
|
Pekka 3-element vectors can't be accessed directly in memory, because they need to be aligned the same way as 4-vectors. One needs to use vload/vstore for this, which doesn't require alignment. I assume that 3-element vectors would be stored packed in global memory. I don't know whether this is what async_copy actually expects... However, un-packed storage can be handled by passing a respective 4-element vector as gentype. -erik 2012/1/20 Pekka Jääskeläinen <pek...@tu...>: > OK, > > I implemented the non-strided versions in r159 using a trivial for-loop. > I didn't test them yet (the spmv book example case that uses the functions > fails with the bug regarding the __local pointers). Feel free to > modify or comment. > > The easiest optimization for the implementation is to define > versions that use vector loads and stores for the vector gentypes > for the architectures with SIMD loads/stores (e.g. x86_64 with avx/sse). > > Have a nice weekend! > > > On 01/20/2012 05:14 PM, Erik Schnetter wrote: >> >> Pekka >> >> I thought about using include files to instantiate macros, but so far, >> I've kept things to using macros only. Instead, there is a different >> macro for each "kind" of prototype. We can change this to using >> #include instead (which would probably reduce the amount of code), but >> this would also make it more complex to instantiate macros -- this >> would then require a #define, and #include, and an #undef for each >> function. >> >> I would instead #define a new macro just for async_work_group_copy in >> _kernel.h. There would probably be another specific macro in >> templates.h to help instantiating the function definitions. This would >> be similar to the vload/vstore functions. >> >> This is similar to what you suggest except it doesn't use an #include >> file. There is no particular reason not to use an #include file, >> except that we currently don't. If you think it significantly >> simplifies the code, then do it. 
>> >> -erik >> >> 2012/1/20 Pekka Jääskeläinen<pek...@tu...>: >>> >>> Erik, >>> >>> The function protototype for the async copy includes a "gentype" >>> in a position not supported by the current "generator macros" of >>> yours. >>> >>> event_t async_work_group_copy ( >>> __local gentype *dst, >>> const __global gentype *src, >>> size_t num_gentypes, >>> event_t event); >>> >>> What do you think is best way to generate the declarations and >>> definitions for such functions? >>> >>> Something like: >>> >>> #define __FUNC_PROTO(gentype) \ >>> __attribute__ ((overloadable)) \ >>> event_t async_work_group_copy ( \ >>> __local gentype *dst, \ >>> const __global gentype *src, \ >>> size_t num_gentypes, \ >>> event_t event) \ >>> >>> #include "gentype_func_decl.inc" >>> >>> Then that .inc would have the macro instantiated >>> with all the different value types for gentype. E.g.: >>> >>> __FUNC_PROTO(float); >>> __FUNC_PROTO(float2); >>> __FUNC_PROTO(float4); >>> ... >>> >>> Similarly for the definitions. Here I think we >>> can assume both of the gentypes in the function >>> are always the same so we do not have to generate >>> all combinations. >>> >>> What do you think? >>> >>> -- >>> Pekka >>> >>> >>> ------------------------------------------------------------------------------ >>> Keep Your Developer Skills Current with LearnDevNow! >>> The most comprehensive online learning library for Microsoft developers >>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, >>> Metro Style Apps, more. Free future releases when you subscribe now! >>> http://p.sf.net/sfu/learndevnow-d2d >>> _______________________________________________ >>> Pocl-devel mailing list >>> Poc...@li... >>> https://lists.sourceforge.net/lists/listinfo/pocl-devel >> >> >> >> > > > -- > Pekka -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-20 16:41:42
|
OK, I implemented the non-strided versions in r159 using a trivial for-loop. I didn't test them yet (the spmv book example case that uses the functions fails with the bug regarding the __local pointers). Feel free to modify or comment. The easiest optimization for the implementation is to define versions that use vector loads and stores for the vector gentypes for the architectures with SIMD loads/stores (e.g. x86_64 with avx/sse). Have a nice weekend! On 01/20/2012 05:14 PM, Erik Schnetter wrote: > Pekka > > I thought about using include files to instantiate macros, but so far, > I've kept things to using macros only. Instead, there is a different > macro for each "kind" of prototype. We can change this to using > #include instead (which would probably reduce the amount of code), but > this would also make it more complex to instantiate macros -- this > would then require a #define, and #include, and an #undef for each > function. > > I would instead #define a new macro just for async_work_group_copy in > _kernel.h. There would probably be another specific macro in > templates.h to help instantiating the function definitions. This would > be similar to the vload/vstore functions. > > This is similar to what you suggest except it doesn't use an #include > file. There is no particular reason not to use an #include file, > except that we currently don't. If you think it significantly > simplifies the code, then do it. > > -erik > > 2012/1/20 Pekka Jääskeläinen<pek...@tu...>: >> Erik, >> >> The function protototype for the async copy includes a "gentype" >> in a position not supported by the current "generator macros" of >> yours. >> >> event_t async_work_group_copy ( >> __local gentype *dst, >> const __global gentype *src, >> size_t num_gentypes, >> event_t event); >> >> What do you think is best way to generate the declarations and >> definitions for such functions? 
>> >> Something like: >> >> #define __FUNC_PROTO(gentype) \ >> __attribute__ ((overloadable)) \ >> event_t async_work_group_copy ( \ >> __local gentype *dst, \ >> const __global gentype *src, \ >> size_t num_gentypes, \ >> event_t event) \ >> >> #include "gentype_func_decl.inc" >> >> Then that .inc would have the macro instantiated >> with all the different value types for gentype. E.g.: >> >> __FUNC_PROTO(float); >> __FUNC_PROTO(float2); >> __FUNC_PROTO(float4); >> ... >> >> Similarly for the definitions. Here I think we >> can assume both of the gentypes in the function >> are always the same so we do not have to generate >> all combinations. >> >> What do you think? >> >> -- >> Pekka >> >> ------------------------------------------------------------------------------ >> Keep Your Developer Skills Current with LearnDevNow! >> The most comprehensive online learning library for Microsoft developers >> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, >> Metro Style Apps, more. Free future releases when you subscribe now! >> http://p.sf.net/sfu/learndevnow-d2d >> _______________________________________________ >> Pocl-devel mailing list >> Poc...@li... >> https://lists.sourceforge.net/lists/listinfo/pocl-devel > > > -- Pekka |
From: Erik S. <esc...@pe...> - 2012-01-20 15:14:17
|
Pekka I thought about using include files to instantiate macros, but so far, I've kept things to using macros only. Instead, there is a different macro for each "kind" of prototype. We can change this to using #include instead (which would probably reduce the amount of code), but this would also make it more complex to instantiate macros -- this would then require a #define, and #include, and an #undef for each function. I would instead #define a new macro just for async_work_group_copy in _kernel.h. There would probably be another specific macro in templates.h to help instantiating the function definitions. This would be similar to the vload/vstore functions. This is similar to what you suggest except it doesn't use an #include file. There is no particular reason not to use an #include file, except that we currently don't. If you think it significantly simplifies the code, then do it. -erik 2012/1/20 Pekka Jääskeläinen <pek...@tu...>: > Erik, > > The function protototype for the async copy includes a "gentype" > in a position not supported by the current "generator macros" of > yours. > > event_t async_work_group_copy ( > __local gentype *dst, > const __global gentype *src, > size_t num_gentypes, > event_t event); > > What do you think is best way to generate the declarations and > definitions for such functions? > > Something like: > > #define __FUNC_PROTO(gentype) \ > __attribute__ ((overloadable)) \ > event_t async_work_group_copy ( \ > __local gentype *dst, \ > const __global gentype *src, \ > size_t num_gentypes, \ > event_t event) \ > > #include "gentype_func_decl.inc" > > Then that .inc would have the macro instantiated > with all the different value types for gentype. E.g.: > > __FUNC_PROTO(float); > __FUNC_PROTO(float2); > __FUNC_PROTO(float4); > ... > > Similarly for the definitions. Here I think we > can assume both of the gentypes in the function > are always the same so we do not have to generate > all combinations. > > What do you think? 
> > -- > Pekka > > ------------------------------------------------------------------------------ > Keep Your Developer Skills Current with LearnDevNow! > The most comprehensive online learning library for Microsoft developers > is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, > Metro Style Apps, more. Free future releases when you subscribe now! > http://p.sf.net/sfu/learndevnow-d2d > _______________________________________________ > Pocl-devel mailing list > Poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-20 09:17:54
|
Erik, The function prototype for the async copy includes a "gentype" in a position not supported by the current "generator macros" of yours.

event_t async_work_group_copy (
    __local gentype *dst,
    const __global gentype *src,
    size_t num_gentypes,
    event_t event);

What do you think is the best way to generate the declarations and definitions for such functions? Something like:

#define __FUNC_PROTO(gentype) \
  __attribute__ ((overloadable)) \
  event_t async_work_group_copy ( \
    __local gentype *dst, \
    const __global gentype *src, \
    size_t num_gentypes, \
    event_t event) \

#include "gentype_func_decl.inc"

Then that .inc would have the macro instantiated with all the different value types for gentype. E.g.:

__FUNC_PROTO(float);
__FUNC_PROTO(float2);
__FUNC_PROTO(float4);
...

Similarly for the definitions. Here I think we can assume both of the gentypes in the function are always the same so we do not have to generate all combinations. What do you think? -- Pekka |
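[Editor's note] The macro-only instantiation style Erik prefers (one macro expanded per gentype, no .inc file) can be sketched in plain C. C has no overloading, so each instance below gets a type-suffixed name; the names and the simple copy-loop body are illustrative assumptions, not pocl's actual generated code:

```c
#include <stddef.h>

/* Sketch of the macro-instantiation pattern discussed above: one macro
 * expands to a blocking copy-loop definition per element type, in the
 * spirit of the trivial async_work_group_copy implementation. */
#define DEFINE_WG_COPY(gentype)                                  \
    void wg_copy_##gentype(gentype *dst, const gentype *src,     \
                           size_t num_gentypes)                  \
    {                                                            \
        for (size_t i = 0; i < num_gentypes; ++i)                \
            dst[i] = src[i];   /* element-wise blocking copy */  \
    }

/* Instantiate for each value type (float2, float4, ... would follow). */
DEFINE_WG_COPY(float)
DEFINE_WG_COPY(int)
```

In the real kernel library the OpenCL `overloadable` attribute lets every instance share one name, which is what the `__FUNC_PROTO` proposal relies on.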
From: Pekka J. <pek...@tu...> - 2012-01-19 21:28:57
|
On 01/19/2012 10:28 PM, Erik Schnetter wrote: > It is then not immediately clear in which order the individual work > items are executing the async_work_group_copy(). In my reading (which > may be flawed), this code should be legal -- each work item needs to > call wait_group_events() (or a barrier) with the same arguments, but > it doesn't need to be the same call site. OK, got it. I think it's quite clear the WI control semantics is the same as with the barrier(). If it's not, it leads to tricky cases without apparent benefit. In the other interpretation, how would one in practice differentiate between two separate async copies (even if they were to the same address as one can modify the memory in between the calls)? Or the other way around, how would one prove the copy calls actually should be merged? Without figuring it out somehow, your example should be interpreted as an undefined program as all WIs should execute both of the async copy calls, which is impossible. The specs should be more clear with this but I'm quite sure the "control semantics" are the same as with barriers. Let's trust on that until shown otherwise. -- --Pekka |
From: Erik S. <esc...@pe...> - 2012-01-19 20:28:53
|
2012/1/19 Pekka Jääskeläinen <pek...@tu...>: > On 01/19/2012 08:39 PM, Erik Schnetter wrote: >> >> Why can't these function not just be implemented as any other run-time >> functions? > > > It should be. We need to be able to override it also on *platform* basis > (for which there is not yet an abstraction layer in pocl), not only > in the kernel library (which are per instruction set architecture now). > The default generic one can just use a "CPU copy" for now. > > >> If (as you suggest below) only one work item initiates the DMA, then >> wrapping the function implementation in an if statement (as you >> suggest below) seems the best solution. The event type can then be an >> int counting the number of times this work item has called >> async_work_group_copy. The outstanding DMA requests are kept in a >> per-workgroup list, ordered by this int. > > > Yes. Something like that I had in mind. Actual implementation > is going to poke some I/O registers to initiate the DMA and > possibly implement an interrupt handler to get notified when > the transfer is ready (or via polling some I/O reg). > > During that process there will be a "handle" of some kind involved > so I thought we can just return that as the event value. In the > default simple implementation we can just block, thus it doesn't > matter what we return as the event because the wait_group_events > will be a dummy. > > >> Using such an int instead of a pointer allows the non-DMA-initiating >> work items to continue without waiting, and they will still find the >> respective DMA request when wait_group_events is called. > > > Remember that WIs are statically scheduled inside WG in pocl. They > are not mapped to threads with independent control. However, the > transfers initiated by multiple WGs executing in the devices is a > different story I think we do not need to yet go into. 
> > >> What about this implementation: >> >> bool __single() { return get_local_id(0)==0&& get_local_id(1)==0&& >> get_local_id(2)==0; } > > Looks good to me. > > >> (Technically, that's __master(), not __single(), but it would work >> here.) Otherwise you'll need a lock, and keep track which threads have >> entered/passed this construct. OpenMP also adds a barrier at the end >> of single construct by default, so you'd want to have that here as >> well, and then move the return statement to after the barrier. > > > I didn't get this part. Remember that the code from all the WIs is > there statically. The above code should work as is for this case. I was thinking of a case where there is an if statement, and control flow between different work items diverges. A routine may have different call sites to async_work_group_copy (all with identical arguments, but called from different lines in the source code), and one would have to make these match up. Example (e.g. work item size 2):

if (get_local_id(0)==0) {
  // do something
  async_work_group_copy();
  // do something else
  wait_group_events();
  // more stuff
} else {
  // do something
  async_work_group_copy();
  // do something else
  wait_group_events();
  // more stuff
}

It is then not immediately clear in which order the individual work items are executing the async_work_group_copy(). In my reading (which may be flawed), this code should be legal -- each work item needs to call wait_group_events() (or a barrier) with the same arguments, but it doesn't need to be the same call site. In this case, a barrier is necessary. (Alternatively, pocl could handle wait_group_events the same way as a call to a barrier while duplicating the code -- you mentioned this earlier.) pocl's handling of barriers should then already handle this correctly. (Or, if it doesn't, there's either a bug, or a misunderstanding of OpenCL semantics on my side.) 
-erik -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2012-01-19 19:25:10
|
On 01/19/2012 08:39 PM, Erik Schnetter wrote: > Why can't these function not just be implemented as any other run-time > functions? It should be. We need to be able to override it also on *platform* basis (for which there is not yet an abstraction layer in pocl), not only in the kernel library (which are per instruction set architecture now). The default generic one can just use a "CPU copy" for now. > If (as you suggest below) only one work item initiates the DMA, then > wrapping the function implementation in an if statement (as you > suggest below) seems the best solution. The event type can then be an > int counting the number of times this work item has called > async_work_group_copy. The outstanding DMA requests are kept in a > per-workgroup list, ordered by this int. Yes. Something like that I had in mind. Actual implementation is going to poke some I/O registers to initiate the DMA and possibly implement an interrupt handler to get notified when the transfer is ready (or via polling some I/O reg). During that process there will be a "handle" of some kind involved so I thought we can just return that as the event value. In the default simple implementation we can just block, thus it doesn't matter what we return as the event because the wait_group_events will be a dummy. > Using such an int instead of a pointer allows the non-DMA-initiating > work items to continue without waiting, and they will still find the > respective DMA request when wait_group_events is called. Remember that WIs are statically scheduled inside WG in pocl. They are not mapped to threads with independent control. However, the transfers initiated by multiple WGs executing in the devices is a different story I think we do not need to yet go into. > What about this implementation: > > bool __single() { return get_local_id(0)==0&& get_local_id(1)==0&& > get_local_id(2)==0; } Looks good to me. > (Technically, that's __master(), not __single(), but it would work > here.) 
Otherwise you'll need a lock, and keep track which threads have > entered/passed this construct. OpenMP also adds a barrier at the end > of single construct by default, so you'd want to have that here as > well, and then move the return statement to after the barrier. I didn't get this part. Remember that the code from all the WIs is there statically. The above code should work as is for this case. > One simple implementation would call memcpy or its equivalent on work > item 0 when async_work_group_copy is called, would do nothing in the > other work items, and would have a barrier in wait_group_events. Of > course, that's not asynchronous, but no events are needed. Exactly. The simplest implementation should be quite trivial. It just needs the __single() function you proposed after which a simple for-copy-loop should do. For the actual async block transferring one we need to wait when we have a device/platform that can do such. -- --Pekka |
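[Editor's note] The "simplest implementation" the two messages above converge on — only one work item performs a blocking copy loop, the event is a dummy, and wait_group_events is a no-op — can be sketched in plain C. Work-item ids are passed explicitly here to keep the sketch self-contained; in OpenCL C they would come from get_local_id(), and all function names are illustrative:

```c
/* Sketch of the minimal blocking async_work_group_copy discussed above.
 * The copy is guarded so only work item (0,0,0) in the work-group performs
 * it; the returned event is a dummy because the copy has already completed
 * by the time wait_group_events() runs, which is therefore a no-op. */
typedef int event_t;   /* dummy event: nothing to wait on */

/* the __single() test from the thread */
static int is_single(int lid0, int lid1, int lid2) {
    return lid0 == 0 && lid1 == 0 && lid2 == 0;
}

event_t wg_copy_blocking(int *dst, const int *src, int n,
                         int lid0, int lid1, int lid2) {
    if (is_single(lid0, lid1, lid2))
        for (int i = 0; i < n; ++i)   /* blocking copy, work item 0 only */
            dst[i] = src[i];
    return 0;                         /* dummy event */
}

void wait_group_events_blocking(int num_events, event_t *events) {
    (void)num_events;
    (void)events;   /* copy already completed synchronously: nothing to do */
}
```

A DMA-capable platform would instead return a transfer handle as the event and poll or take an interrupt in wait_group_events, as Pekka describes.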
From: Erik S. <esc...@pe...> - 2012-01-19 18:40:13
2012/1/19 Pekka Jääskeläinen <pek...@tu...>:
> Hi,
>
> One of the book examples uses the async_work_group_copy kernel
> function.
>
> In the longer term, these functions are important to implement
> efficiently. They allow initiating block transfers between the
> global and local memories from the kernel side and overlapping
> those transfers with kernel computation.
>
> This set of functions requires knowledge of the *platform*
> implementation (not only the device) as only the "platform"
> knows how to implement e.g. the DMA transfers from the
> device side.
>
> However, even the minimal default implementation (that is
> unaware of DMA transfers etc.) of these functions (a simple
> blocking for-loop) needs special treatment like the barrier()
> does. The function call is per WG, so all WIs in the WG must
> encounter it with the same arguments, similarly to the
> barrier().
>
> I wonder how this should be implemented. Treat
> these functions exactly like a barrier() in the pocl
> passes? One difficulty is that async_work_group_copy returns
> an event which will be blocked on with wait_group_events().
> This event variable is a private variable but in a sense
> a local variable, as only a single actual WI needs it, so
> replicating it just consumes space (but no other harm
> that I can see).

Why can't these functions just be implemented as any other run-time
functions?

If (as you suggest below) only one work item initiates the DMA, then
wrapping the function implementation in an if statement (as you
suggest below) seems the best solution. The event type can then be an
int counting the number of times this work item has called
async_work_group_copy. The outstanding DMA requests are kept in a
per-workgroup list, ordered by this int. Using such an int instead of
a pointer allows the non-DMA-initiating work items to continue without
waiting, and they will still find the respective DMA request when
wait_group_events is called.

> One way to implement this at the higher level would be to
> have a mechanism as in OpenMP where you can define
> code to be executed by only a single WI in the WG:
> https://computing.llnl.gov/tutorials/openMP/#SINGLE
>
> Some magic variable that holds the id of a single thread
> could do.
>
> event_t async_work_group_copy ( ... ) {
>
>   if (__single())
>   {
>     // perform a block copy by initiating a DMA transfer in
>     // the background
>     return event_id; /* implementation "handle" */
>   }
>   return 0; /* doesn't matter what we return for the others */
> }

What about this implementation:

bool __single() { return get_local_id(0)==0 && get_local_id(1)==0 &&
get_local_id(2)==0; }

(Technically, that's __master(), not __single(), but it would work
here.) Otherwise you'll need a lock, and keep track of which threads
have entered/passed this construct. OpenMP also adds a barrier at the
end of the single construct by default, so you'd want to have that
here as well, and then move the return statement to after the barrier.

> On the other hand, in some implementations we might actually
> want all the work items/vector lanes/threads to participate in
> the memory transfer. For example, when block DMA is not available
> but block transfers using vector loads/stores are. In that case the
> data transport could be divided across the WIs/lanes somehow.
> Anyways, let's for now assume this function is executed by only one
> WI of the WG and concentrate on implementing that.

One simple implementation would call memcpy or its equivalent on work
item 0 when async_work_group_copy is called, would do nothing in the
other work items, and would have a barrier in wait_group_events. Of
course, that's not asynchronous, but no events are needed.

-erik

-- 
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Pekka J. <pek...@tu...> - 2012-01-19 18:08:34
Hi,

One of the book examples uses the async_work_group_copy kernel
function.

In the longer term, these functions are important to implement
efficiently. They allow initiating block transfers between the
global and local memories from the kernel side and overlapping
those transfers with kernel computation.

This set of functions requires knowledge of the *platform*
implementation (not only the device) as only the "platform"
knows how to implement e.g. the DMA transfers from the
device side.

However, even the minimal default implementation (that is
unaware of DMA transfers etc.) of these functions (a simple
blocking for-loop) needs special treatment like the barrier()
does. The function call is per WG, so all WIs in the WG must
encounter it with the same arguments, similarly to the
barrier().

I wonder how this should be implemented. Treat
these functions exactly like a barrier() in the pocl
passes? One difficulty is that async_work_group_copy returns
an event which will be blocked on with wait_group_events().
This event variable is a private variable but in a sense
a local variable, as only a single actual WI needs it, so
replicating it just consumes space (but no other harm
that I can see).

One way to implement this at the higher level would be to
have a mechanism as in OpenMP where you can define
code to be executed by only a single WI in the WG:
https://computing.llnl.gov/tutorials/openMP/#SINGLE

Some magic variable that holds the id of a single thread
could do.

event_t async_work_group_copy ( ... ) {

  if (__single())
  {
    // perform a block copy by initiating a DMA transfer in
    // the background
    return event_id; /* implementation "handle" */
  }
  return 0; /* doesn't matter what we return for the others */
}

On the other hand, in some implementations we might actually
want all the work items/vector lanes/threads to participate in
the memory transfer. For example, when block DMA is not available
but block transfers using vector loads/stores are. In that case the
data transport could be divided across the WIs/lanes somehow. Anyways,
let's for now assume this function is executed by only one WI of the
WG and concentrate on implementing that.

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-17 12:43:28
On 01/17/2012 02:35 PM, Carlos Sánchez de La Lama wrote:
>> src/kernel_name/source.cl
>> -- source codes of all the kernels
>
> Is it really needed? I think the kernel BC would suffice.

Not strictly needed, it seems. CL_PROGRAM_SOURCE of clGetProgramInfo
is optional in case the Program was constructed from a binary. Thus,
the original BC should suffice.

>> bc/kernel_name/32_1_1.bc
>> -- the LLVM bitcode of the kernel as a replicated WG function
>>    with 32x1x1 WG dimensions
>
> Same with this, what's the use of the replicated BC? If we store the
> bin (ELF or whatever) for the target, this is unneeded AFAIU.

True.

>> The simplest implementation of this would only have to include the
>> src/kernel_name structure, thus no caching of compiler results at
>> all. It would be enough just to implement the API and the basis for
>> the file format.
>
> Anyways, I would definitely wait until after branching. This looks
> like a potential bug-bringer (and I am branching tomorrow).

I probably won't have time to start on this in the coming days anyway.
Whether the branch should be created while there are known critical
bugs is another question. IMHO we should delay branching until we have
taken care of those to avoid needless merges.

>> Error checking could be improved via a metadata file that describes
>> the device the target-specific files are for, etc.
>
> Let's keep it simple. If you add metadata at this stage one would
> need to edit the metadata if one wants to "tweak" the binary for any
> reason, which is likely at this stage of development. Time lost on
> metadata defining, implementation and experimenting. I would only add
> that once the implementation is settled (at least partially).

I agree.

-- 
Pekka
From: Carlos S. de La L. <car...@ur...> - 2012-01-17 12:32:18
Hi,

> How to actually implement this quickly is that we could create
> an archive file (e.g. a tar) of the needed files from the temporary
> directories created by the kernel code generation. Contents of this
> file would be returned for each device in
> clGetProgramInfo(CL_PROGRAM_BINARIES ..)

Up for that, and I would actually use some standard format (tar looks
ok) until we give this more thought. Moving to a complex and more
optimal format can be done later, while using something standard like
tar now is much better for debugging.

You might notice that the kernel structure has a "binary" field...
this was supposed to be pointing to the binary itself (when I was
still on the idea of sharing the binary through all the devices). Thus
some rework would be needed first to keep this sane.

> src/kernel_name/source.cl
> -- source codes of all the kernels

Is it really needed? I think the kernel BC would suffice.

> bc/kernel_name/32_1_1.bc
> -- the LLVM bitcode of the kernel as a replicated WG function
>    with 32x1x1 WG dimensions

Same with this, what's the use of the replicated BC? If we store the
bin (ELF or whatever) for the target, this is unneeded AFAIU.

> The simplest implementation of this would only have to include the
> src/kernel_name structure, thus no caching of compiler results at
> all. It would be enough just to implement the API and the basis for
> the file format.

Anyways, I would definitely wait until after branching. This looks
like a potential bug-bringer (and I am branching tomorrow).

> Error checking could be improved via a metadata file that describes
> the device the target-specific files are for, etc.

Let's keep it simple. If you add metadata at this stage one would need
to edit the metadata if one wants to "tweak" the binary for any
reason, which is likely at this stage of development. Time lost on
metadata defining, implementation and experimenting. I would only add
that once the implementation is settled (at least partially).

BR
Carlos
From: Pekka J. <pek...@tu...> - 2012-01-17 12:27:23
BTW, this affects the "cleanup of temp directories" discussed
yesterday, and the build process somewhat.

How this could work is that the device drivers (which do the
compilation) are handed a target file (or directory) for the final
product(s), and the directories are managed from the higher-level API
implementation of the Program. The Program object can then keep track
of the locations of the "artifact files" so we can create the tarball
out of them (and also extract them to the correct locations in the
reverse case).

We have to keep those (final) files on disk until the Program has been
destroyed, as we do not know if the program wants them dumped via
clGetProgramInfo.

On 01/17/2012 01:53 PM, Pekka Jääskeläinen wrote:
>
> How to actually implement this quickly is that we could create
> an archive file (e.g. a tar) of the needed files from the temporary
> directories created by the kernel code generation.

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-17 11:53:29
Hello,

For implementing clCreateProgramWithBinary we need to define the
program binary format that was discussed in a thread titled "Workgroup
functions caching". The final conclusion in that discussion:

I wrote:
>> The alternative could be to define a simplistic custom wrapper
>> format that wraps in the final binaries and the bc. It would not
>> care about the format of the final binaries (thus the OpenCL binary
>> container *contents* would be platform-specific, but the container
>> itself not) and would store enough metadata for choosing the correct
>> final binary based on the dimensions.

On 12/19/2011 02:26 PM, Carlos Sánchez de La Lama wrote:
> That was my first thought. I think it is the simplest way.
> Any format capable of storing several files/buffers together will do.
> I said ELF cause BIF is defined as ELF, but any format will do. I
> suspect AMD chose ELF cause they had to put the ELF libraries there
> anyways (the binary inside the .text is again ELF in their case). But
> I would go for any sensible format (an existing one, no need to
> reinvent the wheel with a custom format).

How to actually implement this quickly is that we could create an
archive file (e.g. a tar) of the needed files from the temporary
directories created by the kernel code generation. Contents of this
file would be returned for each device in
clGetProgramInfo(CL_PROGRAM_BINARIES ..).

The archive file would contain a directory structure like:

src/kernel_name/source.cl
  -- source codes of all the kernels

bc/kernel_name/original.bc
  -- the LLVM bitcode of the kernel (unreplicated version)

bc/kernel_name/32_1_1.bc
  -- the LLVM bitcode of the kernel as a replicated WG function
     with 32x1x1 WG dimensions

obj/kernel_name/32_1_1.bin
  -- the final executable binary (e.g. ELF) for the 'kernel_name'
     kernel, for WG dimensions 32x1x1

The simplest implementation of this would only have to include the
src/kernel_name structure, thus no caching of compiler results at all.
It would be enough just to implement the API and the basis for the
file format. Support for the rest of the files could be added
gradually to actually speed up the compilation process, if available
in the archive.

Error checking could be improved via a metadata file that describes
the device the target-specific files are for, etc.

What do you think?

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-17 11:24:05
Hi,

In revision 144 of trunk I committed reference counting for several
OpenCL object types. In pocl_cl.h there is now a set of macros for
generic implementations of operations that are shared across the main
OpenCL object types.

Please check it out, especially the pocl_cl.h part:
http://bazaar.launchpad.net/~pocl/pocl/trunk/revision/144

Reference counting for events is not yet there, and it's the last one
of the retain function calls the book examples need.

-- 
Pekka
From: Pekka J. <pek...@tu...> - 2012-01-16 16:32:17
Hi,

I didn't yet add it to 'make check' (as most of the examples fail to
compile), but I added a Makefile for compiling the examples of
http://code.google.com/p/opencl-book-samples/ against a pocl *build
tree*. This should make it easier to test the build when adding new
APIs (no need to install pocl).

I listed the missing APIs for this example suite in the TODO file:

Known missing OpenCL 1.2 features
---------------------------------

Missing APIs used by the tested OpenCL example suites are entered
here. (*) == used by the opencl-book-samples.

4. THE OPENCL PLATFORM LAYER
   * 4.1 Querying platform info (properly)
   * 4.3 Partitioning a device
   * 4.4 Contexts
     * clRetainContext (*)

5. THE OPENCL RUNTIME
   * 5.1 Command queues
     * clRetainCommandQueue (*)
   * 5.2.1 Creating buffer objects
     * clCreateSubBuffer (*)
   * 5.2.4 Mapping buffer objects
     * clEnqueueMapBuffer (*)
   * 5.3 Image objects
     * clCreateImage2D (deprecated in OpenCL 1.2) (*)
   * 5.3.3 Reading, Writing and Copying Image Objects
     * clEnqueueReadImage (*)
   * 5.4 Querying, Unmapping, Migrating, ... Mem objects
   * 5.4.1 Retaining and Releasing Memory Objects
     * clRetainMemObject (*)
   * 5.4.2 Unmapping Mapped Memory Objects
     * clEnqueueUnmapMemObject (*)
   * 5.5 Sampler objects
   * 5.5.1 Creating Sampler Objects
     * clCreateSampler (*)
   * 5.6.1 Creating Program Objects
     * clCreateProgramWithBinary (*)
     * clRetainProgram (*)
   * 5.7.1 Creating Kernel Objects
     * clRetainKernel (*)
   * 5.9 Event objects
     * clWaitForEvents (*)
     * clReleaseEvent (*)
   * 5.10 Markers, Barriers and Waiting for Events
     * clEnqueueMarker (deprecated in OpenCL 1.2) (*)
   * 5.12 Profiling
   * Reference counting for all the OpenCL objects

6. THE OPENCL C PROGRAMMING LANGUAGE
   * 6.12.10 Async Copies
   * 6.12.11 Atomic functions

OpenCL 1.2 Extensions
   * 9.7 Sharing Memory Objects with OpenGL / OpenGL ES Buffer,
     Texture and Renderbuffer Objects
   * 9.7.6 Sharing memory objects that map to GL objects between GL
     and CL contexts
     * clEnqueueAcquireGLObjects (*)

There are not that many APIs missing. Event and command queue APIs are
being implemented by Kalle. Plenty of retain/release APIs are missing
as there's no reference counting implementation yet. I might pick up
on it as I already proposed a way forward with this (the macros) some
time ago here in the mailing list.

After those are in place, this set of examples is missing mainly the
image APIs, clCreateProgramWithBinary (whose implementation was
already discussed briefly), sub-buffers, and the GL/CL interoperation,
which is an extension and not high priority yet.

Sub-buffers could exploit the bufalloc code. The original buffer would
contain a mem_region from which the sub-buffers are allocated. An
additional API to bufalloc is needed where both the start and the size
of the chunk are given. I can look into this one as well.

On 01/12/2012 08:44 PM, Pekka Jääskeläinen wrote:
> It would be nice to get a large set of examples easily executable
> from 'make check' so we get constant testing of a larger test suite
> to avoid regressions during development and to get confidence in the
> implementation.

-- 
Pekka