From: Erik S. <esc...@pe...> - 2011-11-09 13:40:28
|
On Wed, Nov 9, 2011 at 7:43 AM, Carlos Sánchez de La Lama <car...@ur...> wrote:
>> Anyways, it seems the math lib of Newlib is nicely separable so we could
>> include it in pocl to avoid the requirement of having Newlib installed?
>> The license is a bit unclear but I think it's a BSD license for the libm
>> part. We could just copy the 'math' dir from Newlib to the source tree,
>> and then the various kernel lib implementations can cherry-pick the code
>> they need from there (at source code level, due to the different bitcode
>> targets).

The math lib is not the only thing that could be useful. For example, printf
is a very useful OpenCL extension that should be supported. The underlying
I/O stream representation probably needs to be implemented from scratch, but
the formatting code should work fine.

-erik

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2011-11-09 13:32:51
|
On 11/09/2011 02:43 PM, Carlos Sánchez de La Lama wrote:
> The reasons would of course be seen as "weak" if there was a "strong"
> reason against them, but the linking approach is a cleaner alternative
> IMO.

Fine. One drawback of this is that the whole Newlib bitcode lib then needs
to be built for all the targets that need it, even though a particular
kernel lib might require, say, only two math functions from it. I'll take a
look at whether the math library can be configured separately in Newlib,
which would reduce the harm from this. At least it seems to have its own
configure script.

> Of course, that is pretty clear. The point is that if we do not include
> newlib in pocl (which I am against) then making the default library
> depend on it might be a drawback, given that the builtins are also
> "standard". However, I do not feel strongly about it; a newlib (or any
> other C-lib) build-time dependency is not a big deal.

Let's think of the current targets in the pocl tree.

x86_64 (assuming a multicore) without WI replication is likely to benefit
from the host's CPU-optimized math lib due to having a smaller instruction
cache footprint (and I think CALL and LOOP are quite well optimized in this
regard in x86_64 microarchitectures). Therefore, it should generally use
the current builtin approach, not the inlined Newlib funcs, and lower to
the syslib calls instead.

ARM could use NEON (or other instruction set extension/co-processor)
optimized native math libs, if available. However, then the ABI of the lib
must match whatever clang generates. That is, most likely very
target-specific switches (that use the same ISEs as the device's libs) must
be used. I'm unsure if this is a problem or not.

TCE is fully customizable. It could easily have hardware sin/cos, for
example. Thus, it should use the intrinsics/builtins which are lowered to
the best possible instructions using the ADF info, and as a fallback it
should use the Newlib libm included in the TCE tree. On the other hand, the
kernel libs should be built against the ADF info so they can inline as much
as possible to exploit the static ILP.

> I would say:
> 1. Compile newlib to bytecode (I guess CC=clang
>    CFLAGS=-ccc-target-triplet=xxx is probably enough)

We do compile Newlib to a bitcode lib in TCE; I can check from its build
files what is required.

> 2. Make either default or per-device libs link against that and perform a
>    library linking step so the parts of the C library being used in the
>    kernel library get linked in, creating a self-contained kernel runtime
>    library.

I think 2 boils down to the question of whether inter-WI parallelism will
be the main source of DLP in pocl or not. As we have discussed, it depends
on e.g. the icache configuration and possibly on the compiled kernel
whether the replication is beneficial or only intra-WI DLP should be used.
For now, as there is no inter-WI vectorization support in pocl yet, I
suppose the libcall-based implementation should be the default. This fact
suggests that it's too early to trouble oneself with integrating Newlib (or
another math lib) into the build system too.

--
Pekka |
From: Carlos S. de La L. <car...@ur...> - 2011-11-09 12:44:01
|
Ok, I inline your "post" in the blueprint here:

> Requiring Newlib to be in device/host produces more trouble. A point in
> embedding the required functions inside pocl is to make pocl
> self-contained (aside from the LLVM/Clang dependency), that is, to make
> it easily portable to various platforms (hosts and devices) + the
> inlining benefits. I just got rid of the gcc dependency in the 'ld'
> branch and I'd like to get rid of the libm dependency (were it the Newlib
> or the native one) too.

There is no need to include newlib on the host. I was already thinking
about this when I proposed it. Newlib would be needed only at pocl compile
time; then (some of) the kernel libraries link against it, in bytecode,
producing self-contained kernel libraries with no external requirements.

> Newlib is quite big and contains the whole C library, which pocl does not
> need. It would require porting the whole newlib to the target in question
> whenever one wants to use pocl on a host/device. I see that as quite a
> bit more "overkill" than just copying the functions we need from some
> BSD/MIT licensed library, if one is found.

Not really. newlib works using stubs, so you need to port nothing; some of
the functions will have unresolved stubs, but we won't be using those. Only
the needed functions (say sin/cos/whatever) will get linked (library
behaviour). Basically it is the same as "borrowing" the sources, but at
bytecode level.

> Anyways, it seems the math lib of Newlib is nicely separable so we could
> include it in pocl to avoid the requirement of having Newlib installed?
> The license is a bit unclear but I think it's a BSD license for the libm
> part. We could just copy the 'math' dir from Newlib to the source tree
> and then the various kernel lib implementations can cherry-pick the code
> they need from there (at source code level, due to the different bitcode
> targets).

I do not like the idea of taking code out of there, for several reasons:

1) It is kind of "wrong", even when licenses allow it. If someone used
pocl, for example, I would prefer them to use pocl rather than taking the
passes out of the llvmopencl directory and using them in their own project.
If you take part of a project's code base you do not support the project.

2) It requires mimicking some of the configure/makefile structure of
newlib, configure switches, etc. Why reinvent the wheel? They give us the
source files with the build framework; let us use it.

The reasons would of course be seen as "weak" if there was a "strong"
reason against them, but the linking approach is a cleaner alternative IMO.

> A major point IMHO for the "inlineable versions by default" is the
> exploitation of inter-WI parallelism with vectors or long instructions,
> which is ruined if you have a library call in the kernel. Avoiding such
> can lead to a more parallelizable default generic lib (the ability to
> maybe execute some parts of sin/cos, for example, for multiple WIs using
> parallel instructions), which should be a good thing in the "performance
> portability" sense.

Of course, that is pretty clear. The point is that if we do not include
newlib in pocl (which I am against) then making the default library depend
on it might be a drawback, given that the builtins are also "standard".
However, I do not feel strongly about it; a newlib (or any other C-lib)
build-time dependency is not a big deal.

> So, I propose:
> 1. Copy the required math implementations from Newlib
> 2. Use them in the generic implementation and assume the device-optimized
>    libs use whatever is better for them

I would say:
1. Compile newlib to bytecode (I guess CC=clang
   CFLAGS=-ccc-target-triplet=xxx is probably enough)
2. Make either default or per-device libs link against that and perform a
   library linking step so the parts of the C library being used in the
   kernel library get linked in, creating a self-contained kernel runtime
   library.

Carlos |
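[Editor's sketch] Carlos's two-step proposal could look roughly like the following. This is a non-authoritative sketch only: the target triple, source paths, and output names are all hypothetical, the exact Newlib configure invocation differs per target, `-ccc-target-triplet` from the thread is spelled `-target` in later clang releases, and `llvm-link --only-needed` exists only in newer LLVM releases.

```sh
# 1. Build Newlib's libm as LLVM bitcode for a hypothetical target.
#    -emit-llvm makes clang emit bitcode objects instead of native ones.
cd newlib-src/newlib/libm
CC=clang CFLAGS="-target tce-tut-llvm -emit-llvm" ./configure --host=tce
make

# 2. Link the bitcode objects into one library, then pull only the
#    needed symbols into the kernel library (library-style resolution).
llvm-link math/*.o -o libm.bc
llvm-link --only-needed kernel-lib.bc libm.bc -o kernel-runtime.bc
```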
From: Pekka J. <pek...@tu...> - 2011-11-09 11:50:44
|
On 11/09/2011 01:38 PM, Carlos Sánchez de La Lama wrote:
> I put some thought on the blueprint whiteboard, in launchpad.

Thanks. I added some more. The Blueprint system of Launchpad doesn't seem
to be designed for discussions, so it's better to discuss it further here.
Unless you agree with the current proposal and I can start implementing it,
of course! :)

--
--Pekka |
From: Carlos S. de La L. <car...@ur...> - 2011-11-09 11:39:14
|
I put some thought on the blueprint whiteboard, in launchpad.

Carlos

On 09/11/2011, at 09:54, Pekka Jääskeläinen wrote:
> Hi,
>
> Please check:
> https://blueprints.launchpad.net/pocl/+spec/no-libm-dep
>
> Comments, thoughts?
> --
> --Pekka
>
> ------------------------------------------------------------------------------
> RSA(R) Conference 2012
> Save $700 by Nov 18
> Register now
> http://p.sf.net/sfu/rsa-sfdev2dev1
> _______________________________________________
> Pocl-devel mailing list
> Poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel |
From: Pekka J. <pek...@tu...> - 2011-11-09 07:54:28
|
Hi, Please check: https://blueprints.launchpad.net/pocl/+spec/no-libm-dep Comments, thoughts? -- --Pekka |
From: Carlos S. de La L. <car...@ur...> - 2011-11-07 10:02:48
|
Yep, good idea... just another macro would be needed for cases when both
are needed (well, or just nesting the two tests).

BR, Carlos

On 06/11/2011, at 22:24, Pekka Jääskeläinen wrote:
> On 11/06/2011 02:17 AM, Erik Schnetter wrote:
>> Does this sound like a good idea?
>
> Seems like a good idea to me.
>
> BR,
> --
> --Pekka |
From: Pekka J. <pek...@tu...> - 2011-11-06 20:24:49
|
On 11/06/2011 02:17 AM, Erik Schnetter wrote: > Does this sound like a good idea? Seems like a good idea to me. BR, -- --Pekka |
From: Erik S. <esc...@pe...> - 2011-11-06 00:17:42
|
The current way we handle cl_khr_fp64 and cl_khr_int64 leads to much code
duplication. One alternative would be to define two macros

    __IFDBL(x)
    __IFLNG(x)

that expand to their arguments if double and long are supported,
respectively, and expand to nothing otherwise:

    #ifdef cl_khr_fp64
    #  define __IFDBL(x) x
    #else
    #  define __IFDBL(x)
    #endif

We would surround every line in _kernel.h that uses double or long with the
respective macro. This would look like

    #define _CL_DECLARE_FUNC_V_V(NAME)              \
      float   _cl_overloadable NAME(float  );       \
      float2  _cl_overloadable NAME(float2 );       \
      __IFDBL(double  _cl_overloadable NAME(double );) \
      __IFDBL(double2 _cl_overloadable NAME(double2);)

Does this sound like a good idea?

-erik

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
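[Editor's sketch] A minimal standalone demonstration of the macro Erik proposes. The function names are hypothetical, and cl_khr_fp64 is force-defined here just for the demo; in pocl it would come from the OpenCL extension machinery.

```c
/* Assume doubles are supported for this demonstration; in the real
   kernel library this define comes from the cl_khr_fp64 extension. */
#define cl_khr_fp64 1

#ifdef cl_khr_fp64
#  define __IFDBL(x) x
#else
#  define __IFDBL(x)
#endif

/* The float version is always declared; the double version appears
   only when cl_khr_fp64 is defined. Names are hypothetical. */
float my_abs_f(float x) { return x < 0.0f ? -x : x; }
__IFDBL(double my_abs_d(double x) { return x < 0.0 ? -x : x; })
```

With cl_khr_fp64 undefined, the `__IFDBL(...)` line expands to nothing and only the float overload exists, which is exactly the deduplication the macro buys.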
From: Erik S. <esc...@pe...> - 2011-10-31 15:37:12
|
The aggressive inlining (without having to program compiler intrinsics or
perform header file gymnastics) is one of the most compelling features of
OpenCL (and an LLVM implementation).

Yes, I'm using the "correct" clobber specifiers, as you suggest.

-erik

2011/10/31 Pekka Jääskeläinen <pek...@tu...>:
> On 10/31/2011 04:24 PM, Erik Schnetter wrote:
>> However, I am quite certain (but can't guarantee it) that the other
>> vector elements of the respective xmm register are unused. That is at
>> least the calling convention for x86; of course, I don't know whether
>
> Yes, that might be true for calls. However, with OpenCL C kernels we want
> to inline functions aggressively. In that case your asm clobber list has
> to include the whole xmm register. This means that the code that precedes
> the call to the inline asm block has to save the XMM if it uses the other
> elements before entering your inline asm block.
>
> --
> Pekka

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2011-10-31 14:47:44
|
On 10/31/2011 04:24 PM, Erik Schnetter wrote:
> However, I am quite certain (but can't guarantee it) that the other
> vector elements of the respective xmm register are unused. That is at
> least the calling convention for x86; of course, I don't know whether

Yes, that might be true for calls. However, with OpenCL C kernels we want
to inline functions aggressively. In that case your asm clobber list has to
include the whole xmm register. This means that the code that precedes the
call to the inline asm block has to save the XMM if it uses the other
elements before entering your inline asm block.

--
Pekka |
From: Erik S. <esc...@pe...> - 2011-10-31 14:24:15
|
Pekka

Yes, such an andps instruction is possible. However, I am quite certain
(but can't guarantee it) that the other vector elements of the respective
xmm register are unused. That is at least the calling convention for x86;
of course, I don't know whether llvm is doing anything clever within a
routine, but looking at the generated code, this does not seem to be the
case.

I've just figured out the extended asm syntax for this.

-erik

2011/10/31 Pekka Jääskeläinen <pek...@tu...>:
> On 10/31/2011 12:21 AM, Erik Schnetter wrote:
>> You are right; andss does not exist, but there is an andps instead.
>
> It seems to be a SIMD instruction that performs the 'and' for 4 single
> precision floats, and you are performing it on a single one.
>
> I can understand that LLVM cannot select it automatically, as in that
> case it would clobber all the other floats in the SIMD register too, and
> (at least when inlined) they can contain live data. Thus, if it selected
> it automatically, it would have to "spill" the other parts of the SIMD
> reg before doing that, which is quite costly.
>
> However, if you are sure using ANDPS here is faster, you can generate an
> inline asm that has a safe 'all ones' mask for the rest of the fields,
> right?
>
> --
> --Pekka

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
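[Editor's sketch] One way the suggested approach could look in GCC/Clang extended asm: an ANDPS whose mask clears only the sign bit of lane 0 and is all-ones in lanes 1-3, so the other elements of the xmm register pass through unchanged. The function name is hypothetical, and the asm path is guarded so a portable fallback is used off x86:

```c
typedef float v4sf __attribute__((vector_size(16)));

/* Mask: clear the IEEE-754 sign bit in lane 0, keep all bits in
   lanes 1-3 (x AND all-ones == x), per Pekka's suggestion. */
static const union { unsigned int u[4]; v4sf v; } fabs_mask =
    {{ 0x7fffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu }};

static float fabs_andps(float x)
{
#if defined(__SSE__) || defined(__x86_64__)
    /* "+x" keeps x in an xmm register; ANDPS operates on it directly,
       with no round trip through a general-purpose register. */
    __asm__ ("andps %1, %0" : "+x"(x) : "x"(fabs_mask.v));
    return x;
#else
    return x < 0.0f ? -x : x;   /* portable fallback */
#endif
}
```

Whether this beats the movd/andl/movd sequence in practice would still need measuring, as the thread notes.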
From: Carlos S. de La L. <car...@ur...> - 2011-10-31 11:24:36
|
Hi all,

I changed the autogen.sh script to one from buildconf
(http://buildconf.brlcad.org/). The old one was not working well for me
because I always build out of the source directory, and the old one called
configure directly. The main difference is that now you should run
./autogen.sh after checkout, and then run configure manually once the
configure script is regenerated.

BR, Carlos |
From: Carlos S. de La L. <car...@ur...> - 2011-10-31 11:00:26
|
Hi Erik,

I saw you made it the default in the scripts. That is OK for now; at some
point the host API will store binaries for all the supported targets and
will link each one with its needed library. But that code is not done yet.

BR, Carlos

On Sat, 2011-10-29 at 12:30 -0400, Erik Schnetter wrote:
> I just split the x86 specific implementations from the generic
> run-time library. However, when running OpenCL code, this specific
> library is not used. I can modify the pocl-*.in scripts manually --
> what is the better way?
>
> -erik |
From: Pekka J. <pek...@tu...> - 2011-10-31 07:25:51
|
On 10/31/2011 12:21 AM, Erik Schnetter wrote:
> You are right; andss does not exist, but there is an andps instead.

It seems to be a SIMD instruction that performs the 'and' for 4 single
precision floats, and you are performing it on a single one.

I can understand that LLVM cannot select it automatically, as in that case
it would clobber all the other floats in the SIMD register too, and (at
least when inlined) they can contain live data. Thus, if it selected it
automatically, it would have to "spill" the other parts of the SIMD reg
before doing that, which is quite costly.

However, if you are sure using ANDPS here is faster, you can generate an
inline asm that has a safe 'all ones' mask for the rest of the fields,
right?

--
--Pekka |
From: Erik S. <esc...@pe...> - 2011-10-30 22:21:53
|
Pekka

You are right; andss does not exist, but there is an andps instead.

-erik

2011/10/30 Pekka Jääskeläinen <pek...@tu...>:
> Hi Erik,
>
> On 10/29/2011 07:43 PM, Erik Schnetter wrote:
>> When I use clang 3.1 (a recent snapshot) to translate e.g. the fabs
>> intrinsic, acting on a single floating point number, then the
>> generated x86 code looks like
>>
>>   _Z4fabsf:                   # @_Z4fabsf
>>     movd %xmm0, %eax
>>     andl $2147483647, %eax    # imm = 0x7FFFFFFF
>>     movd %eax, %xmm0
>>     ret
>>
>> This is not optimal, since the value is moved from xmm0 to eax and
>> back, which is not necessary. Instead of andl, I expect to see the
>> andss instruction.
>>
>> How do I go about having this corrected? Is this a problem in pocl, in
>> clang, in llvm, or in the way one of these are used?
>
> I'm not familiar with the SSE instruction extensions, but quick googling
> didn't return 'andss' for single floats. E.g.:
> http://en.wikipedia.org/wiki/X86_instruction_listings
>
> I see this absf implementation uses bit manipulation to reset the sign
> bit of the float word to return the absolute value. Thus, in case SSE
> does not have 'and', it has to go back to the x86 instruction set to
> perform the and to reset the sign bit.
>
> If SSE has a suitable 'and', it should be able to operate directly on the
> xmm reg, in which case it's an LLVM instruction selection issue. In that
> case overriding the implementation with inline assembly can circumvent
> the issue. Of course, the preferred way is to add a proper 'andss' to the
> instruction patterns on the LLVM side, if such an instruction is
> available.
>
> --
> --Pekka

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2011-10-30 16:22:14
|
On 10/30/2011 05:38 PM, Erik Schnetter wrote:
> Another option could be to build a small C program that uses OpenMP;
> the OpenMP run time contains logic that determines a good number of
> threads to use. You would look at omp_max_threads().

I wouldn't like to introduce a library dependency just because of this. I'm
sure there are OS-specific ways to figure out the count of cores and
hardware threads per core in the different operating systems. Or one could
just resort to some CPU info instruction in the device, if available.

After all, the current need of pocl is quite simple: if we want to exploit
the task-level parallelism provided by the device to the max while
minimizing the threading overheads, it boils down to the number of hardware
threads per core times the core count (or the number of WGs, whichever is
smaller), doesn't it?

If disk or network I/O were of concern, there should be additional threads
to hide the I/O latencies (at the OS level), but now we are mainly
concerned with hiding the memory latencies, because the kernels do not
access files or the network like, for example, OpenMP loops in general can
do. For memory latency hiding, only hardware threads can be of help, AFAIK.

An additional consideration is the size of the local memory, as each
parallel WG needs a separate local memory space. Currently pocl just
assumes the local memory malloc overhead (and the size) per thread is
tolerable. In reality, for example on memory-tight embedded targets, this
should also restrict the max number of parallel WG threads. If you can
afford only one local memory "alive" at the same time, you can launch only
one WG thread.

BR,
--
--Pekka |
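[Editor's sketch] Pekka's rule of thumb can be sketched in plain POSIX C. The helper name is hypothetical; `sysconf(_SC_NPROCESSORS_ONLN)` reports online logical CPUs (cores times hardware threads per core) on Linux and other POSIX systems, avoiding the flaky /proc/cpuinfo parsing mentioned later in the thread:

```c
#include <unistd.h>

/* Hypothetical helper: how many threads to launch for num_workgroups
   work-groups. The logical CPU count already folds in hardware threads
   per core; launch the smaller of that and the WG count. */
static int pocl_wg_thread_count(int num_workgroups)
{
    long hw = sysconf(_SC_NPROCESSORS_ONLN);
    if (hw < 1)
        hw = 8;  /* same fallback the thread mentions for Mac/Windows */
    return num_workgroups < hw ? num_workgroups : (int)hw;
}
```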
From: Erik S. <esc...@pe...> - 2011-10-30 15:38:46
|
Another option could be to build a small C program that uses OpenMP; the
OpenMP run time contains logic that determines a good number of threads to
use. You would look at omp_max_threads().

-erik

2011/10/30 Erik Schnetter <esc...@pe...>:
> There is "hwloc", distributed on <http://www.open-mpi.org/>. This
> library determines the number of logical CPUs, as well as their
> association with various cache levels and NUMA properties.
>
> -erik
>
> 2011/10/30 Pekka Jääskeläinen <pek...@tu...>:
>> On 10/25/2011 07:01 PM, Carlos Sánchez de La Lama wrote:
>>> I just commited rev. 45 with a multithreading device, similar to native
>>> but creates a thread for each workgroup.
>>
>> I committed a modification to the multithreading code on Friday.
>>
>> Now it creates a "sensible number" of threads for the multicore
>> instead of blindly creating as many threads as there are WGs.
>>
>> However, parsing /proc/cpuinfo to produce the number of hardware
>> threads available in the processor is a bit flaky, so (if you run
>> Linux) please test that it returns a sensible number of threads for
>> you by enabling the #define DEBUG_MAX_THREAD_COUNT in pthread.c and
>> compiling+running one of the examples. It should print out the "max
>> thread count" for your (multi)processor before running the kernel. For
>> Mac (and Windows) we need to figure out some other way to get the
>> hardware thread count, which defaults to 8 now.
>>
>> --
>> --Pekka

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Erik S. <esc...@pe...> - 2011-10-30 15:37:23
|
There is "hwloc", distributed on <http://www.open-mpi.org/>. This library
determines the number of logical CPUs, as well as their association with
various cache levels and NUMA properties.

-erik

2011/10/30 Pekka Jääskeläinen <pek...@tu...>:
> On 10/25/2011 07:01 PM, Carlos Sánchez de La Lama wrote:
>> I just commited rev. 45 with a multithreading device, similar to native
>> but creates a thread for each workgroup.
>
> I committed a modification to the multithreading code on Friday.
>
> Now it creates a "sensible number" of threads for the multicore
> instead of blindly creating as many threads as there are WGs.
>
> However, parsing /proc/cpuinfo to produce the number of hardware
> threads available in the processor is a bit flaky, so (if you run
> Linux) please test that it returns a sensible number of threads for
> you by enabling the #define DEBUG_MAX_THREAD_COUNT in pthread.c and
> compiling+running one of the examples. It should print out the "max
> thread count" for your (multi)processor before running the kernel. For
> Mac (and Windows) we need to figure out some other way to get the
> hardware thread count, which defaults to 8 now.
>
> --
> --Pekka

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Pekka J. <pek...@tu...> - 2011-10-30 11:34:08
|
On 10/25/2011 07:01 PM, Carlos Sánchez de La Lama wrote:
> I just commited rev. 45 with a multithreading device, similar to native
> but creates a thread for each workgroup.

I committed a modification to the multithreading code on Friday.

Now it creates a "sensible number" of threads for the multicore instead of
blindly creating as many threads as there are WGs.

However, parsing /proc/cpuinfo to produce the number of hardware threads
available in the processor is a bit flaky, so (if you run Linux) please
test that it returns a sensible number of threads for you by enabling the
#define DEBUG_MAX_THREAD_COUNT in pthread.c and compiling+running one of
the examples. It should print out the "max thread count" for your
(multi)processor before running the kernel. For Mac (and Windows) we need
to figure out some other way to get the hardware thread count, which
defaults to 8 now.

--
--Pekka |
From: Pekka J. <pek...@tu...> - 2011-10-30 11:01:44
|
Hi Erik,

On 10/29/2011 07:43 PM, Erik Schnetter wrote:
> When I use clang 3.1 (a recent snapshot) to translate e.g. the fabs
> intrinsic, acting on a single floating point number, then the
> generated x86 code looks like
>
>   _Z4fabsf:                   # @_Z4fabsf
>     movd %xmm0, %eax
>     andl $2147483647, %eax    # imm = 0x7FFFFFFF
>     movd %eax, %xmm0
>     ret
>
> This is not optimal, since the value is moved from xmm0 to eax and
> back, which is not necessary. Instead of andl, I expect to see the
> andss instruction.
>
> How do I go about having this corrected? Is this a problem in pocl, in
> clang, in llvm, or in the way one of these are used?

I'm not familiar with the SSE instruction extensions, but quick googling
didn't return 'andss' for single floats. E.g.:
http://en.wikipedia.org/wiki/X86_instruction_listings

I see this absf implementation uses bit manipulation to reset the sign bit
of the float word to return the absolute value. Thus, in case SSE does not
have 'and', it has to go back to the x86 instruction set to perform the and
to reset the sign bit.

If SSE has a suitable 'and', it should be able to operate directly on the
xmm reg, in which case it's an LLVM instruction selection issue. In that
case overriding the implementation with inline assembly can circumvent the
issue. Of course, the preferred way is to add a proper 'andss' to the
instruction patterns on the LLVM side, if such an instruction is available.

--
--Pekka |
From: Erik S. <esc...@pe...> - 2011-10-29 16:43:20
|
When I use clang 3.1 (a recent snapshot) to translate e.g. the fabs
intrinsic, acting on a single floating point number, then the generated x86
code looks like

  _Z4fabsf:                   # @_Z4fabsf
    movd %xmm0, %eax
    andl $2147483647, %eax    # imm = 0x7FFFFFFF
    movd %eax, %xmm0
    ret

This is not optimal, since the value is moved from xmm0 to eax and back,
which is not necessary. Instead of andl, I expect to see the andss
instruction.

How do I go about having this corrected? Is this a problem in pocl, in
clang, in llvm, or in the way one of these are used?

-erik

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Erik S. <esc...@pe...> - 2011-10-29 16:30:18
|
I just split the x86 specific implementations from the generic run-time library. However, when running OpenCL code, this specific library is not used. I can modify the pocl-*.in scripts manually -- what is the better way? -erik -- Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/ AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Carlos S. de La L. <car...@ur...> - 2011-10-25 16:01:59
|
Hi all,

I just committed rev. 45 with a multithreading device, similar to native
but which creates a thread for each workgroup. This device is also made the
default device.

BR, Carlos |
From: Carlos S. de La L. <car...@ur...> - 2011-10-24 14:05:31
|
> > 1) Make the kernel library runtime-compatible with all devices. This
> > was the planned approach; it can be done by selecting the
> > implementation for a device using runtime conditionals (C-level ifs)
> > instead of preprocessor ones (#if/#ifdefs). LLVM should then eliminate
> > dead code when generating the final binary.
>
> This does not allow hardware-dependent optimisations. For example,
> certain function calls / assembler instructions are only available on
> certain hardware, and lead to syntax errors on others.

After some rethinking and discussion, it is probably better to forget about
bytecode device independency. It is not designed to work that way, and even
compiled bytecode for different targets is already different just after
clang. So: different bytecodes inside API structures and different
libraries for each device, then.

One interesting point is that if we want proper "vectorization" (once that
is working) on the library code also, the library needs to be compiled into
bytecode and linked at bytecode level, before WG generation. But LLVM
assembly can handle inline target assembly, so this should not be a
problem. One pending task is thus to organize the library Makefiles in a
way that allows each target to override generic implementations with
target-dependent ones.

> Now, it would be nice if these were not necessary. This would require
> providing them via LLVM, i.e. implementing the OpenCL run time (e.g.
> sin, cos, sqrt, their vectorised versions, etc.) in LLVM instead of in
> POCL.

It would simplify our job, but it would make every LLVM backend have to
implement all those functions, which would make backend implementations
more complex. I am dubious that the LLVM project will go that way.

BR, Carlos |
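[Editor's sketch] The "C-level ifs instead of #ifdefs" idea from the first quote can be sketched as follows. The constant and function names are hypothetical; because the condition is a compile-time constant per device, LLVM's dead-code elimination removes the untaken branch from the final bitcode, which is the behaviour the quote relies on:

```c
#include <string.h>

/* Hypothetical per-device flag; in a real kernel library this would
   come from the device configuration, not a hard-coded enum. */
enum { POCL_DEVICE_PREFERS_BRANCHLESS = 1 };

static float generic_fabs(float x)
{
    if (POCL_DEVICE_PREFERS_BRANCHLESS) {
        /* branchless path: clear the IEEE-754 sign bit via bit twiddling */
        unsigned int u;
        memcpy(&u, &x, sizeof u);
        u &= 0x7fffffffu;
        memcpy(&x, &u, sizeof u);
        return x;
    } else {
        return x < 0.0f ? -x : x;   /* straightforward branching path */
    }
}
```

As the reply notes, this only works when both branches compile for every target, which is exactly why hardware-specific instructions break the scheme.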