From: Carlos S. de La L. <car...@ur...> - 2011-12-19 12:27:03
>> Again, ELF is used in BIF just as a wrapper, so you can create a
>> ".myownstuff" section and put whatever you want inside. There is no need
>> for it to be in the ".text"
>
> I know. But then, how I see it, ELF is just "abused" as an archive format
> here without actual benefits over generic file archive formats. If you do
> not use the executable and linking related info but treat all contents as
> "unknown binary blobs" with some "custom metadata", what's the point of
> using such a format?

None, it is just a format; any is OK as long as it is simple enough (that's
why I prefer ELF over FatELF, but any other simpler form would be even
better).

> If we use ELF as the "OpenCL binary format" I think we should keep the
> binary program dlopenable (in .text, .rodata, .data) to have the benefits
> from the standard format (at least in operating systems that support ELF).
> This means (AFAIU) compiling the multiple versions of the kernel to the
> same .text with different function names. Or, if something simpler is
> chosen, it could be a simple wrapper with quickly extractable final files
> (even tar could work, or even our own simple format).

Agreed, the directly loadable ELF would be the nicest. But how do we support
this in a way that does not clash with systems where the loadable libraries
are not ELF? One option is to leave binary generation solely to the driver
so it can take care of it, but even some drivers (native/pthread) might have
to be ELF or non-ELF depending on the system (Darwin is not ELF, for
example).

> The alternative could be to define a simplistic custom wrapper format that
> wraps in the final binaries and the bc. It would not care about the format
> of the final binaries (thus the OpenCL binary container *contents* would
> be platform-specific, but the container itself not) and would store enough
> metadata for choosing the correct final binary based on the dimensions.

That was my first thought. I think it is the simplest way. Any format
capable of storing several files/buffers together will do. I said ELF
because BIF is defined as ELF, but any format will do. I suspect AMD chose
ELF because they had to put the ELF libraries there anyway (the binary
inside the .text is again ELF in their case). But I would go for any
sensible format (an existing one; no need to reinvent the wheel with a
custom format).

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 12:05:02
On 12/19/2011 01:43 PM, Carlos Sánchez de La Lama wrote:
> Again, ELF is used in BIF just as a wrapper, so you can create a
> ".myownstuff" section and put whatever you want inside. There is no need
> for it to be in the ".text"

I know. But then, how I see it, ELF is just "abused" as an archive format
here without actual benefits over generic file archive formats. If you do
not use the executable and linking related info but treat all contents as
"unknown binary blobs" with some "custom metadata", what's the point of
using such a format? This adds a dependency on the ELF writing libraries
(think Windows or embedded, for example) even when something much simpler
could suffice.

If we use ELF as the "OpenCL binary format" I think we should keep the
binary program dlopenable (in .text, .rodata, .data) to get the benefits of
the standard format (at least in operating systems that support ELF). This
means (AFAIU) compiling the multiple versions of the kernel into the same
.text with different function names. Or, if something simpler is chosen, it
could be a simple wrapper with quickly extractable final files (even tar
could work, or even our own simple format).

The alternative could be to define a simplistic custom wrapper format that
wraps in the final binaries and the bc. It would not care about the format
of the final binaries (thus the OpenCL binary container *contents* would be
platform-specific, but the container itself not) and would store enough
metadata for choosing the correct final binary based on the dimensions.

-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 11:43:43
>> Also the final binary, optionally.
>
> OK. In our case the kernel just might have multiple versions for
> the multiple dimensions in the .text section. Should work...

Or even multiple versions of the binary in different sections.

>>> The OpenCL API for fetching and loading the program binaries is
>>> multi-device. Thus the format should not be tied to an architecture as
>>> it can contain the same kernels compiled for multiple devices.
>>
>> What does this mean?
>
> I think it means that for example in the case of AMD you could have
> the CPU and the GPU (device) versions of the program in the same
> (OpenCL) binary. I see from the specs that they do not support this but
> store only the GPU or CPU bits, not both:
>
> "By default, OpenCL generates a binary that has LLVM IR, AMD IL, and the
> executable for the GPU (.llvmir, .amdil, and .text sections), as well as
> LLVM IR and the executable for the CPU (.llvmir and .text sections)."

Given that our LLVM IR format can be linked to a device-dependent
architecture, we are not going to support binary retargeting anyway, so we
should not bother about that.

> ELF has only one architecture-specific .text section, IIUC, so it would
> not work for this.

Again, ELF is used in BIF just as a wrapper, so you can create a
".myownstuff" section and put whatever you want inside. There is no need
for it to be in the ".text".

>> Any other option (tar/zip/ELF/whatever) would do the same, but as this
>> is documented and used in an OpenCL SDK I would suggest doing the same.
>
> I do not consider the main advantage to be that it's used by AMD. But
> in case it can be used as a directly dlopenable program binary then it's
> a real advantage (BTW on MacOS or at least Windows we might need
> something else then?). It would avoid the objcopy step in case the
> binary contains a kernel version suitable for launching directly for the
> given dimensions... probably a small saving but still a nifty thing to
> have.

If we want to be able to dlopen the binary directly then we need something
like this FatELF... but as you found out, the project seems to be half
dead; dlopen is not going to support FatELF binaries on almost any system,
so we would end up with more stuff to fix ourselves. I would go for the
"keep it simple" way.

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 11:19:37
On 12/19/2011 12:59 PM, Carlos Sánchez de La Lama wrote:
> Also the final binary, optionally.

OK. In our case the kernel just might have multiple versions for the
multiple dimensions in the .text section. Should work...

>> The OpenCL API for fetching and loading the program binaries is
>> multi-device. Thus the format should not be tied to an architecture as
>> it can contain the same kernels compiled for multiple devices.
>
> What does this mean?

I think it means that for example in the case of AMD you could have the CPU
and the GPU (device) versions of the program in the same (OpenCL) binary. I
see from the specs that they do not support this but store only the GPU or
CPU bits, not both:

"By default, OpenCL generates a binary that has LLVM IR, AMD IL, and the
executable for the GPU (.llvmir, .amdil, and .text sections), as well as
LLVM IR and the executable for the CPU (.llvmir and .text sections)."

ELF has only one architecture-specific .text section, IIUC, so it would not
work for this. Anyway, we can add a separate wrapper for the multi-device
case on top of this (or use FatELF) later, if we see the need.

http://icculus.org/fatelf/

"FatELF lets you pack binaries into one file, separated by OS ABI, OS ABI
version, byte order and word size, and most importantly, CPU architecture."

> Any other option (tar/zip/ELF/whatever) would do the same, but as this
> is documented and used in an OpenCL SDK I would suggest doing the same.

I do not consider the main advantage to be that it's used by AMD. But in
case it can be used as a directly dlopenable program binary then it's a
real advantage (BTW on MacOS or at least Windows we might need something
else then?). It would avoid the objcopy step in case the binary contains a
kernel version suitable for launching directly for the given dimensions...
probably a small saving but still a nifty thing to have.

-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 11:00:05
>> I think we can reuse the AMD one (it just uses ELF sections for
>> different binaries) as it is well specified.
>
> IIRC, AMD's ELF files store the LLVM bitcode and the AMD-IL (both IRs),
> not the final binary, right?

Also the final binary, optionally.

> The OpenCL API for fetching and loading the program binaries is
> multi-device. Thus the format should not be tied to an architecture as it
> can contain the same kernels compiled for multiple devices.

What does this mean?

> I'm not sure if ELF provides some benefits for this scenario. We
> basically need a format that stores multiple independent (sometimes ELF?)
> binaries for the compiled program(s) and the LLVM bitcode as a separate
> one. The compiled programs should be quickly loadable by the dynamic
> linker, preferably without extracting them first to a separate file. Can
> ELF support this nicely? The stored binaries (same kernels compiled for
> multiple dimensions) might contain clashing symbols, AFAIU, so they
> should be stored as "binary blob sections" in ELF, not as "program
> sections" with linkage/relocation info? You know ELF better than I do...

ELF is used as a trick here. It is not real "ELF". You just take the
binary/bytecode/source file/whatever and put it into an ELF section. No
relocations or anything at that level; the ELF is used only as a way to put
all the files together. If some of the binaries are ELF, then the whole ELF
would be inside a section (you would need to "objcopy" it out, and then you
would get a real ELF).

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Any other option (tar/zip/ELF/whatever) would do the same, but as this is
documented and used in an OpenCL SDK I would suggest doing the same.

> What about FatELF?

If it provides advantages in our use case and it is widespread enough, why
not. But if it does not, then using ELF (not real ELF, but ELF as in the
AMD SDK "BIF" format) is probably easier to implement (just pack the files
together).

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 10:36:52
On 12/19/2011 12:18 PM, Carlos Sánchez de La Lama wrote:
> I think we can reuse the AMD one (it just uses ELF sections for different
> binaries) as it is well specified.

IIRC, AMD's ELF files store the LLVM bitcode and the AMD-IL (both IRs), not
the final binary, right?

The OpenCL API for fetching and loading the program binaries is
multi-device. Thus the format should not be tied to an architecture, as it
can contain the same kernels compiled for multiple devices.

I'm not sure if ELF provides some benefits for this scenario. We basically
need a format that stores multiple independent (sometimes ELF?) binaries
for the compiled program(s) and the LLVM bitcode as a separate one. The
compiled programs should be quickly loadable by the dynamic linker,
preferably without extracting them first to a separate file. Can ELF
support this nicely? The stored binaries (same kernels compiled for
multiple dimensions) might contain clashing symbols, AFAIU, so they should
be stored as "binary blob sections" in ELF, not as "program sections" with
linkage/relocation info? You know ELF better than I do...

What about FatELF?
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format#FatELF:_Universal_Binaries_for_Linux

Let's see...

-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 10:18:57
>>> Although I'd like to cache the final binary, not only the bitcode
>
> I meant to (also) cache (or store in the binary format) the code gen
> results, i.e., the final bits.

OK, I had actually missed that. Then some binary format is needed; I think
we can reuse the AMD one (it just uses ELF sections for different binaries)
as it is well specified.

> All in all, I don't think there are many cases when you do *not* want
> caching (even over multiple OpenCL program runs). If it works "behind the
> scenes" like ccache (and includes the pocl+LLVM versions in the hash) it
> should always be beneficial. Disk space is cheap.

Binary caching is probably always OK. I was referring to storing replicated
WG functions in the same LLVM IR module when I said it has drawbacks.

> I think clCreateProgramWithBinary can also contain all this functionality
> for manual caching:
>
> "The program binary can consist of either or both:
> Device-specific code and/or,
> Implementation-specific intermediate representation (IR) which will be
> converted to the device-specific code."

Yep, the binary format is really implementation dependent, so you can put
whatever you want there. The point is that in our case we always need the
LLVM IR, because while the OpenCL API allows storing only the binary for a
device (the AMD SDK can do this, and gives an "invalid device" error when
loading a binary for a different architecture), in our case storing the
binary would mean fixing also the dimensions, which is not API compliant in
the general case AFAIK.

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 09:21:43
Hi,

I think you missed this:

>> Although I'd like to cache the final binary, not only the bitcode

I meant to (also) cache (or store in the binary format) the code gen
results, i.e., the final bits. As you know, for TCE it might take
considerable time to generate the code from the bitcode.

Also, in general, in a production system that uses OpenCL for the program,
we do not want to execute the compiler at all if we can avoid it. We might
even exclude the compiler from the host to provide something like the
"standalone mode" (to ship binaries only) but in a more standards-compliant
way. Useless overhead is useless overhead. It's especially the case for
mobile devices with energy constraints.

All in all, I don't think there are many cases when you do *not* want
caching (even over multiple OpenCL program runs). If it works "behind the
scenes" like ccache (and includes the pocl+LLVM versions in the hash) it
should always be beneficial. Disk space is cheap.

I think clCreateProgramWithBinary can also contain all this functionality
for manual caching:

"The program binary can consist of either or both:
Device-specific code and/or,
Implementation-specific intermediate representation (IR) which will be
converted to the device-specific code."

"OpenCL allows applications to create a program object using the program
source or binary and build appropriate program executables. This can be
very useful as it allows applications to load program source and then
compile and link to generate a program executable online on its first
instance for appropriate OpenCL devices in the system. These executables
can now be queried and cached by the application. Future instances of the
application launching will no longer need to compile and link the program
executables. The cached executables can be read and loaded by the
application, which can help significantly reduce the application
initialization time."

BR,
-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 09:02:10
> We currently do this in C++, and I want to port this code to OpenCL.
> Unconditional inlining of all functions would not be good for this
> application. Would it be possible to skip functions that don't call a
> get_*() function, or to skip inlining functions marked "noinline"?

It is; that is why I removed the forced inlining. The passes do not
strictly require that the kernel is fully inlined (in fact, inlining is now
done by LLVM with its own criteria). What needs to be fixed (as per your
bug report) is to always inline calls leading to one of those get_xxx().

> Instead of privatizing the code for each thread, is it possible to
> privatize these variables on which the get_*() functions are based? With
> hyperthreading or modern AMD processors, it can be beneficial to have
> several threads executing the same code, even if some expressions cannot
> be evaluated at build time.

No, those variables need to be different for each workgroup, so we cannot
make them global (multiple workgroups might be running in parallel in
threaded environments). The only way around this is using a context
structure that gets passed to all subfunctions, but old passes used to work
like that and, in general, the generated code is much worse due to the
loads and stores to that structure.

Remember there is no threading "within" the workgroup. Threads are created
for different workgroups, but not for different work-items of the same
workgroup.

Carlos
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 08:48:51
Hi,

moved this from the bug report comments (better to discuss on the list, I
think).

>> How many kernels are cached? E.g. in pthread.c, there is an if statement
>> "if (d->current_kernel != kernel)", as if only one kernel was cached. If
>> that is so, would that be the right place to introduce a larger cache?

Only one right now. A larger cache is needed, I agree.

> Yes, Carlos. The cached result depends on the dimensions so saving
> multiple versions could work. Although I'd like to cache the final
> binary, not only the bitcode, to save all the compilation costs. This
> will be useful especially in the future when TCE is used in a proper
> host-device configuration, in the embedded/mobile systems that really
> want to save all useless work, and also in my planned research wrt.
> OpenCL to FPGA. So we might need to create a new simple binary format
> with multiple target binaries inside + some metadata (for example to
> save the dimensions).

I think the best way is, instead of defining a binary format (I was
originally thinking of an ELF with different binaries in different
sections, like the AMD SDK does), probably just to use the BC. The
workgroup function would be created with a different name (probably
including the dimensions, so the info is there) and the OpenCL-related
metadata is not touched, so it still points to the original kernel. The
passes need slight modifications to handle this, but if we update the
"binary image" of the kernel at that point, then we would have what we
want.

Drawbacks would be:
1) A little unneeded delay from the caching code in case no caching is
   wanted.
2) The binary code might grow quite big.

It would be nice if there was a way to enable/disable the caching, more or
less complying with the standard. Is there a way to define host-side
extensions?

Carlos
From: Carlos S. de La L. <car...@ur...> - 2011-12-16 15:43:09
Yep, the item is stored in fp16 format inside the i16, of course... I
thought it would work since you can have a target-independent f16 to f32
conversion, but that requires assuming the storage format is IEEE FP.

Anyway, pocl-wide it is enough as it is now, IMHO: if LLVM supports codegen
for halfs on a target then the kernel library uses it, otherwise it does
not.

Carlos

On Fri, 2011-12-16 at 10:18 -0500, Erik Schnetter wrote:
> The intrinsics do not work -- tried 3.0 and trunk. By looking at the
> code, I believe that this conversion intrinsic is only defined for ARM.
>
> The i16 contains a bit pattern representing the fp16 value; it cannot be
> interpreted as an integer value. It seems to me that using i16 is purely
> a hack to avoid introducing a new (and very limited) LLVM datatype,
> because by using an i16 one ensures that load/store etc. work correctly.
>
> -erik
From: Erik S. <esc...@pe...> - 2011-12-16 15:18:57
On Fri, Dec 16, 2011 at 7:03 AM, Carlos Sánchez de La Lama wrote:
> Just for clarification:
>
> There is no fp16 type in LLVM, at all, neither for computation nor
> storage. It is not defined in the LLVM assembly language.
>
> What clang does is generate i16 (integer) values for halfs, and convert
> i16 to floats before operating with them. The LLVM intrinsics convert
> i16 <-> float; there is no fp16 type. I would expect therefore those
> intrinsics to work on all the LLVM codegen targets (int to float works,
> so this should also).

The intrinsics do not work -- tried 3.0 and trunk. By looking at the code,
I believe that this conversion intrinsic is only defined for ARM.

The i16 contains a bit pattern representing the fp16 value; it cannot be
interpreted as an integer value. It seems to me that using i16 is purely a
hack to avoid introducing a new (and very limited) LLVM datatype, because
by using an i16 one ensures that load/store etc. work correctly.

-erik

--
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Carlos S. de La L. <car...@ur...> - 2011-12-16 11:59:57
Just for clarification:

There is no fp16 type in LLVM, at all, neither for computation nor storage.
It is not defined in the LLVM assembly language.

What clang does is generate i16 (integer) values for halfs, and convert i16
to floats before operating with them. The LLVM intrinsics convert i16 <->
float; there is no fp16 type. I would expect therefore those intrinsics to
work on all the LLVM codegen targets (int to float works, so this should
also).

From the pocl kernel library side I think the current way is correct: if
halfs are not mandatory in OpenCL, check for "half" support in the compiler
and activate the extension if it is there. Support meaning not only that
the compiler "eats" the keyword but also that its size is as the standard
defines it. So, as it is done.

To map halfs to real half-operating hardware, LLVM-side changes would be
needed to add half as a real type, as right now it is not possible (without
a lot of def-use chain analysis) to determine whether a float comes from a
"half/i16" or is a real float.

BR
Carlos

On Thu, 2011-12-15 at 14:32 -0500, Erik Schnetter wrote:
> The conversion intrinsics exist in LLVM, and are implemented in some of
> its backends. To my knowledge, currently only the ARM backend supports
> it, presumably via a machine instruction (or maybe via a sequence of
> machine instructions). Other backends will report an error when the LLVM
> code is lowered to machine code -- that is (as I had to find out),
> libkernel.a will build fine on all architectures, but the respective
> functions cannot be used.
>
> As you say, it should not be difficult to implement this generically for
> all other platforms, either in pocl, or (better) in LLVM. This may be
> slow, but the memory savings (in particular also if this is stored in a
> file) may make the slow conversion worthwhile for some applications.
>
> -erik

_______________________________________________
Pocl-devel mailing list
Poc...@li...
https://lists.sourceforge.net/lists/listinfo/pocl-devel
From: Erik S. <esc...@pe...> - 2011-12-15 19:32:09
The conversion intrinsics exist in LLVM, and are implemented in some of its
backends. To my knowledge, currently only the ARM backend supports it,
presumably via a machine instruction (or maybe via a sequence of machine
instructions). Other backends will report an error when the LLVM code is
lowered to machine code -- that is (as I had to find out), libkernel.a will
build fine on all architectures, but the respective functions cannot be
used.

As you say, it should not be difficult to implement this generically for
all other platforms, either in pocl, or (better) in LLVM. This may be slow,
but the memory savings (in particular also if this is stored in a file) may
make the slow conversion worthwhile for some applications.

-erik

2011/12/15 Pekka Jääskeläinen:
> On 12/15/2011 09:04 PM, Erik Schnetter wrote:
>> supports the half datatype
>
> How do you think so?
> http://llvm.org/docs/LangRef.html#t_floating
>
> As I wrote, I think it's supported only via those two conversion
> intrinsics:
> http://llvm.org/docs/LangRef.html#int_fp16
>
> My question was who implements those intrinsics (fp32 to fp16 to fp32),
> as they require some bit manipulation of the fp fields, AFAIK (extract
> mantissa, exponent, sign and put them back in the destination), and it's
> unlikely the hardware has direct instructions for such a conversion. Do
> you mean that the default lowering of those intrinsics produces the
> conversion code too?
>
> Using native halfs would mean that one can use smaller adders,
> multipliers, shifters etc. in the FPUs, which means energy savings in
> low-power designs (less switching activity). Too bad it seems not to be
> supported yet in LLVM, AFAIU.
>
> -- Pekka
From: Pekka J. <pek...@tu...> - 2011-12-15 19:17:54
On 12/15/2011 09:04 PM, Erik Schnetter wrote:
> supports the half datatype

How do you think so?
http://llvm.org/docs/LangRef.html#t_floating

As I wrote, I think it's supported only via those two conversion
intrinsics:
http://llvm.org/docs/LangRef.html#int_fp16

My question was who implements those intrinsics (fp32 to fp16 to fp32), as
they require some bit manipulation of the fp fields, AFAIK (extract
mantissa, exponent, sign and put them back in the destination), and it's
unlikely the hardware has direct instructions for such a conversion. Do you
mean that the default lowering of those intrinsics produces the conversion
code too?

Using native halfs would mean that one can use smaller adders, multipliers,
shifters etc. in the FPUs, which means energy savings in low-power designs
(less switching activity). Too bad it seems not to be supported yet in
LLVM, AFAIU.

-- Pekka
From: Erik S. <esc...@pe...> - 2011-12-15 19:05:02
Since LLVM supports the half datatype (I believe it was added about nine
months ago, especially for OpenCL), it generates these conversions itself.

I just updated (read: corrected) the autoconf rules that determine whether
half is supported. It seems that it is currently only supported on ARM.
Other platforms will have this code disabled. I will push this to my branch
soon.

To my knowledge, the "traditional" implementation of half would be to
perform all arithmetic operations in float precision, except possibly
expensive iterative operations (divide, sqrt), where fewer iterations may
be used than for float.

-erik

2011/12/15 Pekka Jääskeläinen:
> On 12/15/2011 06:24 PM, Erik Schnetter wrote:
>> OpenCL supports only two operations for halfs: vload_half, converting
>> it to a float, and vstore_half, converting from a float. Nothing else
>> exists explicitly, not even vectors of halfs. Essentially the only
>> thing one can do with the half type is to pass a half* to these
>> load/store routines.
>
> OK, interesting. Who then generates the float <-> half conversion code
> for the LLVM intrinsics? Does LLVM generate it automatically or do we
> need to provide conversion routines in pocl?
>
> -- Pekka
From: Pekka J. <pek...@tu...> - 2011-12-15 16:44:55
On 12/15/2011 06:24 PM, Erik Schnetter wrote:
> OpenCL supports only two operations for halfs: vload_half, converting it
> to a float, and vstore_half, converting from a float. Nothing else
> exists explicitly, not even vectors of halfs. Essentially the only thing
> one can do with the half type is to pass a half* to these load/store
> routines.

OK, interesting. Who then generates the float <-> half conversion code for
the LLVM intrinsics? Does LLVM generate it automatically or do we need to
provide conversion routines in pocl?

-- Pekka
From: Erik S. <esc...@pe...> - 2011-12-15 16:24:51
|
OpenCL supports only two operations for halfs: vload_half, converting it
to a float, and vstore_half, converting from a float. Nothing else exists
explicitly, not even vectors of halfs. Essentially the only thing one can
do with the half type is to pass a half* to these load/store routines.

There are routines such as float half_sin(float) that are only required to
have the precision offered by datatype half (allowing optimisations), but
the API is via float. There is text in the standard presumably allowing
this to be optimised to use operations that act directly on half values,
but this is not required.

I added code to detect whether clang supports half (called __fp16 in C),
and if so, these vload_half/vstore_half routines are available. half_sin
and friends are always available, forwarding to their float counterparts
by default -- I assume that target-specific optimisations can do better.

-erik

2011/12/15 Pekka Jääskeläinen <pek...@tu...>

> On 12/15/2011 01:09 AM, Erik Schnetter wrote:
> > Erik Schnetter has proposed merging lp:~schnetter/pocl/main into
> > lp:pocl.
> >
> > Requested reviews: pocl maintainers (pocl)
> >
> > For more details, see:
> > https://code.launchpad.net/~schnetter/pocl/main/+merge/85761
> >
> > I added support for the half datatype, protected by #ifdef cl_khr_fp16,
> > analogous to cl_khr_fp64. I don't know which targets support this
> > datatype (presumably all, since llvm supports them?), so I enabled this
> > for all targets -- this will break things if this is wrong.
>
> Just curious...
>
> How does LLVM/Clang support the half by default nowadays? I've heard
> that for NVIDIA GPUs, for example, the half is supported only as a
> storage format. That is, you have the float in 16bit format in memory
> but whenever you compute something with halfs, they are converted to
> single precision floats to avoid the need for separate floating point
> units for halfs.
>
> Just curious to hear what happens when you use half floats in LLVM/Clang
> now -- do they convert them to single precision fp whenever computation
> occurs? The last time I checked, 'half' was not a datatype in the LLVM
> IR thus they could not be selected (to be implemented with the target
> ISA) nicely.
>
> It seems there are only two intrinsics for halfs available:
> http://llvm.org/docs/LangRef.html#int_fp16
>
> Does Clang generate those automatically for halfs in OpenCL C now? For
> example if you perform a basic operation halfA + halfB, what happens?
>
> I'm interested in a proper half support as for embedded/mobile it is
> more beneficial than just for saving the memory bandwidth as you can
> save in the area of the FPU, improve the speed, lower the energy
> consumption, etc. if you can do with half floats for your computations.
> But I think they do not accept it as a proper datatype in LLVM before
> there is a real (read: off-the-shelf) target in LLVM that supports it
> natively.
>
> --
> --Pekka
>
> _______________________________________________
> Pocl-devel mailing list
> Poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel

--
Erik Schnetter <esc...@pe...>   http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
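The storage-format behaviour discussed in this thread -- halfs live in memory as 16 bits while all arithmetic happens in float -- boils down to a bit-level decode on load. As a rough sketch of what a software vload_half fallback could look like (the function name and structure are my assumptions for illustration, not actual pocl code):

```c
#include <stdint.h>
#include <math.h>

/* Decode an IEEE 754 binary16 ("half") bit pattern into a float.
   Illustrates the "storage-only" half support: the 16-bit value is
   widened to float before any computation takes place. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (h >> 15) & 1u;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    float f;

    if (exp == 0) {
        /* Zero or subnormal: value = mant * 2^-24 */
        f = (float)mant / 16777216.0f;          /* 16777216 = 2^24 */
    } else if (exp == 31) {
        /* All-ones exponent: infinity (mant == 0) or NaN */
        f = mant ? NAN : INFINITY;
    } else {
        /* Normal: value = (1 + mant/1024) * 2^(exp - 15) */
        f = 1.0f + (float)mant / 1024.0f;
        int e = (int)exp - 15;
        while (e > 0) { f *= 2.0f; e--; }       /* exact scaling by */
        while (e < 0) { f *= 0.5f; e++; }       /* powers of two    */
    }
    return sign ? -f : f;
}
```

vstore_half would need the inverse encoding plus a rounding-mode choice (the _rte/_rtz/_rtp/_rtn suffixes in the OpenCL spec).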
From: Carlos S. de La L. <car...@ur...> - 2011-12-15 12:36:02
>> I'm finding that many of the object allocation/deallocation routines
>> are not careful about allocation and freeing memory. I think that some
>> reference counting may be necessary, but this doesn't seem to be
>> employed consistently.

I am aware of this... it is a consequence of always going too fast. But I
agree we need to get it back to a controlled state ASAP.

> I'm not sure how it's best to implement the reference counting
> generically in C to avoid code duplication for all the different
> structure types. Probably through some set of cpp macros. E.g.
> POCL_RELEASE(_OBJ) which decrements the ref count and if it goes to
> zero, frees it and POCL_RETAIN(_TYPE, _OBJ) which does the opposite.
> These macros then would assume the struct has an unsigned _ref_count
> member. Or similar.

Sounds ok to me.

>> Should we add "magic markers" into the objects, which would help
>> identify routines accessing objects that have been freed? I am thinking
>> in particular of program, kernel, and mem objects, i.e. adding such
>> markers to their declarations in pocl_cl.h, and checking them in
>> various places.
>
> Valgrind should be able to spot these. In my opinion we should just run
> the test cases in valgrind and fix the leaks and references to the freed
> objects it finds instead of adding some runtime overhead (and clutter)
> to the code.

Yep, I agree with Pekka... I think the less code complexity the better
(my usual point). And valgrind for sanitizing the current leaking code.

BR

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-15 08:06:13
On 12/15/2011 01:09 AM, Erik Schnetter wrote:
> Erik Schnetter has proposed merging lp:~schnetter/pocl/main into lp:pocl.
>
> Requested reviews: pocl maintainers (pocl)
>
> For more details, see:
> https://code.launchpad.net/~schnetter/pocl/main/+merge/85761
>
> I added support for the half datatype, protected by #ifdef cl_khr_fp16,
> analogous to cl_khr_fp64. I don't know which targets support this
> datatype (presumably all, since llvm supports them?), so I enabled this
> for all targets -- this will break things if this is wrong.

Just curious...

How does LLVM/Clang support the half by default nowadays? I've heard that
for NVIDIA GPUs, for example, the half is supported only as a storage
format. That is, you have the float in 16bit format in memory but whenever
you compute something with halfs, they are converted to single precision
floats to avoid the need for separate floating point units for halfs.

Just curious to hear what happens when you use half floats in LLVM/Clang
now -- do they convert them to single precision fp whenever computation
occurs? The last time I checked, 'half' was not a datatype in the LLVM IR
thus they could not be selected (to be implemented with the target ISA)
nicely.

It seems there are only two intrinsics for halfs available:
http://llvm.org/docs/LangRef.html#int_fp16

Does Clang generate those automatically for halfs in OpenCL C now? For
example if you perform a basic operation halfA + halfB, what happens?

I'm interested in a proper half support as for embedded/mobile it is more
beneficial than just for saving the memory bandwidth as you can save in
the area of the FPU, improve the speed, lower the energy consumption,
etc. if you can do with half floats for your computations. But I think
they do not accept it as a proper datatype in LLVM before there is a real
(read: off-the-shelf) target in LLVM that supports it natively.

--
--Pekka
From: Pekka J. <pek...@tu...> - 2011-12-14 20:21:48
On 12/14/2011 08:50 PM, Erik Schnetter wrote:
> I'm finding that many of the object allocation/deallocation routines are
> not careful about allocation and freeing memory. I think that some
> reference counting may be necessary, but this doesn't seem to be
> employed consistently.

Ah, it seems the OpenCL spec assumes reference counting for its object
types. It includes clRetain*/clRelease* functions for the OpenCL
platform/runtime API struct types.

I'm not sure how it's best to implement the reference counting generically
in C to avoid code duplication for all the different structure types.
Probably through some set of cpp macros. E.g. POCL_RELEASE(_OBJ) which
decrements the ref count and if it goes to zero, frees it and
POCL_RETAIN(_TYPE, _OBJ) which does the opposite. These macros then would
assume the struct has an unsigned _ref_count member. Or similar.

> Should we add "magic markers" into the objects, which would help
> identify routines accessing objects that have been freed? I am thinking
> in particular of program, kernel, and mem objects, i.e. adding such
> markers to their declarations in pocl_cl.h, and checking them in various
> places.

Valgrind should be able to spot these. In my opinion we should just run
the test cases in valgrind and fix the leaks and references to the freed
objects it finds instead of adding some runtime overhead (and clutter) to
the code.

--
--Pekka
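A minimal sketch of the macro approach proposed above, simplified to a single-argument POCL_RETAIN and assuming the unsigned _ref_count member; none of this is actual pocl code:

```c
#include <stdlib.h>

/* Sketch only: names mirror the mailing-list proposal, not pocl's
   actual implementation.  Objects start life with _ref_count == 1. */
#define POCL_RETAIN(obj)  (++(obj)->_ref_count)
#define POCL_RELEASE(obj)                 \
  do {                                    \
    if (--(obj)->_ref_count == 0)         \
      free(obj);                          \
  } while (0)

/* Hypothetical object layout with the assumed _ref_count member. */
typedef struct {
  unsigned _ref_count;
  /* ... object payload ... */
} pocl_obj_t;

static pocl_obj_t *pocl_obj_create(void)
{
  pocl_obj_t *o = calloc(1, sizeof *o);
  if (o)
    o->_ref_count = 1;   /* caller holds the initial reference */
  return o;
}
```

The public clRetain*/clRelease* entry points would then be one-line wrappers around these macros; the do/while(0) wrapper makes POCL_RELEASE behave like a single statement inside if/else.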
From: Erik S. <esc...@pe...> - 2011-12-14 18:50:55
I'm finding that many of the object allocation/deallocation routines are
not careful about allocation and freeing memory. I think that some
reference counting may be necessary, but this doesn't seem to be employed
consistently.

Should we add "magic markers" into the objects, which would help identify
routines accessing objects that have been freed? I am thinking in
particular of program, kernel, and mem objects, i.e. adding such markers
to their declarations in pocl_cl.h, and checking them in various places.

-erik

--
Erik Schnetter <esc...@pe...>   http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Pekka J. <pek...@tu...> - 2011-12-14 18:02:42
Hi,

I added LLVM_3_0 and LLVM_3_1 (and LLVM_SVN) version macro generation to
the configure. Currently only a very minor fix was needed to make pocl
work with the LLVM top-of-tree. These macros can be used to make it work
both with the 3.0 and (upcoming) 3.1 in the future.

It's better we always support the latest released LLVM version in pocl so
let's keep it working with both.

--
Pekka
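A sketch of how those configure-generated macros might be used to guard version-specific code; only the macro names (LLVM_3_0, LLVM_3_1, LLVM_SVN) come from the message, the helper function is made up for illustration:

```c
#include <string.h>

/* Normally configure defines exactly one of LLVM_3_0 / LLVM_3_1 /
   LLVM_SVN; define LLVM_3_0 here so the sketch is self-contained. */
#ifndef LLVM_3_0
#define LLVM_3_0 1
#endif

/* Hypothetical helper selecting an implementation per LLVM version.
   Top-of-tree (LLVM_SVN) is handled like the next release, 3.1. */
#if defined(LLVM_SVN) || defined(LLVM_3_1)
static const char *llvm_version_tag(void) { return "3.1+"; }
#elif defined(LLVM_3_0)
static const char *llvm_version_tag(void) { return "3.0"; }
#endif
```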
From: Pekka J. <pek...@tu...> - 2011-11-15 15:43:36
Hi,

Some interesting additions to the OpenCL standard in the newly released
1.2:

* Built-in kernels:

  "A built-in kernel is a kernel that is executed on an OpenCL device or
  custom device by fixed-function hardware or in firmware. Applications
  can query the built-in kernels supported by a device or custom device.
  A program object can only contain kernels written in OpenCL C or
  built-in kernels but not both. See also Kernel and Program."

  Fits the OpenCL to TTA-ASIP case quite well... one can have highly
  tuned implementations of some kernels embedded in the FPGA/ROM of the
  ASIP and exploit them from the host program in a standard way.

* printf is now in the main specs with specified behavior:

  "The printf built-in function writes output to an
  implementation-defined stream such as stdout under control of the
  string pointed to by format that specifies how subsequent arguments are
  converted for output. If there are insufficient arguments for the
  format, the behavior is undefined. If the format is exhausted while
  arguments remain, the excess arguments are evaluated (as always) but
  are otherwise ignored. The printf function returns when the end of the
  format string is encountered.

  ...When the event that is associated with a particular kernel
  invocation is completed, the output of all printf() calls executed by
  this kernel invocation is flushed to the implementation-defined output
  stream. Calling clFinish on a command queue flushes all pending output
  by printf in previously enqueued and completed commands to the
  implementation-defined output stream..."

  You can query the printf "buffer" size in the device: "Maximum size of
  the internal buffer that holds the output of printf calls from a
  kernel. The minimum value for the EMBEDDED profile is 1 KB."

* Support for a separated linkage phase: storage class specifiers
  'extern' and 'static' are now keywords.

--
Pekka
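The flush-on-completion semantics quoted above can be modeled with a fixed-size per-kernel buffer. This is a toy model only; the names and structure are illustrative inventions, not the pocl or OpenCL API:

```c
#include <stdio.h>
#include <string.h>

/* Kernel printf output accumulates here; the OpenCL 1.2 minimum buffer
   size for the EMBEDDED profile is 1 KB. */
#define PRINTF_BUFFER_SIZE 1024

static char   printf_buffer[PRINTF_BUFFER_SIZE];
static size_t printf_used = 0;

/* Stand-in for a kernel-side printf call: append formatted output to
   the buffer instead of writing it out immediately. */
static int kernel_printf(const char *s)
{
    size_t len = strlen(s);
    if (printf_used + len > PRINTF_BUFFER_SIZE)
        return -1;                 /* buffer exhausted: output dropped */
    memcpy(printf_buffer + printf_used, s, len);
    printf_used += len;
    return 0;
}

/* Stand-in for event completion / clFinish: flush the accumulated
   output to the implementation-defined stream (stdout here). */
static void flush_on_event_complete(void)
{
    fwrite(printf_buffer, 1, printf_used, stdout);
    printf_used = 0;
}
```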
From: Pekka J. <pek...@tu...> - 2011-11-09 14:27:43
On 11/09/2011 03:40 PM, Erik Schnetter wrote:
> The math lib is not the only thing that could be useful. For example,
> printf is a very useful OpenCL extension that should be supported. The
> underlying I/O stream representation probably needs to be implemented
> from scratch, but the formatting code should work fine.

I'm not sure of printf(). Maybe that should use the stdio.h and -lc of
the device because:

1) The actual stdout stream destination is fully platform (OS+device)
   dependent.

2) The "inlining benefits" do not apply to it as it's probably only used
   for debug printouts. Inter-WI DLP does not matter here.

3) It's an optional extension. In case the target does not support it,
   the target just doesn't advertise it as a vendor extension.

--
Pekka