From: Carlos S. de La L. <car...@ur...> - 2011-12-19 08:48:51
Hi,

moved this from the bug reporting comments (better to discuss on the list, I think).

>> How many kernels are cached? E.g. in pthread.c, there is an if statement
>> "if (d->current_kernel != kernel)", as if only one kernel was cached. If
>> that is so, would that be the right place to introduce a larger cache?

Only one right now. A larger cache is needed, I agree.

> Yes, Carlos. The cached result depends on the dimensions so saving
> multiple versions could work. Although I'd like to cache the final
> binary, not only the bitcode, to save all the compilation costs. This
> will be useful especially in the future when TCE is used in a proper
> host-device configuration, in the embedded/mobile systems that really
> want to save all useless work, and also in my planned research wrt.
> OpenCL to FPGA. So we might need to create a new simple binary format
> with multiple target binaries inside + some metadata (for example to
> save the dimensions).

I think the best way is, instead of defining a binary format (I was originally thinking of an ELF with different binaries in different sections, like the AMD SDK does), just to use the BC. The workgroup function would be created with a different name (probably including the dimensions, so the info is there) and the OpenCL-related metadata is not touched, so it still points to the original kernel. The passes need slight modifications to handle this, but if we update the "binary image" of the kernel at that point, then we would have what we want.

Drawbacks would be:

1) A small unneeded delay from the caching code in case no caching is wanted.
2) The binary code might grow quite big.

It would be nice if there was a way to enable/disable the caching, more or less complying with the standard. Is there a way to define host-side extensions?

Carlos
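A minimal sketch of the naming idea above (one work-group function per set of dimensions, with the dimensions encoded in the symbol name). The helper name `wg_function_name` and the `_<kernel>_wg_<x>_<y>_<z>` pattern are assumptions made for illustration; they are not taken from the pocl sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the name of the work-group function generated
   for one kernel and one set of local dimensions, so that several versions
   can live side by side in the same LLVM module.
   Returns the number of characters the full name needs (snprintf
   semantics), or -1 if the kernel name is empty. */
int wg_function_name(char *out, size_t out_size, const char *kernel_name,
                     size_t local_x, size_t local_y, size_t local_z)
{
    assert(out != NULL && kernel_name != NULL);
    if (strlen(kernel_name) == 0)
        return -1;
    return snprintf(out, out_size, "_%s_wg_%zu_%zu_%zu",
                    kernel_name, local_x, local_y, local_z);
}
```

With this scheme the kernel metadata can keep pointing to the original kernel function, and the dimensions of a cached version can be recovered from the symbol name alone.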
From: Pekka J. <pek...@tu...> - 2011-12-19 09:21:43
Hi,

I think you missed this:

>> Although I'd like to cache the final
>> binary, not only the bitcode

I meant to (also) cache (or store in the binary format) the code gen results, i.e., the final bits. As you know, for TCE it might take considerable time to generate the code from the bitcode.

Also, in general, in a production system that uses OpenCL for the program, we do not want to execute the compiler at all if one can avoid it. We might even exclude the compiler from the host to provide something like the "standalone mode" (to ship binaries only) but in a more standards compliant way. Useless overhead is useless overhead. It's especially the case for mobile devices with energy constraints.

All in all, I don't think there's many cases when you do *not* want caching (even over multiple OpenCL program runs). If it works "behind the scenes" like ccache (and includes the pocl+LLVM versions in the hash) it should always be beneficial. Disk space is cheap.

I think clCreateProgramWithBinary can also contain all this functionality for manual caching:

"The program binary can consist of either or both:
Device-specific code and/or,
Implementation-specific intermediate representation (IR) which will be
converted to the device-specific code."

"OpenCL allows applications to create a program object using the program source or binary and build appropriate program executables. This can be very useful as it allows applications to load program source and then compile and link to generate a program executable online on its first instance for appropriate OpenCL devices in the system. These executables can now be queried and cached by the application. Future instances of the application launching will no longer need to compile and link the program executables. The cached executables can be read and loaded by the application, which can help significantly reduce the application initialization time."

BR,
--
Pekka
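The ccache-like keying mentioned above could be sketched as follows. This is only an illustration: the function names (`cache_key`, `cache_file_name`), the choice of 64-bit FNV-1a, and the `.bin` suffix are assumptions, and a real cache would more likely use a cryptographic hash to make collisions negligible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fold one NUL-terminated string into a running 64-bit FNV-1a hash. */
static uint64_t fnv1a_update(uint64_t hash, const char *s)
{
    for (; *s != '\0'; ++s) {
        hash ^= (unsigned char)*s;
        hash *= 0x100000001b3ULL;        /* FNV prime */
    }
    /* Mix in a separator byte so that field boundaries affect the hash. */
    hash ^= 0xff;
    hash *= 0x100000001b3ULL;
    return hash;
}

/* Hypothetical cache key: covers everything that can change the generated
   code, including the pocl and LLVM versions as suggested in the thread. */
uint64_t cache_key(const char *source, const char *build_options,
                   const char *pocl_version, const char *llvm_version)
{
    assert(source && build_options && pocl_version && llvm_version);
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
    h = fnv1a_update(h, source);
    h = fnv1a_update(h, build_options);
    h = fnv1a_update(h, pocl_version);
    h = fnv1a_update(h, llvm_version);
    return h;
}

/* Turn the key into a file name for the on-disk cache entry. */
int cache_file_name(char *out, size_t out_size, uint64_t key)
{
    assert(out != NULL);
    return snprintf(out, out_size, "%016llx.bin", (unsigned long long)key);
}
```

Because the compiler versions are part of the key, upgrading pocl or LLVM simply misses the old entries instead of loading stale binaries.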
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 10:18:57
>>> Although I'd like to cache the final
>>> binary, not only the bitcode
>
> I meant to (also) cache (or store in the binary format) the code gen
> results, i.e., the final bits.

Ok, I had actually missed it. Then some binary format is needed, I think we can reuse AMD one (just uses ELF sections for different binaries) as it is well specified.

> All in all, I don't think there's many cases when you do *not* want
> caching (even over multiple OpenCL program runs). If it works "behind
> the scenes" like ccache (and includes the pocl+LLVM versions in the hash)
> it should always be beneficial. Disk space is cheap.

Binary caching is probably always OK. I was referring to storing replicated WG functions in the same LLVM IR module when I said it has drawbacks.

> I think clCreateProgramWithBinary can also contain all this functionality
> for manual caching:
>
> "The program binary can consist of either or both:
> Device-specific code and/or,
> Implementation-specific intermediate representation (IR) which will be
> converted to the device-specific code."

Yep, the binary format is really implementation dependent, so you can put whatever you want there. The point is that in our case we always need the LLVM IR. While the OpenCL API allows storing only the binary for a device (the AMD SDK can do this, and gives an "invalid device" error on loading a binary for a different architecture), in our case storing the binary would also mean fixing the dimensions, which is not API compliant in the general case AFAIK.

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 10:36:52
On 12/19/2011 12:18 PM, Carlos Sánchez de La Lama wrote:
> I think we can reuse AMD one (just uses ELF sections for different
> binaries) as it is well specified.

IIRC, the AMD's ELF files store the LLVM bitcode and the AMD-IL (both IRs), not the final binary, right?

The OpenCL API for fetching and loading the program binaries is multi-device. Thus the format should not be tied to an architecture as it can contain the same kernels compiled for multiple devices.

I'm not sure if the ELF provides some benefits for this scenario. We need basically a format that stores multiple independent (sometimes ELF?) binaries for the compiled program(s) and the LLVM bitcode as a separate one. The compiled programs should be quickly loadable to the dynamic linker, preferably without extracting them first to a separate file. Can ELF support this nicely? The stored binaries (same kernels compiled for multiple dimensions) might contain clashing symbols, AFAIU so they should be stored as "binary blob sections" in ELF, not as "program sections" with linkage/relocation info? You know ELF better than I do...

What about FatELF?
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format#FatELF:_Universal_Binaries_for_Linux

Let's see...

--
Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 11:00:05
>> I think we can reuse AMD one (just uses ELF sections for different
>> binaries) as it is well specified.
>
> IIRC, the AMD's ELF files store the LLVM bitcode and the AMD-IL (both IRs),
> not the final binary, right?

Also the final binary, optionally.

> The OpenCL API for fetching and loading the program binaries is multi-device.
> Thus the format should not be tied to an architecture as it can contain the
> same kernels compiled for multiple devices.

What does this mean?

> I'm not sure if the ELF provides some benefits for this scenario. We need
> basically a format that stores multiple independent (sometimes ELF?)
> binaries for the compiled program(s) and the LLVM bitcode as a separate one.
> The compiled programs should be quickly loadable to the dynamic linker,
> preferably without extracting them first to a separate file. Can ELF
> support this nicely? The stored binaries (same kernels compiled for
> multiple dimensions) might contain clashing symbols, AFAIU so they should
> be stored as "binary blob sections" in ELF, not as "program sections" with
> linkage/relocation info? You know ELF better than I do...

ELF is used as a trick here. It is not real "ELF". You just get the binary/bytecode/source file/whatever and put it into an ELF section. No relocations or anything at that level; the ELF is used only as a way to put all the files together. If some of the binaries are ELF, then the whole ELF would be inside a section (you would need to "objcopy" it out, and then you would get a real ELF).

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Any other option (tar/zip/ELF/whatever) would do the same, but as this is documented and used on a OpenCL SDK I would suggest doing the same.

> What about FatELF?

If it provides advantages in our use case and it is widespread enough, why not. But if it does not, then using ELF (not real ELF, but ELF as in the AMD SDK "BIF" format) is probably easier to implement (just pack the files together).

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 11:19:37
On 12/19/2011 12:59 PM, Carlos Sánchez de La Lama wrote:
> Also the final binary, optionally.

OK. In our case the kernel just might have multiple versions for the multiple dimensions in the .text section. Should work...

>> The OpenCL API for fetching and loading the program binaries is
>> multi-device.
>> Thus the format should not be tied to an architecture as it can
>> contain the
>> same kernels compiled for multiple devices.
>
> What does this mean?

I think it means that for example in case of AMD you could have the CPU and the GPU (device) versions of the program in the same (OpenCL) binary. I see from the specs that they do not support this but store only the GPU or CPU bits but not both:

"By default, OpenCL generates a binary that has LLVM IR, AMD IL, and the executable for the GPU (.llvmir, .amdil, and .text sections), as well as LLVM IR and the executable for the CPU (.llvmir and .text sections)."?

ELF has only one architecture-specific .text section, IIUC, so it would not work for this. Anyway, we can add a separate wrapper for the multi-device case on top of this (or use the FatELF) later, if we see the need.

http://icculus.org/fatelf/

"FatELF lets you pack binaries into one file, seperated by OS ABI, OS ABI version, byte order and word size, and most importantly, CPU architecture."

> Any other option (tar/zip/ELF/whatever) would do the same, but as this
> is documented and used on a OpenCL SDK I would suggest doing the same.

I do not consider the main advantage to be that it's used by AMD. But in case it can be used as a directly dlopenable program binary then it's a real advantage (BTW in MacOS or at least Windows we might need something else then?). It would avoid the objcopy step in case the binary contains a kernel version suitable for launching directly for the given dimensions... probably a small saving but still a nifty thing to have.

--
Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 11:43:43
>> Also the final binary, optionally.
>
> OK. In our case the kernel just might have multiple versions for
> the multiple dimensions in the .text section. Should work...

Or even multiple versions of the binary in different sections.

>>> The OpenCL API for fetching and loading the program binaries is
>>> multi-device.
>>> Thus the format should not be tied to an architecture as it can
>>> contain the
>>> same kernels compiled for multiple devices.
>>
>> What does this mean?
>
> I think it means that for example in case of AMD you could have
> the CPU and the GPU (device) versions of the program in the same
> (OpenCL) binary. I see from the specs that they do not support this but
> store only the GPU or CPU bits but not both:
>
> "By default, OpenCL generates a binary that has LLVM IR, AMD IL, and the
> executable for the GPU (.llvmir, .amdil, and .text sections), as well as
> LLVM IR and the executable for the CPU (.llvmir and .text sections)."?

Given that our LLVM IR can be tied to a device-dependent architecture, we are not going to support binary retargeting anyway, so we should not bother about that.

> ELF has only one architecture-specific .text section, IIUC so it would
> not work for this.

Again, ELF is used in BIF just as a wrapper, so you can create a ".myownstuff" section and put whatever you want inside. There is no need for it to be in the ".text".

>> Any other option (tar/zip/ELF/whatever) would do the same, but as this
>> is documented and used on a OpenCL SDK I would suggest doing the same.
>
> I do not consider the main advantage to be that it's used by AMD. But
> in case it can be used as a directly dlopenable program binary then it's
> a real advantage (BTW in MacOS or at least Windows we might need something
> else then?). It would avoid the objcopy step in case the binary contains
> a kernel version suitable for launching directly for the given
> dimensions... probably a small saving but still a nifty thing to have.

If we want to be able to dlopen the binary directly then we need something like this FatELF... but as you found out, the project seems to be half dead, and dlopen is not going to support FatELF binaries on almost any system, so we would end up with more stuff to fix ourselves. I would go for the "keep it simple" way.

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 12:05:02
On 12/19/2011 01:43 PM, Carlos Sánchez de La Lama wrote:
> Again, ELF is used in BIF just as a wrapper, so you can create a
> ".myownstuff" section and put whatever you want inside. There is no need
> for it to be in the ".text"

I know. But then, how I see it, ELF is just "abused" as an archive format here without actual benefits to generic file archive formats. If you do not use the executable and linking related info but treat all contents as "unknown binary blobs" with some "custom metadata", what's the point of using such a format? This adds a dependency to the ELF writing libraries (think Windows or embedded, for example) even when something much simpler could suffice.

If we use ELF as the "OpenCL binary format" I think we should keep the binary program dlopenable (in .text, .rodata, .data) to have the benefits from the standard format (at least in operating systems that support ELF). This means (AFAIU) compiling the multiple versions of the kernel to the same .text with different function names. Or, if something simpler is chosen it could be a simple wrapper with quickly extractable final files (even tar could work or even our own simple format).

The alternative could be to define a simplistic custom wrapper format that wraps in the final binaries and the bc. It would not care of the format of the final binaries (thus the OpenCL binary container *contents* would be platform-specific, but the container itself not) and would store enough metadata for choosing the correct final binary based on the dimensions.

--
Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 12:27:03
>> Again, ELF is used in BIF just as a wrapper, so you can create a
>> ".myownstuff" section and put whatever you want inside. There is no need
>> for it to be in the ".text"
>
> I know. But then, how I see it, ELF is just "abused" as an archive format here
> without actual benefits to generic file archive formats. If you do not
> use the executable and linking related info but treat all contents as
> "unknown binary blobs" with some "custom metadata", what's the point of
> using such a format?

None, it is just a format; any is OK as long as it is simple enough (that's why I prefer ELF over FatELF, but any other simpler form would be even better).

> If we use ELF as the "OpenCL binary format" I think we should keep the binary
> program dlopenable (in .text, .rodata, .data) to have the benefits from the
> standard format (at least in operating systems that support ELF). This means
> (AFAIU) compiling the multiple versions of the kernel to the same .text with
> different function names. Or, if something simpler is chosen it could be a
> simple wrapper with quickly extractable final files (even tar could work or
> even our own simple format).

Agreed, the directly loadable ELF would be the nicest. But how to support this in a way that does not clash with systems where the loadable libraries are not ELF? One option is to leave binary generation solely to the driver so it can take care of this, but even some drivers (native/pthread) might have to be ELF or non-ELF depending on the system (darwin is not ELF, for example).

> The alternative could be to define a simplistic custom wrapper format that
> wraps in the final binaries and the bc. It would not care of the format of
> the final binaries (thus the OpenCL binary container *contents* would be
> platform-specific, but the container itself not) and would store enough
> metadata for choosing the correct final binary based on the dimensions.

That was my first thought. I think it is the simplest way. Any format capable of storing several files/buffers together will do. I said ELF because BIF is defined as ELF, but any format will do. I suspect AMD chose ELF because they had to put the ELF libraries there anyway (the binary inside the .text is again ELF in their case). But I would go for any sensible format (an existing one, no need to reinvent the wheel with a custom format).

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 12:42:05
On 12/19/2011 02:26 PM, Carlos Sánchez de La Lama wrote:
> That was my first thought. I think it is the simplest way. Any format
> capable of storing several files/buffers together will do.

I'd say let's go with this. The "directly dloadable" requirement is not as important as "portability" and sharing common code across all platforms.

I wonder if there's a simple archive format available on all platforms or should we use our own very simple format. I wouldn't like to rely on 'tar' or 'zip' availability, for example, as those contain even too much info (we do not need file permission data etc.).

Yes, the device drivers need to produce the final binaries as only they know the correct toolchains and compiler arguments for the target, so a "driver hook" is needed to get the binaries in/out from them.

--
Pekka
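To make the "very simple format" option concrete, here is a rough sketch of a container that stores several opaque binaries plus the dimensions each was compiled for. Everything in it is an assumption for illustration: the "PCLB" magic, the `entry_t` layout, the function names, and the use of host byte order (a real format would fix the endianness and probably add a version field).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout: magic[4] | uint32 count | entry_t[count] | raw binary payloads */
#define CONTAINER_MAGIC "PCLB"   /* hypothetical magic value */
#define HEADER_SIZE 8u

typedef struct {
    uint32_t local_size[3];  /* dimensions this binary was compiled for */
    uint32_t offset;         /* payload offset from start of container */
    uint32_t size;           /* payload size in bytes */
} entry_t;

/* Pack `count` blobs into `buf`; fills in entries[i].offset.
   Returns the total container size, or 0 if `buf` is too small. */
size_t container_write(unsigned char *buf, size_t buf_size, entry_t *entries,
                       const unsigned char *const *blobs, uint32_t count)
{
    assert(buf != NULL && entries != NULL && blobs != NULL);
    size_t table_size = (size_t)count * sizeof(entry_t);
    size_t total = HEADER_SIZE + table_size;
    for (uint32_t i = 0; i < count; ++i) {
        entries[i].offset = (uint32_t)total;
        total += entries[i].size;
    }
    if (total > buf_size)
        return 0;
    memcpy(buf, CONTAINER_MAGIC, 4);
    memcpy(buf + 4, &count, sizeof count);
    memcpy(buf + HEADER_SIZE, entries, table_size);
    for (uint32_t i = 0; i < count; ++i)
        memcpy(buf + entries[i].offset, blobs[i], entries[i].size);
    return total;
}

/* Find the binary compiled for `local_size`. Returns a pointer into `buf`
   and stores its size in *size_out, or returns NULL if absent or corrupt. */
const unsigned char *container_find(const unsigned char *buf, size_t buf_size,
                                    const uint32_t local_size[3],
                                    uint32_t *size_out)
{
    uint32_t count;
    assert(buf != NULL && size_out != NULL);
    if (buf_size < HEADER_SIZE || memcmp(buf, CONTAINER_MAGIC, 4) != 0)
        return NULL;
    memcpy(&count, buf + 4, sizeof count);
    if (count > (buf_size - HEADER_SIZE) / sizeof(entry_t))
        return NULL;
    for (uint32_t i = 0; i < count; ++i) {
        entry_t e;
        memcpy(&e, buf + HEADER_SIZE + (size_t)i * sizeof e, sizeof e);
        if (memcmp(e.local_size, local_size, sizeof e.local_size) != 0)
            continue;
        if (e.offset > buf_size || e.size > buf_size - e.offset)
            return NULL;
        *size_out = e.size;
        return buf + e.offset;
    }
    return NULL;
}
```

The container never inspects the payloads, so the contents can be ELF, Mach-O, or LLVM bitcode depending on what the device driver hands over, while the wrapper code stays identical on all platforms.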
From: Erik S. <esc...@pe...> - 2011-12-19 14:17:17
The previous discussion was much about file formats. Does this mean that enqueuing a cached kernel would still require a dlopen? I was more hoping for caching the kernels in memory, so that enqueuing a kernel is really as cheap as an indirect function call.

-erik

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Pekka J. <pek...@tu...> - 2011-12-19 14:19:36
On 12/19/2011 04:17 PM, Erik Schnetter wrote:
> The previous discussion was much about file formats. Does this mean that
> enqueuing a cached kernel would still require a dlopen? I was more
> hoping for caching the kernels in memory, so that enqueuing a kernel is
> really as cheap as an indirect function call.

We should do both, cache the previously called functions in memory and also enable offline caching.

--
Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 14:22:14
Actually that is already done. It gets dlopened just the first time IIRC, but just for one kernel, like all the caching. This needs to be expanded.

Carlos

On 19/12/2011, at 16:17, Erik Schnetter wrote:

> The previous discussion was much about file formats. Does this mean that
> enqueuing a cached kernel would still require a dlopen? I was more hoping
> for caching the kernels in memory, so that enqueuing a kernel is really as
> cheap as an indirect function call.
>
> -erik
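Expanding the single-entry check into a small in-memory table could look roughly like the sketch below. All names here (`kernel_cache`, `kernel_cache_get`, `wg_loader`, `stub_loader`) are hypothetical and not taken from pthread.c; the loader callback stands in for whatever the driver does today on a miss (compile if needed, then dlopen/dlsym), and `stub_loader` exists only so the sketch can be exercised without a real shared object.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*wg_fn)(void **args);
/* Called on a cache miss; in a real driver this would dlopen/dlsym. */
typedef wg_fn (*wg_loader)(const void *kernel, const size_t local[3]);

#define CACHE_SLOTS 16

typedef struct {
    const void *kernel;   /* identity of the kernel object */
    size_t local[3];      /* local dimensions the code was generated for */
    wg_fn fn;             /* ready-to-call work-group function */
} cache_entry;

typedef struct {
    cache_entry slots[CACHE_SLOTS];
    size_t used;          /* number of valid slots */
    size_t next_victim;   /* round-robin replacement index */
} kernel_cache;           /* must be zero-initialized before use */

/* Return the cached function for (kernel, local), loading it on a miss. */
wg_fn kernel_cache_get(kernel_cache *c, const void *kernel,
                       const size_t local[3], wg_loader load)
{
    assert(c != NULL && kernel != NULL && load != NULL);
    for (size_t i = 0; i < c->used; ++i) {
        cache_entry *e = &c->slots[i];
        if (e->kernel == kernel &&
            memcmp(e->local, local, sizeof e->local) == 0)
            return e->fn;                 /* hit: just a table lookup */
    }
    wg_fn fn = load(kernel, local);       /* miss: the expensive path */
    if (fn == NULL)
        return NULL;
    size_t slot;
    if (c->used < CACHE_SLOTS) {
        slot = c->used++;
    } else {
        slot = c->next_victim;
        c->next_victim = (c->next_victim + 1) % CACHE_SLOTS;
    }
    c->slots[slot].kernel = kernel;
    memcpy(c->slots[slot].local, local, sizeof c->slots[slot].local);
    c->slots[slot].fn = fn;
    return fn;
}

/* Stand-in loader for demonstration: counts how often it is invoked. */
int stub_load_calls = 0;
void stub_wg(void **args) { (void)args; }
wg_fn stub_loader(const void *kernel, const size_t local[3])
{
    (void)kernel;
    (void)local;
    ++stub_load_calls;
    return stub_wg;
}
```

On a hit, enqueuing costs a short linear scan plus an indirect call, which is close to what Erik asks for; a hash table would only pay off if many kernel/dimension pairs are live at once.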
From: Pekka J. <pek...@tu...> - 2012-03-05 09:14:27
On 12/19/2011 04:17 PM, Erik Schnetter wrote:
> The previous discussion was much about file formats. Does this mean that
> enqueuing a cached kernel would still require a dlopen? I was more
> hoping for caching the kernels in memory, so that enqueuing a kernel is
> really as cheap as an indirect function call.

Did you do some coding for this?

I'll implement this as the ViennaCL has test cases which compile (enqueue) the same kernel multiple times (blas3) and the kernels seem to be quite slow to compile. Otherwise it's not sensible to add the ViennaCL tests to the 'make check' suite.

--
Pekka
From: Erik S. <esc...@pe...> - 2012-03-05 15:01:14
Pekka

No, I did not code anything in this respect.

-erik

--
Erik Schnetter <esc...@pe...> http://www.perimeterinstitute.ca/personal/eschnetter/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...