From: Carlos S. de La L. <car...@ur...> - 2011-12-19 12:27:03
>> Again, ELF is used in BIF just as a wrapper, so you can create a
>> ".myownstuff" section and put whatever you want inside. There is no need
>> for it to be in the ".text"
>
> I know. But then, how I see it, ELF is just "abused" as an archive format
> here without actual benefits over generic file archive formats. If you do
> not use the executable and linking related info but treat all contents as
> "unknown binary blobs" with some "custom metadata", what's the point of
> using such a format?

None, it is just a format; any is OK as long as it is simple enough (that's
why I prefer ELF over FatELF, but any other simpler form would be even
better).

> If we use ELF as the "OpenCL binary format" I think we should keep the
> binary program dlopenable (in .text, .rodata, .data) to have the benefits
> from the standard format (at least in operating systems that support ELF).
> This means (AFAIU) compiling the multiple versions of the kernel to the
> same .text with different function names. Or, if something simpler is
> chosen, it could be a simple wrapper with quickly extractable final files
> (even tar could work, or even our own simple format).

Agreed, the directly loadable ELF would be the nicest. But how do we support
this in a way that does not clash with systems where the loadable libraries
are not ELF? One option is to leave binary generation solely to the driver
so it can take care of it, but even some drivers (native/pthread) might have
to be ELF or non-ELF depending on the system (Darwin is not ELF, for
example).

> The alternative could be to define a simplistic custom wrapper format that
> wraps in the final binaries and the bc. It would not care about the format
> of the final binaries (thus the OpenCL binary container *contents* would
> be platform-specific, but the container itself not) and would store enough
> metadata for choosing the correct final binary based on the dimensions.

That was my first thought. I think it is the simplest way. Any format
capable of storing several files/buffers together will do. I said ELF
because BIF is defined as ELF, but any format will do. I suspect AMD chose
ELF because they had to put the ELF libraries there anyway (the binary
inside the .text is again ELF in their case). But I would go for any
sensible format (an existing one; no need to reinvent the wheel with a
custom format).

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 12:05:02
On 12/19/2011 01:43 PM, Carlos Sánchez de La Lama wrote:
> Again, ELF is used in BIF just as a wrapper, so you can create a
> ".myownstuff" section and put whatever you want inside. There is no need
> for it to be in the ".text"

I know. But then, how I see it, ELF is just "abused" as an archive format
here without actual benefits over generic file archive formats. If you do
not use the executable and linking related info but treat all contents as
"unknown binary blobs" with some "custom metadata", what's the point of
using such a format? This adds a dependency on the ELF writing libraries
(think Windows or embedded, for example) even when something much simpler
could suffice.

If we use ELF as the "OpenCL binary format" I think we should keep the
binary program dlopenable (in .text, .rodata, .data) to get the benefits of
the standard format (at least in operating systems that support ELF). This
means (AFAIU) compiling the multiple versions of the kernel into the same
.text with different function names. Or, if something simpler is chosen, it
could be a simple wrapper with quickly extractable final files (even tar
could work, or even our own simple format).

The alternative could be to define a simplistic custom wrapper format that
wraps in the final binaries and the bc. It would not care about the format
of the final binaries (thus the OpenCL binary container *contents* would be
platform-specific, but the container itself not) and would store enough
metadata for choosing the correct final binary based on the dimensions.

-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 11:43:43
>> Also the final binary, optionally.
>
> OK. In our case the kernel just might have multiple versions for
> the multiple dimensions in the .text section. Should work...

Or even multiple versions of the binary in different sections.

>>> The OpenCL API for fetching and loading the program binaries is
>>> multi-device. Thus the format should not be tied to an architecture as
>>> it can contain the same kernels compiled for multiple devices.
>>
>> What does this mean?
>
> I think it means that for example in the case of AMD you could have
> the CPU and the GPU (device) versions of the program in the same
> (OpenCL) binary. I see from the specs that they do not support this but
> store only the GPU or CPU bits, not both:
>
> "By default, OpenCL generates a binary that has LLVM IR, AMD IL, and the
> executable for the GPU (.llvmir, .amdil, and .text sections), as well as
> LLVM IR and the executable for the CPU (.llvmir and .text sections)."

Given that our LLVM IR format can be linked to a device-dependent
architecture, we are not going to support binary retargeting anyway, so we
should not bother about that.

> ELF has only one architecture-specific .text section, IIUC, so it would
> not work for this.

Again, ELF is used in BIF just as a wrapper, so you can create a
".myownstuff" section and put whatever you want inside. There is no need
for it to be in the ".text".

>> Any other option (tar/zip/ELF/whatever) would do the same, but as this
>> is documented and used in an OpenCL SDK I would suggest doing the same.
>
> I do not consider the main advantage to be that it's used by AMD. But
> in case it can be used as a directly dlopenable program binary then it's
> a real advantage (BTW on MacOS or at least Windows we might need
> something else then?). It would avoid the objcopy step in case the
> binary contains a kernel version suitable for launching directly for the
> given dimensions... probably a small saving but still a nifty thing to
> have.

If we want to be able to dlopen the binary directly then we need something
like this FatELF... but as you found out, the project seems to be half
dead; dlopen is not going to support FatELF binaries on almost any system,
so we would end up with more stuff to fix ourselves. I would go for the
"keep it simple" way.

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 11:19:37
On 12/19/2011 12:59 PM, Carlos Sánchez de La Lama wrote:
> Also the final binary, optionally.

OK. In our case the kernel just might have multiple versions for the
multiple dimensions in the .text section. Should work...

>> The OpenCL API for fetching and loading the program binaries is
>> multi-device. Thus the format should not be tied to an architecture as
>> it can contain the same kernels compiled for multiple devices.
>
> What does this mean?

I think it means that for example in the case of AMD you could have the CPU
and the GPU (device) versions of the program in the same (OpenCL) binary. I
see from the specs that they do not support this but store only the GPU or
CPU bits, not both:

"By default, OpenCL generates a binary that has LLVM IR, AMD IL, and the
executable for the GPU (.llvmir, .amdil, and .text sections), as well as
LLVM IR and the executable for the CPU (.llvmir and .text sections)."

ELF has only one architecture-specific .text section, IIUC, so it would not
work for this. Anyway, we can add a separate wrapper for the multi-device
case on top of this (or use FatELF) later, if we see the need.

http://icculus.org/fatelf/

"FatELF lets you pack binaries into one file, separated by OS ABI, OS ABI
version, byte order and word size, and most importantly, CPU architecture."

> Any other option (tar/zip/ELF/whatever) would do the same, but as this
> is documented and used in an OpenCL SDK I would suggest doing the same.

I do not consider the main advantage to be that it's used by AMD. But in
case it can be used as a directly dlopenable program binary then it's a
real advantage (BTW on MacOS or at least Windows we might need something
else then?). It would avoid the objcopy step in case the binary contains a
kernel version suitable for launching directly for the given dimensions...
probably a small saving but still a nifty thing to have.

-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 11:00:05
>> I think we can reuse the AMD one (it just uses ELF sections for
>> different binaries) as it is well specified.
>
> IIRC, AMD's ELF files store the LLVM bitcode and the AMD-IL (both IRs),
> not the final binary, right?

Also the final binary, optionally.

> The OpenCL API for fetching and loading the program binaries is
> multi-device. Thus the format should not be tied to an architecture as it
> can contain the same kernels compiled for multiple devices.

What does this mean?

> I'm not sure if ELF provides some benefits for this scenario. We
> basically need a format that stores multiple independent (sometimes ELF?)
> binaries for the compiled program(s) and the LLVM bitcode as a separate
> one. The compiled programs should be quickly loadable by the dynamic
> linker, preferably without extracting them first to a separate file. Can
> ELF support this nicely? The stored binaries (same kernels compiled for
> multiple dimensions) might contain clashing symbols, AFAIU, so they
> should be stored as "binary blob sections" in ELF, not as "program
> sections" with linkage/relocation info? You know ELF better than I do...

ELF is used as a trick here. It is not real "ELF". You just take the
binary/bytecode/source file/whatever and put it into an ELF section. No
relocations or anything at that level; the ELF is used only as a way to put
all the files together. If some of the binaries are ELF, then the whole ELF
would be inside a section (you would need to "objcopy" it out, and then you
would get a real ELF).

http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Any other option (tar/zip/ELF/whatever) would do the same, but as this is
documented and used in an OpenCL SDK I would suggest doing the same.

> What about FatELF?

If it provides advantages in our use case and it is widespread enough, why
not. But if it does not, then using ELF (not real ELF, but ELF as in the
AMD SDK "BIF" format) is probably easier to implement (just pack the files
together).

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 10:36:52
On 12/19/2011 12:18 PM, Carlos Sánchez de La Lama wrote:
> I think we can reuse the AMD one (it just uses ELF sections for different
> binaries) as it is well specified.

IIRC, AMD's ELF files store the LLVM bitcode and the AMD-IL (both IRs), not
the final binary, right?

The OpenCL API for fetching and loading the program binaries is
multi-device. Thus the format should not be tied to an architecture, as it
can contain the same kernels compiled for multiple devices.

I'm not sure if ELF provides some benefits for this scenario. We basically
need a format that stores multiple independent (sometimes ELF?) binaries
for the compiled program(s) and the LLVM bitcode as a separate one. The
compiled programs should be quickly loadable by the dynamic linker,
preferably without extracting them first to a separate file. Can ELF
support this nicely? The stored binaries (same kernels compiled for
multiple dimensions) might contain clashing symbols, AFAIU, so they should
be stored as "binary blob sections" in ELF, not as "program sections" with
linkage/relocation info? You know ELF better than I do...

What about FatELF?
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format#FatELF:_Universal_Binaries_for_Linux

Let's see...

-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 10:18:57
>>> Although I'd like to cache the final binary, not only the bitcode
>
> I meant to (also) cache (or store in the binary format) the code gen
> results, i.e., the final bits.

OK, I had actually missed that. Then some binary format is needed; I think
we can reuse the AMD one (it just uses ELF sections for different binaries)
as it is well specified.

> All in all, I don't think there are many cases when you do *not* want
> caching (even over multiple OpenCL program runs). If it works "behind the
> scenes" like ccache (and includes the pocl+LLVM versions in the hash) it
> should always be beneficial. Disk space is cheap.

Binary caching is probably always OK. I was referring to storing replicated
WG functions in the same LLVM IR module when I said it has drawbacks.

> I think clCreateProgramWithBinary can also contain all this functionality
> for manual caching:
>
> "The program binary can consist of either or both:
> Device-specific code and/or,
> Implementation-specific intermediate representation (IR) which will be
> converted to the device-specific code."

Yep, the binary format is really implementation dependent, so you can put
whatever you want there. The point is that in our case we always need the
LLVM IR, because while the OpenCL API allows storing only the binary for a
device (the AMD SDK can do this, and gives an "invalid device" error when
loading a binary for a different architecture), in our case storing the
binary would mean fixing also the dimensions, which is not API compliant in
the general case AFAIK.

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-19 09:21:43
Hi,

I think you missed this:

>> Although I'd like to cache the final binary, not only the bitcode

I meant to (also) cache (or store in the binary format) the code gen
results, i.e., the final bits. As you know, for TCE it might take
considerable time to generate the code from the bitcode.

Also, in general, in a production system that uses OpenCL for the program,
we do not want to execute the compiler at all if we can avoid it. We might
even exclude the compiler from the host to provide something like the
"standalone mode" (to ship binaries only) but in a more standards-compliant
way. Useless overhead is useless overhead. It's especially the case for
mobile devices with energy constraints.

All in all, I don't think there are many cases when you do *not* want
caching (even over multiple OpenCL program runs). If it works "behind the
scenes" like ccache (and includes the pocl+LLVM versions in the hash) it
should always be beneficial. Disk space is cheap.

I think clCreateProgramWithBinary can also contain all this functionality
for manual caching:

"The program binary can consist of either or both:
Device-specific code and/or,
Implementation-specific intermediate representation (IR) which will be
converted to the device-specific code."

"OpenCL allows applications to create a program object using the program
source or binary and build appropriate program executables. This can be
very useful as it allows applications to load program source and then
compile and link to generate a program executable online on its first
instance for appropriate OpenCL devices in the system. These executables
can now be queried and cached by the application. Future instances of the
application launching will no longer need to compile and link the program
executables. The cached executables can be read and loaded by the
application, which can help significantly reduce the application
initialization time."

BR,
-- Pekka
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 09:02:10
> We currently do this in C++, and I want to port this code to OpenCL.
> Unconditional inlining of all functions would not be good for this
> application. Would it be possible to skip functions that don't call a
> get_*() function, or to skip inlining functions marked "noinline"?

It is; that is why I removed the forced inlining. The passes do not
strictly require that the kernel is fully inlined (in fact, inlining is now
done by LLVM with its own criteria). What needs to be fixed (as per your
bug report) is to always inline calls leading to one of those get_xxx().

> Instead of privatizing the code for each thread, is it possible to
> privatize these variables on which the get_*() functions are based? With
> hyperthreading or modern AMD processors, it can be beneficial to have
> several threads executing the same code, even if some expressions cannot
> be evaluated at build time.

No, those variables need to be different for each workgroup, so we cannot
make them global (multiple workgroups might be running in parallel in
threaded environments). The only way around this is using a context
structure that gets passed to all subfunctions, but old passes used to work
like that and, in general, the generated code is much worse due to the
loads and stores to that structure.

Remember there is no threading "within" the workgroup. Threads are created
for different workgroups, but not for different work-items of the same
workgroup.

Carlos
From: Carlos S. de La L. <car...@ur...> - 2011-12-19 08:48:51
Hi,

moved this from the bug report comments (better to discuss on the list, I
think).

>> How many kernels are cached? E.g. in pthread.c, there is an if statement
>> "if (d->current_kernel != kernel)", as if only one kernel was cached. If
>> that is so, would that be the right place to introduce a larger cache?

Only one right now. A larger cache is needed, I agree.

> Yes, Carlos. The cached result depends on the dimensions so saving
> multiple versions could work. Although I'd like to cache the final
> binary, not only the bitcode, to save all the compilation costs. This
> will be useful especially in the future when TCE is used in a proper
> host-device configuration, in the embedded/mobile systems that really
> want to save all useless work, and also in my planned research wrt.
> OpenCL to FPGA. So we might need to create a new simple binary format
> with multiple target binaries inside + some metadata (for example to
> save the dimensions).

I think the best way is, instead of defining a binary format (I was
originally thinking of an ELF with different binaries in different
sections, like the AMD SDK does), probably just to use the BC. The
workgroup function would be created with a different name (probably
including the dimensions, so the info is there) and the OpenCL-related
metadata is not touched, so it still points to the original kernel. The
passes need slight modifications to handle this, but if we update the
"binary image" of the kernel at that point, then we would have what we
want.

Drawbacks would be:
1) A little unneeded delay from the caching code in case no caching is
   wanted.
2) The binary code might grow quite big.

It would be nice if there was a way to enable/disable the caching, more or
less complying with the standard. Is there a way to define host-side
extensions?

Carlos
From: Carlos S. de La L. <car...@ur...> - 2011-12-16 15:43:09
Yep, the item is stored in fp16 format inside the i16, of course... I
thought it would work since you can have a target-independent f16 to f32
conversion, but that requires assuming the storage format is IEEE FP.

Anyway, pocl-wide it is enough as it is now, IMHO: if LLVM supports codegen
for halfs on a target then the kernel library uses it, otherwise it does
not.

Carlos

On Fri, 2011-12-16 at 10:18 -0500, Erik Schnetter wrote:
> The intrinsics do not work -- tried 3.0 and trunk. By looking at the
> code, I believe that this conversion intrinsic is only defined for ARM.
>
> The i16 contains a bit pattern representing the fp16 value; it cannot be
> interpreted as an integer value. It seems to me that using i16 is purely
> a hack to avoid introducing a new (and very limited) LLVM datatype,
> because by using an i16 one ensures that load/store etc. work correctly.
>
> -erik
From: Erik S. <esc...@pe...> - 2011-12-16 15:18:57
On Fri, Dec 16, 2011 at 7:03 AM, Carlos Sánchez de La Lama wrote:
> Just for clarification:
>
> There is no fp16 type in LLVM, at all, neither for computation nor
> storage. It is not defined in the LLVM assembly language.
>
> What clang does is generate i16 (integer) values for halfs, and convert
> i16 to floats before operating with them. The LLVM intrinsics convert
> i16 <-> float; there is no fp16 type. I would expect therefore those
> intrinsics to work on all the LLVM codegen targets (int to float works,
> so this should also).

The intrinsics do not work -- tried 3.0 and trunk. By looking at the code,
I believe that this conversion intrinsic is only defined for ARM.

The i16 contains a bit pattern representing the fp16 value; it cannot be
interpreted as an integer value. It seems to me that using i16 is purely a
hack to avoid introducing a new (and very limited) LLVM datatype, because
by using an i16 one ensures that load/store etc. work correctly.

-erik

--
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Carlos S. de La L. <car...@ur...> - 2011-12-16 11:59:57
Just for clarification:

There is no fp16 type in LLVM, at all, neither for computation nor storage.
It is not defined in the LLVM assembly language.

What clang does is generate i16 (integer) values for halfs, and convert i16
to floats before operating with them. The LLVM intrinsics convert i16 <->
float; there is no fp16 type. I would expect therefore those intrinsics to
work on all the LLVM codegen targets (int to float works, so this should
also).

From the pocl kernel library side I think the current way is correct: if
halfs are not mandatory in OpenCL, check for "half" support in the compiler
and activate the extension if it is there. Support meaning not only that
the compiler "eats" the keyword but also that its size is as the standard
defines it. So, as it is done.

To map halfs to real half-operating hardware, LLVM-side changes would be
needed to add half as a real type, as right now it is not possible (without
a lot of def-use chain analysis) to determine whether a float comes from a
"half/i16" or is a real float.

BR
Carlos

On Thu, 2011-12-15 at 14:32 -0500, Erik Schnetter wrote:
> The conversion intrinsics exist in LLVM, and are implemented in some of
> its backends. To my knowledge, currently only the ARM backend supports
> it, presumably via a machine instruction (or maybe via a sequence of
> machine instructions). Other backends will report an error when the LLVM
> code is lowered to machine code -- that is (as I had to find out),
> libkernel.a will build fine on all architectures, but the respective
> functions cannot be used.
>
> As you say, it should not be difficult to implement this generically for
> all other platforms, either in pocl, or (better) in LLVM. This may be
> slow, but the memory savings (in particular also if this is stored in a
> file) may make the slow conversion worthwhile for some applications.
>
> -erik

_______________________________________________
Pocl-devel mailing list
Poc...@li...
https://lists.sourceforge.net/lists/listinfo/pocl-devel
From: Erik S. <esc...@pe...> - 2011-12-15 19:32:09
The conversion intrinsics exist in LLVM, and are implemented in some of its
backends. To my knowledge, currently only the ARM backend supports it,
presumably via a machine instruction (or maybe via a sequence of machine
instructions). Other backends will report an error when the LLVM code is
lowered to machine code -- that is (as I had to find out), libkernel.a will
build fine on all architectures, but the respective functions cannot be
used.

As you say, it should not be difficult to implement this generically for
all other platforms, either in pocl, or (better) in LLVM. This may be slow,
but the memory savings (in particular also if this is stored in a file) may
make the slow conversion worthwhile for some applications.

-erik

2011/12/15 Pekka Jääskeläinen:
> On 12/15/2011 09:04 PM, Erik Schnetter wrote:
>> supports the half datatype
>
> How do you think so?
> http://llvm.org/docs/LangRef.html#t_floating
>
> As I wrote, I think it's supported only via those two conversion
> intrinsics:
> http://llvm.org/docs/LangRef.html#int_fp16
>
> My question was who implements those intrinsics (fp32 to fp16 to fp32),
> as they require some bit manipulation of the fp fields, AFAIK (extract
> mantissa, exponent, sign and put them back in the destination), and it's
> unlikely the hardware has direct instructions for such a conversion. Do
> you mean that the default lowering of those intrinsics produces the
> conversion code too?
>
> Using native halfs would mean that one can use smaller adders,
> multipliers, shifters etc. in the FPUs, which means energy savings in
> low-power designs (less switching activity). Too bad it seems not to be
> supported yet in LLVM, AFAIU.
>
> -- Pekka
From: Pekka J. <pek...@tu...> - 2011-12-15 19:17:54
On 12/15/2011 09:04 PM, Erik Schnetter wrote:
> supports the half datatype

How do you think so?
http://llvm.org/docs/LangRef.html#t_floating

As I wrote, I think it's supported only via those two conversion
intrinsics:
http://llvm.org/docs/LangRef.html#int_fp16

My question was who implements those intrinsics (fp32 to fp16 to fp32), as
they require some bit manipulation of the fp fields, AFAIK (extract
mantissa, exponent, sign and put them back in the destination), and it's
unlikely the hardware has direct instructions for such a conversion. Do you
mean that the default lowering of those intrinsics produces the conversion
code too?

Using native halfs would mean that one can use smaller adders, multipliers,
shifters etc. in the FPUs, which means energy savings in low-power designs
(less switching activity). Too bad it seems not to be supported yet in
LLVM, AFAIU.

-- Pekka
From: Erik S. <esc...@pe...> - 2011-12-15 19:05:02
Since LLVM supports the half datatype (I believe it was added about nine
months ago, especially for OpenCL), it generates these conversions itself.

I just updated (read: corrected) the autoconf rules that determine whether
half is supported. It seems that it is currently only supported on ARM.
Other platforms will have this code disabled. I will push this to my branch
soon.

To my knowledge, the "traditional" implementation of half would be to
perform all arithmetic operations in float precision, except possibly
expensive iterative operations (divide, sqrt), where fewer iterations may
be used than for float.

-erik

2011/12/15 Pekka Jääskeläinen:
> On 12/15/2011 06:24 PM, Erik Schnetter wrote:
>> OpenCL supports only two operations for halfs: vload_half, converting
>> it to a float, and vstore_half, converting from a float. Nothing else
>> exists explicitly, not even vectors of halfs. Essentially the only
>> thing one can do with the half type is to pass a half* to these
>> load/store routines.
>
> OK, interesting. Who then generates the float <-> half conversion code
> for the LLVM intrinsics? Does LLVM generate it automatically or do we
> need to provide conversion routines in pocl?
>
> -- Pekka
From: Pekka J. <pek...@tu...> - 2011-12-15 16:44:55
On 12/15/2011 06:24 PM, Erik Schnetter wrote:
> OpenCL supports only two operations for halfs: vload_half, converting it
> to a float, and vstore_half, converting from a float. Nothing else
> exists explicitly, not even vectors of halfs. Essentially the only thing
> one can do with the half type is to pass a half* to these load/store
> routines.

OK, interesting. Who then generates the float <-> half conversion code for
the LLVM intrinsics? Does LLVM generate it automatically or do we need to
provide conversion routines in pocl?

-- Pekka
From: Erik S. <esc...@pe...> - 2011-12-15 16:24:51
|
OpenCL supports only two operations for halfs: vload_half, converting it
to a float, and vstore_half, converting from a float. Nothing else exists
explicitly, not even vectors of halfs. Essentially the only thing one can
do with the half type is to pass a half* to these load/store routines.

There are routines such as float half_sin(float) that are only required to
have the precision offered by datatype half (allowing optimisations), but
the API is via float. There is text in the standard presumably allowing
this to be optimised to use operations that act directly on half values,
but this is not required.

I added code to detect whether clang supports half (called __fp16 in C),
and if so, these vload_half/vstore_half routines are available. half_sin
and friends are always available, forwarding to their float counterparts
by default -- I assume that target-specific optimisations can do better.

-erik

2011/12/15 Pekka Jääskeläinen <pek...@tu...>

> On 12/15/2011 01:09 AM, Erik Schnetter wrote:
> > Erik Schnetter has proposed merging lp:~schnetter/pocl/main into
> > lp:pocl.
> >
> > Requested reviews: pocl maintainers (pocl)
> >
> > For more details, see:
> > https://code.launchpad.net/~schnetter/pocl/main/+merge/85761
> >
> > I added support for the half datatype, protected by #ifdef cl_khr_fp16,
> > analogous to cl_khr_fp64. I don't know which targets support this
> > datatype (presumably all, since llvm supports them?), so I enabled this
> > for all targets -- this will break things if this is wrong.
>
> Just curious...
>
> How does LLVM/Clang support the half by default nowadays? I've heard
> that for NVIDIA GPUs, for example, the half is supported only as a
> storage format. That is, you have the float in 16bit format in memory
> but whenever you compute something with halfs, they are converted to
> single precision floats to avoid the need for separate floating point
> units for halfs.
>
> Just curious to hear what happens when you use half floats in LLVM/Clang
> now -- do they convert them to single precision fp whenever computation
> occurs? The last time I checked, 'half' was not a datatype in the LLVM
> IR thus they could not be selected (to be implemented with the target
> ISA) nicely.
>
> It seems there are only two intrinsics for halfs available:
> http://llvm.org/docs/LangRef.html#int_fp16
>
> Does Clang generate those automatically for halfs in OpenCL C now? For
> example if you perform a basic operation halfA + halfB, what happens?
>
> I'm interested in a proper half support as for embedded/mobile it is
> more beneficial than just for saving the memory bandwidth as you can
> save in the area of the FPU, improve the speed, lower the energy
> consumption, etc. if you can do with half floats for your computations.
> But I think they do not accept it as a proper datatype in LLVM before
> there is a real (read: off-the-shelf) target in LLVM that supports it
> natively.
>
> --
> --Pekka
>
> _______________________________________________
> Pocl-devel mailing list
> Poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel

--
Erik Schnetter <esc...@pe...>   http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
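The storage-format behaviour discussed in this thread -- halfs live in memory as 16 bits while all arithmetic happens in float -- boils down to a bit-level decode on load. As a rough sketch of what a software vload_half fallback could look like (the function name and structure are my assumptions for illustration, not actual pocl code):

```c
#include <stdint.h>
#include <math.h>

/* Decode an IEEE 754 binary16 ("half") bit pattern into a float.
   Illustrates the "storage-only" half support: the 16-bit value is
   widened to float before any computation takes place. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (h >> 15) & 1u;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    float f;

    if (exp == 0) {
        /* Zero or subnormal: value = mant * 2^-24 */
        f = (float)mant / 16777216.0f;          /* 16777216 = 2^24 */
    } else if (exp == 31) {
        /* All-ones exponent: infinity (mant == 0) or NaN */
        f = mant ? NAN : INFINITY;
    } else {
        /* Normal: value = (1 + mant/1024) * 2^(exp - 15) */
        f = 1.0f + (float)mant / 1024.0f;
        int e = (int)exp - 15;
        while (e > 0) { f *= 2.0f; e--; }       /* exact scaling by */
        while (e < 0) { f *= 0.5f; e++; }       /* powers of two    */
    }
    return sign ? -f : f;
}
```

vstore_half would need the inverse encoding plus a rounding-mode choice (the _rte/_rtz/_rtp/_rtn suffixes in the OpenCL spec).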
From: Carlos S. de La L. <car...@ur...> - 2011-12-15 12:36:02
>> I'm finding that many of the object allocation/deallocation routines
>> are not careful about allocation and freeing memory. I think that some
>> reference counting may be necessary, but this doesn't seem to be
>> employed consistently.

I am aware of this... it is a consequence of always going too fast. But I
agree we need to get it back to a controlled state ASAP.

> I'm not sure how it's best to implement the reference counting
> generically in C to avoid code duplication for all the different
> structure types. Probably through some set of cpp macros. E.g.
> POCL_RELEASE(_OBJ) which decrements the ref count and if it goes to
> zero, frees it and POCL_RETAIN(_TYPE, _OBJ) which does the opposite.
> These macros then would assume the struct has an unsigned _ref_count
> member. Or similar.

Sounds ok to me.

>> Should we add "magic markers" into the objects, which would help
>> identify routines accessing objects that have been freed? I am thinking
>> in particular of program, kernel, and mem objects, i.e. adding such
>> markers to their declarations in pocl_cl.h, and checking them in
>> various places.
>
> Valgrind should be able to spot these. In my opinion we should just run
> the test cases in valgrind and fix the leaks and references to the freed
> objects it finds instead of adding some runtime overhead (and clutter)
> to the code.

Yep, I agree with Pekka... I think the less code complexity the better
(my usual point). And valgrind for sanitizing the current leaking code.

BR

Carlos
From: Pekka J. <pek...@tu...> - 2011-12-15 08:06:13
On 12/15/2011 01:09 AM, Erik Schnetter wrote:
> Erik Schnetter has proposed merging lp:~schnetter/pocl/main into lp:pocl.
>
> Requested reviews: pocl maintainers (pocl)
>
> For more details, see:
> https://code.launchpad.net/~schnetter/pocl/main/+merge/85761
>
> I added support for the half datatype, protected by #ifdef cl_khr_fp16,
> analogous to cl_khr_fp64. I don't know which targets support this
> datatype (presumably all, since llvm supports them?), so I enabled this
> for all targets -- this will break things if this is wrong.

Just curious...

How does LLVM/Clang support the half by default nowadays? I've heard that
for NVIDIA GPUs, for example, the half is supported only as a storage
format. That is, you have the float in 16bit format in memory but whenever
you compute something with halfs, they are converted to single precision
floats to avoid the need for separate floating point units for halfs.

Just curious to hear what happens when you use half floats in LLVM/Clang
now -- do they convert them to single precision fp whenever computation
occurs? The last time I checked, 'half' was not a datatype in the LLVM IR
thus they could not be selected (to be implemented with the target ISA)
nicely.

It seems there are only two intrinsics for halfs available:
http://llvm.org/docs/LangRef.html#int_fp16

Does Clang generate those automatically for halfs in OpenCL C now? For
example if you perform a basic operation halfA + halfB, what happens?

I'm interested in a proper half support as for embedded/mobile it is more
beneficial than just for saving the memory bandwidth as you can save in
the area of the FPU, improve the speed, lower the energy consumption,
etc. if you can do with half floats for your computations. But I think
they do not accept it as a proper datatype in LLVM before there is a real
(read: off-the-shelf) target in LLVM that supports it natively.

--
--Pekka
From: Pekka J. <pek...@tu...> - 2011-12-14 20:21:48
On 12/14/2011 08:50 PM, Erik Schnetter wrote:
> I'm finding that many of the object allocation/deallocation routines are
> not careful about allocation and freeing memory. I think that some
> reference counting may be necessary, but this doesn't seem to be
> employed consistently.

Ah, it seems the OpenCL spec assumes reference counting for its object
types. It includes clRetain*/clRelease* functions for the OpenCL
platform/runtime API struct types.

I'm not sure how it's best to implement the reference counting generically
in C to avoid code duplication for all the different structure types.
Probably through some set of cpp macros. E.g. POCL_RELEASE(_OBJ) which
decrements the ref count and if it goes to zero, frees it and
POCL_RETAIN(_TYPE, _OBJ) which does the opposite. These macros then would
assume the struct has an unsigned _ref_count member. Or similar.

> Should we add "magic markers" into the objects, which would help
> identify routines accessing objects that have been freed? I am thinking
> in particular of program, kernel, and mem objects, i.e. adding such
> markers to their declarations in pocl_cl.h, and checking them in various
> places.

Valgrind should be able to spot these. In my opinion we should just run
the test cases in valgrind and fix the leaks and references to the freed
objects it finds instead of adding some runtime overhead (and clutter) to
the code.

--
--Pekka
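A minimal sketch of the macro approach proposed above, simplified to a single-argument POCL_RETAIN and assuming the unsigned _ref_count member; none of this is actual pocl code:

```c
#include <stdlib.h>

/* Sketch only: names mirror the mailing-list proposal, not pocl's
   actual implementation.  Objects start life with _ref_count == 1. */
#define POCL_RETAIN(obj)  (++(obj)->_ref_count)
#define POCL_RELEASE(obj)                 \
  do {                                    \
    if (--(obj)->_ref_count == 0)         \
      free(obj);                          \
  } while (0)

/* Hypothetical object layout with the assumed _ref_count member. */
typedef struct {
  unsigned _ref_count;
  /* ... object payload ... */
} pocl_obj_t;

static pocl_obj_t *pocl_obj_create(void)
{
  pocl_obj_t *o = calloc(1, sizeof *o);
  if (o)
    o->_ref_count = 1;   /* caller holds the initial reference */
  return o;
}
```

The public clRetain*/clRelease* entry points would then be one-line wrappers around these macros; the do/while(0) wrapper makes POCL_RELEASE behave like a single statement inside if/else.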
From: Erik S. <esc...@pe...> - 2011-12-14 18:50:55
I'm finding that many of the object allocation/deallocation routines are
not careful about allocation and freeing memory. I think that some
reference counting may be necessary, but this doesn't seem to be employed
consistently.

Should we add "magic markers" into the objects, which would help identify
routines accessing objects that have been freed? I am thinking in
particular of program, kernel, and mem objects, i.e. adding such markers
to their declarations in pocl_cl.h, and checking them in various places.

-erik

--
Erik Schnetter <esc...@pe...>   http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
From: Pekka J. <pek...@tu...> - 2011-12-14 18:02:42
Hi,

I added LLVM_3_0 and LLVM_3_1 (and LLVM_SVN) version macro generation to
the configure. Currently only a very minor fix was needed to make pocl
work with the LLVM top-of-tree. These macros can be used to make it work
both with the 3.0 and (upcoming) 3.1 in the future.

It's better we always support the latest released LLVM version in pocl so
let's keep it working with both.

--
Pekka
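A sketch of how those configure-generated macros might be used to guard version-specific code; only the macro names (LLVM_3_0, LLVM_3_1, LLVM_SVN) come from the message, the helper function is made up for illustration:

```c
#include <string.h>

/* Normally configure defines exactly one of LLVM_3_0 / LLVM_3_1 /
   LLVM_SVN; define LLVM_3_0 here so the sketch is self-contained. */
#ifndef LLVM_3_0
#define LLVM_3_0 1
#endif

/* Hypothetical helper selecting an implementation per LLVM version.
   Top-of-tree (LLVM_SVN) is handled like the next release, 3.1. */
#if defined(LLVM_SVN) || defined(LLVM_3_1)
static const char *llvm_version_tag(void) { return "3.1+"; }
#elif defined(LLVM_3_0)
static const char *llvm_version_tag(void) { return "3.0"; }
#endif
```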
From: Pekka J. <pek...@tu...> - 2011-11-15 15:43:36
Hi,

Some interesting additions to the OpenCL standard in the newly released
1.2:

* Built-in kernels:

  "A built-in kernel is a kernel that is executed on an OpenCL device or
  custom device by fixed-function hardware or in firmware. Applications
  can query the built-in kernels supported by a device or custom device.
  A program object can only contain kernels written in OpenCL C or
  built-in kernels but not both. See also Kernel and Program."

  Fits the OpenCL to TTA-ASIP case quite well... one can have highly
  tuned implementations of some kernels embedded in the FPGA/ROM of the
  ASIP and exploit them from the host program in a standard way.

* printf is now in the main specs with specified behavior:

  "The printf built-in function writes output to an
  implementation-defined stream such as stdout under control of the
  string pointed to by format that specifies how subsequent arguments are
  converted for output. If there are insufficient arguments for the
  format, the behavior is undefined. If the format is exhausted while
  arguments remain, the excess arguments are evaluated (as always) but
  are otherwise ignored. The printf function returns when the end of the
  format string is encountered.

  ...When the event that is associated with a particular kernel
  invocation is completed, the output of all printf() calls executed by
  this kernel invocation is flushed to the implementation-defined output
  stream. Calling clFinish on a command queue flushes all pending output
  by printf in previously enqueued and completed commands to the
  implementation-defined output stream..."

  You can query the printf "buffer" size in the device: "Maximum size of
  the internal buffer that holds the output of printf calls from a
  kernel. The minimum value for the EMBEDDED profile is 1 KB."

* Support for a separated linkage phase: storage class specifiers
  'extern' and 'static' are now keywords.

--
Pekka
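The flush-on-completion semantics quoted above can be modeled with a fixed-size per-kernel buffer. This is a toy model only; the names and structure are illustrative inventions, not the pocl or OpenCL API:

```c
#include <stdio.h>
#include <string.h>

/* Kernel printf output accumulates here; the OpenCL 1.2 minimum buffer
   size for the EMBEDDED profile is 1 KB. */
#define PRINTF_BUFFER_SIZE 1024

static char   printf_buffer[PRINTF_BUFFER_SIZE];
static size_t printf_used = 0;

/* Stand-in for a kernel-side printf call: append formatted output to
   the buffer instead of writing it out immediately. */
static int kernel_printf(const char *s)
{
    size_t len = strlen(s);
    if (printf_used + len > PRINTF_BUFFER_SIZE)
        return -1;                 /* buffer exhausted: output dropped */
    memcpy(printf_buffer + printf_used, s, len);
    printf_used += len;
    return 0;
}

/* Stand-in for event completion / clFinish: flush the accumulated
   output to the implementation-defined stream (stdout here). */
static void flush_on_event_complete(void)
{
    fwrite(printf_buffer, 1, printf_used, stdout);
    printf_used = 0;
}
```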
From: Pekka J. <pek...@tu...> - 2011-11-09 14:27:43
On 11/09/2011 03:40 PM, Erik Schnetter wrote:
> The math lib is not the only thing that could be useful. For example,
> printf is a very useful OpenCL extension that should be supported. The
> underlying I/O stream representation probably needs to be implemented
> from scratch, but the formatting code should work fine.

I'm not sure of printf(). Maybe that should use the stdio.h and -lc of
the device because:

1) The actual stdout stream destination is fully platform (OS+device)
   dependent.

2) The "inlining benefits" do not apply to it as it's probably only used
   for debug printouts. Inter-WI DLP does not matter here.

3) It's an optional extension. In case the target does not support it,
   the target just doesn't advertise it as a vendor extension.

--
Pekka