From: Carlos S. de La L. <car...@ur...> - 2011-10-24 10:46:00
Hi all,

I have been thinking about how to implement the kernel library for different devices, and some related issues.

Right now, the pocl flow goes like this:

    Compilation: .cl to .bc (bytecode)
                  |
                  V
    Linking with kernel library
                  |
                  V
    Fully inlining
                  |
                  V
    Workgroup creation (replicate workitems)
                  |
                  V
    Device-dependent driver

Workgroup creation needs to detect barriers, which is why it has to be done after fully inlining (there can be barriers in a function called by the kernel, not only in the kernel itself).

One desirable thing is for the bytecode to stay device independent as long as possible, up to the device driver if possible, so we do not have to store several binaries on the host (there might be some unavoidable dependencies, but I think that, given OpenCL's restricted C support, those will be minor). Then there are two possibilities:

1) Make the kernel library runtime compatible with all devices. This was the planned approach. It can be done by selecting the implementation for a device using runtime conditionals (C-level ifs) instead of preprocessor ones (#if/#ifdef). LLVM should then eliminate the dead code when generating the final binary.

2) Perform inlining and replication before linking. Only a minor part of the kernel library (get_xxx_id() and friends) needs to be linked before WG creation, and that part is going to be common to all devices because it depends on the replication passes. But the big "functional" kernel runtime library could be linked later, even in device-dependent binary form instead of bytecode form, allowing the use of different kernel libraries for different devices. This would have the additional advantage of smaller bytecode and faster code generation.

Thoughts?

Carlos
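Possibility 1 above can be pictured with a minimal C sketch. Everything named here (`device_kind`, `target_device`, `kernel_dot4` and the two helpers) is hypothetical and not taken from pocl; the only assumption is that the device is a compile-time constant by the time the optimizer runs, so the untaken branch can be folded away.

```c
#include <assert.h>

/* Hypothetical device identifiers; not pocl's actual enumeration. */
enum device_kind { DEVICE_GENERIC = 0, DEVICE_SIMD4 = 1 };

/* Assumption: the device is known as a constant before optimization,
   so the comparison below can be folded by the compiler. */
static const enum device_kind target_device = DEVICE_SIMD4;

/* Generic implementation: plain sequential accumulation. */
static float dot4_generic(const float *a, const float *b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Stand-in for a device-tuned variant: pairwise accumulation. */
static float dot4_simd4(const float *a, const float *b) {
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
}

/* Library entry point: selection uses a C-level if, not #if/#ifdef.
   Both branches must compile for every target. */
float kernel_dot4(const float *a, const float *b) {
    assert(a != 0 && b != 0);
    if (target_device == DEVICE_SIMD4)
        return dot4_simd4(a, b);
    return dot4_generic(a, b);
}
```

Because the condition is an ordinary expression, the same source can be compiled to bytecode once; the cost is that every branch has to be valid code for every device.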
From: Erik S. <esc...@pe...> - 2011-10-24 13:26:40
2011/10/24 Carlos Sánchez de La Lama <car...@ur...>:
> Hi all,
>
> I have been thinking about how to implement the kernel library for
> different devices, and some related issues.
>
> Right now, pocl flow goes like this:
>
> Compilation: .cl to .bc (bytecode)
>               |
>               V
> Linking with kernel library
>               |
>               V
> Fully inlining
>               |
>               V
> Workgroup creation (replicate workitems)
>               |
>               V
> Device-dependent driver
>
> Workgroup creation needs to detect barriers, that's why it needs to be
> done after fully inlining (there can be barriers in a function called by
> the kernel, not in the kernel itself).
>
> One desirable thing is bytecode to be device independent as long as
> possible, until device driver if possible, so we do not have to store
> several binaries in the host (there might be some unavoidable
> dependencies, but I think given OpenCL restricted C support those will
> be minor). Then there are two possibilities:
>
> 1) Make kernel library runtime compatible with all devices. This was the
> planned approach, it can be done by selecting the implementation for a
> device using runtime conditional (C-level ifs) instead of preprocessor
> ones (#if/#ifdefs). LLVM should then eliminate dead code when generating
> the final binary.

This does not allow hardware-dependent optimisations. For example, certain function calls / assembler instructions are only available on certain hardware, and lead to syntax errors on others.

Now, it would be nice if these were not necessary. This would require providing them via LLVM, i.e. implementing the OpenCL run time (e.g. sin, cos, sqrt, their vectorised versions, etc.) in LLVM instead of in POCL. That may be a good idea overall, since this would make these functions available to a larger audience, and would simplify POCL. I don't know whether the LLVM project would be open to such extensions, though; I have not checked.
-erik

--
Erik Schnetter <esc...@pe...> http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
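The objection about hardware-only instructions can be illustrated with a short C sketch. The names `kernel_sqrt` and `sqrt_generic` are hypothetical, and the inline assembly assumes a GCC- or Clang-compatible compiler targeting x86 with SSE; neither is taken from pocl. The point is that the `sqrtss` line has to sit behind a preprocessor guard: behind a plain C `if` it would still be handed to the assembler of every target, including those that have no such instruction.

```c
#include <assert.h>

/* Portable fallback: Newton's iteration. A sketch that is adequate for
   small and moderate inputs, not a production-quality sqrt. */
float sqrt_generic(float x) {
    if (x == 0.0f)
        return 0.0f;
    float r = x > 1.0f ? x : 1.0f;      /* starting guess >= sqrt(x) */
    for (int i = 0; i < 32; ++i)
        r = 0.5f * (r + x / r);
    return r;
}

float kernel_sqrt(float x) {
    assert(x >= 0.0f);
#if defined(__SSE__) && (defined(__GNUC__) || defined(__clang__))
    /* x86-only path: this line is never seen by other targets. */
    float r;
    __asm__("sqrtss %1, %0" : "=x"(r) : "x"(x));
    return r;
#else
    return sqrt_generic(x);
#endif
}
```

The guard is resolved at compile time, which is exactly what ties the resulting bytecode to one target.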
From: Carlos S. de La L. <car...@ur...> - 2011-10-24 14:05:31
> > 1) Make kernel library runtime compatible with all devices. This was the
> > planned approach, it can be done by selecting the implementation for a
> > device using runtime conditional (C-level ifs) instead of preprocessor
> > ones (#if/#ifdefs). LLVM should then eliminate dead code when generating
> > the final binary.
>
> This does not allow hardware-dependent optimisations. For example,
> certain function calls / assembler instructions are only available on
> certain hardware, and lead to syntax errors on others.

After some rethinking and discussion, it is probably better to forget about bytecode device independence. Bytecode is not designed to work that way, and bytecode compiled for different targets already differs right after clang. So it will be different bytecodes inside the API structures and a different library for each device.

One interesting point is that if we want proper "vectorization" (once that is working) of the library code as well, the library needs to be compiled into bytecode and linked at bytecode level, before WG generation. LLVM assembly can carry inline target assembly, though, so this should not be a problem.

One pending task is thus to organize the library Makefiles in a way that allows each target to override generic implementations with target-dependent ones.

> Now, it would be nice if these were not necessary. This would require
> providing them via LLVM, i.e. implementing the OpenCL run time (e.g.
> sin, cos, sqrt, their vectorised versions, etc.) in LLVM instead of in
> POCL.

It would simplify our job, but every LLVM backend would then have to implement all those functions, which would make backend implementation more complex. I doubt that the LLVM project will go that way.

BR,

Carlos
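The override scheme mentioned above could equally be achieved by having the Makefiles pick source files per target. As a rough single-file C analogue, the sketch below uses guard macros instead. All names (`TARGET_HAS_MAD`, `TARGET_HAS_CLAMP`, `kernel_mad`, `kernel_mad_impl`, `kernel_clamp`) are hypothetical and only illustrate the idea that a target supplies some functions and inherits the generic version of the rest.

```c
#include <assert.h>

/* ---- Target part: would normally live in a per-target file. ---- */
/* Hypothetical macro announcing that this target ships its own mad(). */
#define TARGET_HAS_MAD 1

#ifdef TARGET_HAS_MAD
/* Placeholder body; a real target might use a fused instruction here. */
float kernel_mad(float a, float b, float c) { return a * b + c; }
const char *kernel_mad_impl(void) { return "target"; }
#endif

/* ---- Generic part: compiled only where no override exists. ---- */
#ifndef TARGET_HAS_MAD
float kernel_mad(float a, float b, float c) { return a * b + c; }
const char *kernel_mad_impl(void) { return "generic"; }
#endif

/* This target defines no TARGET_HAS_CLAMP, so the generic one is used. */
#ifndef TARGET_HAS_CLAMP
float kernel_clamp(float x, float lo, float hi) {
    assert(lo <= hi);
    return x < lo ? lo : (x > hi ? hi : x);
}
#endif
```

Each target's library is then built as its own bytecode file, in line with the decision to drop device-independent bytecode.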