From: Tom S. <tst...@gm...> - 2010-04-03 07:33:33
|
Hi,

I have completed a first draft of my Google Summer of Code proposal, and I would appreciate feedback from some of the Mesa developers. I have included the project plan from my proposal in this email, and you can also view my full proposal here:
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/tstellar/t126997450856
However, I think you will need a Google login to view it.

Project Tasks:

1. Enable branch emulation for Gallium drivers:
The goal of this task is to create an optional "optimization" pass over the TGSI code that translates branch instructions into instructions supported by cards without hardware branching. The basic strategy for this translation is:

A. Copy the values of in-scope variables to a temporary location before executing the conditional statement.

B. Execute the "if true" branch.

C. Test the conditional expression. If it evaluates to false, roll back all values that were modified in the "if true" branch.

D. Repeat steps B and C with the "if false" branch, but this time only roll back if the conditional expression evaluates to true.

The TGSI instructions SLT, SNE, SGE, and SEQ will be used to test the conditional expression, and the instruction CND will be used to roll back the values. (A minimal sketch illustrating this flattening appears after this message.)

There will be two phases to this task. For phase 1, I will implement a simple translator that can translate the branch instructions with only one pass through the TGSI code. This simple translator will copy all in-scope variables to a temporary location before executing the conditional statement, even if those variables will not be modified in either of the branches.

Phase 2 will add a preliminary pass before the code translation pass that marks variables that might be modified by the conditional statement. Then, during the translation pass, only the variables that could potentially be modified inside either of the conditional branches will be copied before the conditional statement is executed.

2. Unroll loops for Gallium drivers:
The goal of this task is to unroll loops so that they can be executed by hardware that does not support them. The loop unrolling will be done in the same "optimization" pass as the branch emulation. Loops where the number of iterations is known at compile time will be unrolled and may have additional optimizations applied. Loops with an unknown number of iterations will have to be studied to see if there is a way to replace the loop with a set of instructions that produces the same output. For example, one solution might be to replace an ADD(src0, src0) instruction that is supposed to execute n times with a MUL(src0, n). It is possible that not all loops can be unrolled successfully.

These first two tasks are important not only for older cards that do not support hardware branching, but for newer cards as well. Driver developers will not need to implement every hardware instruction in order to compile shaders with branches and loops, so they could use the branch emulation as a temporary solution while hardware support for branching and loops is being worked on.

3. Loops and conditionals for R500 fragment and vertex shaders:
The goal of this task is to make use of the R500 hardware support for branches and loops. New radeon_compiler opcodes (RC_OPCODE_*) will need to be added to represent loops, and the corresponding TGSI instructions will need to be converted into these new opcodes during the TGSI_OPCODE_* to RC_OPCODE_* phase. Once this has been done, the code generator for R500 vertex and fragment shaders will need to be modified to output the correct hardware instructions for loops.

4. More compiler optimizations / other GLSL features:
This is an optional task that will allow me to revisit the work from the previous tasks and explore optimizations I may have wanted to do but that were outside the scope of those tasks. If there are no obvious optimizations left, this time could be spent implementing other GLSL features for the R300 driver. Possible ideas include:

- Adding support for the gl_FrontFacing variable.
- Handling varying modifiers like perspective, flat, and centroid.
- Improving the GLSL frontend to add support for more language features.

Schedule / Deliverables:
1. Enable branch emulation for Gallium drivers (4 weeks)
2. Unroll loops for Gallium drivers (2 - 3 weeks)
Midterm Evaluation
3. Loops and conditionals for R500 fragment and vertex shaders (4 weeks)
4. More compiler optimizations / other GLSL features (2 weeks)

Tasks 1-3 will be required for this project. Task 4 is optional.

Thank you.

-Tom Stellard |
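To make the flattening strategy in task 1 above concrete, here is a minimal C sketch of the idea; it is illustrative only and not code from the proposal or from Mesa. Both sides of the conditional are executed as straight-line code, and a CND-style select keeps or rolls back each result. The condition is assumed to be 1.0 for true and 0.0 for false, as produced by comparison opcodes such as SLT or SGE.

    /* Hypothetical illustration of flattening "if (c) x = a; else x = b;"
     * into straight-line code for hardware without branch instructions.
     * cond is assumed to be 1.0 when true and 0.0 when false, as produced
     * by comparison opcodes such as SLT or SGE. */
    float flatten_if(float cond, float x, float a, float b)
    {
        float saved = x;                    /* A. save values live across the branch */
        x = a;                              /* B. run the "if true" side */
        x = (cond != 0.0f) ? x : saved;     /* C. CND: roll back if cond is false */
        saved = x;
        x = b;                              /*    run the "if false" side */
        x = (cond != 0.0f) ? saved : x;     /* D. roll back if cond is true */
        return x;
    }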
From: Corbin S. <mos...@gm...> - 2010-04-03 09:23:36
|
On Sat, Apr 3, 2010 at 3:31 PM, Tom Stellard <tst...@gm...> wrote:
> <snip - full proposal quoted from the message above>
>
> Tasks 1-3 will be required for this project. Task 4 is optional.
>
> Thank you.

Wow! Looks like you're certainly on the right track and you've been doing your research.

I would say that the first two items on your list would be fine as a complete project. TGSI streams are tricky to modify, and you may find that you have to write more and more TGSI-specific code as you dig in. (For example, there are no helpers for strength reduction in TGSI yet.)

I'll wait for everybody else to chime in, but it looks good so far.

~ C.

--
When the facts change, I change my mind. What do you do, sir? ~ Keynes

Corbin Simpson
<Mos...@gm...> |
From: Luca B. <luc...@gm...> - 2010-04-03 18:37:46
|
This is somewhat nice, but without using a real compiler, the result will still be just a toy, unless you employ hundreds of compiler experts working full time on the project.

For instance, Wikipedia lists the following loop optimizations:

- loop interchange: These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout. This is also known as loop permutation.

- loop splitting/loop peeling: Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.

- loop fusion or loop combining: Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.

- loop fission or loop distribution: Loop fission attempts to break a loop into multiple loops over the same index range, but each taking only a part of the loop's body. This can improve locality of reference, both of the data being accessed in the loop and the code in the loop's body.

- loop unrolling: Duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which may degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches and increased program load time), but requires that the number of iterations be known at compile time (except in the case of JIT compilers). Care must also be taken to ensure that multiple re-calculation of indexed variables is not a greater overhead than advancing pointers within the original loop.

- loop unswitching: Unswitching moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.

- loop inversion: This technique changes a standard while loop into a do/while (a.k.a. repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two for cases when the loop is executed. Doing so duplicates the condition check (increasing the size of the code), but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile time and is known to be side-effect-free, the if guard can be skipped.

- loop-invariant code motion: If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins. This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with loop inversion, because not all code is safe to be hoisted outside the loop.

- loop reversal: Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependencies and thus enable other optimizations. Also, certain architectures utilise looping constructs at assembly language level that count in a single direction only (e.g. decrement-jump-if-not-zero (DJNZ)).

- loop tiling/loop blocking: Loop tiling reorganizes a loop to iterate over blocks of data sized to fit in the cache.

- loop skewing: Loop skewing takes a nested loop iterating over a multidimensional array, where each iteration of the inner loop depends on previous iterations, and rearranges its array accesses so that the only dependencies are between iterations of the outer loop.

Good luck doing all this on TGSI (especially if the developer does not have serious experience writing production compilers). Also, this does not mention all the other optimizations and analyses required to do the above well (likely another 10-20 things).

Using a real compiler (e.g. LLVM, but also gcc or Open64), those optimizations are already implemented, or at least there is already a team of experienced compiler developers working full time to implement them, allowing you to just turn them on without having to do any of the work yourself.

Note that all "X compiler is bad for VLIW or whatever GPU architecture" objections are irrelevant, since almost all optimizations are totally architecture independent.

Also note that we should support OpenCL/compute shaders (already available for *3* years on e.g. nv50), and those *really* need a real compiler (as in, something developed for years by a team of compiler experts, and in wide use). For instance, nVidia uses Open64 to compile CUDA programs, and then feeds the output back (via PTX) to their ad-hoc code generator. Note that unlike Mesa/Gallium, nVidia actually had a working shader optimizer AND a large paid team, yet they still decided to at least partially use Open64.

PathScale (who seem to mainly sell an Open64-based compiler for the HPC market) might do some of this work (with a particular focus on a CUDA replacement for nv50), but it's unclear whether this will turn out to be generally useful (for all Gallium drivers, as opposed to nv50-only) or not. Also, they plan to use Open64 and WHIRL, and it's unclear whether these are as well designed for embedding, and as easy to understand and customize, as LLVM is (please expand on this if you know about it).

Really, the current code generation situation is totally _embarrassing_ (and r300 is probably one of the best here, having its own compiler, and it doesn't even have loops, so you can imagine how good the other drivers are), and ought to be fixed in a definitive fashion. This is obviously not achievable if Mesa/Gallium contributors are supposed to write the compiler optimizations themselves, since clearly there is not even enough manpower to support a relatively up-to-date version of OpenGL or, say, to have drivers that can allocate and fence GPU memory in a sensible and fast way, or implement hierarchical Z buffers, or any of the other things expected from a decent driver that the Mesa drivers don't do.

In other words, state-of-the-art optimizing compilers are not something one can just pop up and write from scratch, unless one is interested and skilled at it, it is one's main project, AND one manages to attract, or pay, a community of compiler experts to work on it. Since LLVM already works well, has a community of compiler experts working on it, and is funded by companies such as Apple, there is no chance of attracting such a community to an ad-hoc shader compiler, especially one limited to the niche of compiling shaders.

And yes, LLVM->TGSI->LLVM is not entirely trivial, but it is doable (obviously), and once you get past that initial hurdle, you get EVERYTHING FOR FREE. And the free work keeps coming with every commit to the LLVM repository, and you only have to do the minimal work of updating for LLVM interface changes. So you can just do nothing and after a few months notice that your driver is faster in very advanced games because a new LLVM release automatically improved the quality of your shaders without you even knowing about it. Not to mention that we could then at some point just get rid of TGSI, use LLVM IR directly, and have each driver implement a normal backend if possible.

The test for adequacy of a shader compiler is saying "yes, this code is really good: I can't easily come up with any way to improve it", looking at the generated code for any example you can find. Any ad-hoc compiler will most likely immediately fail such a test for complex examples.

So, for a GSoC project, I'd kind of suggest:

(1) Adapt the gallivm/llvmpipe TGSI->LLVM converter to also generate AoS code (i.e. RGBA vectors as opposed to RRRR, GGGG, etc.) if possible, or write one from scratch otherwise
(2) Write a LLVM->TGSI backend, restricted to programs without any control flow
(3) Make LLVM->TGSI always work (even with control flow and DDX/DDY)
(4) Hook up all useful LLVM optimizations (see the sketch after this message)

If there is still time/as followup (note that these are mostly complex things; at most one or two might be doable in the timeframe):

(5) Do something about uniform-specific shader generation, and support automatically generating "pre-shaders" for the CPU (using the x86/x86-64 LLVM backends) for uniform-only computations
(6) Enhance LLVM to provide any missing optimization with a significant impact
(7) Convert existing drivers to LLVM backends, or have them expose more functionality to the TGSI backend via TGSI extensions (or currently unused features such as predicate support), and do driver-specific stuff (e.g. scalarization for scalar architectures)
(8) Make sure shaders can be compiled using as large as possible a subset of plain C/C++, as well as OpenCL (using clang), and add OpenCL support to Mesa/Gallium (some of it already exists in external repositories)
(9) Compare with fglrx and nVidia libGL/cgc/nvopencc and improve whatever is necessary to be equal to or better than them
(10) Talk with LLVM developers about good VLIW code generation for the Radeons and, to a lesser extent, nv30/nv40 which need it, and find out exactly what the problem is here, how it can be solved and who could do the work
(11) Add Gallium support for nv10/nv20 and r100/r200 using the LLVM DAG instruction selector to code-generate a fixed pipeline (Stephane Marchesin tried this already; it seems it is non-trivial but could be made to work partially, and probably enough to get the Xorg state tracker to work on all cards and get rid of all X drivers at some point)
(12) Figure out if any other compilers (Open64, gcc, whatever) can be useful as backends for some drivers

Maybe I should propose to do it myself, though, if that is still possible, since everyone else seems afraid of it for some reason, and it seems to me it is absolutely essential to have a chance of having usable drivers (read: drivers that don't look ridiculous compared to the proprietary ones), especially in the long run for DirectX 11-level and later games and for software heavily using OpenCL/compute shaders and very complex tessellation/vertex/geometry/fragment shaders. |
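As a rough illustration of what step (4), "hook up all useful LLVM optimizations", could look like using LLVM's C API (the API gallivm already uses), here is a sketch. It is not existing Mesa code: the shader module would come from a hypothetical TGSI->LLVM converter, and the exact set of pass-creation functions available depends on the LLVM version.

    /* Sketch: run stock LLVM optimization passes on a shader module
     * produced by a (hypothetical) TGSI->LLVM converter, before handing
     * it to a (hypothetical) LLVM->TGSI backend.  Pass availability in
     * the C API varies between LLVM versions. */
    #include <llvm-c/Core.h>
    #include <llvm-c/Transforms/Scalar.h>

    void optimize_shader_module(LLVMModuleRef module)
    {
        LLVMPassManagerRef pm = LLVMCreatePassManager();

        LLVMAddPromoteMemoryToRegisterPass(pm); /* build SSA form */
        LLVMAddInstructionCombiningPass(pm);    /* local algebraic simplifications */
        LLVMAddGVNPass(pm);                     /* common subexpression elimination */
        LLVMAddLICMPass(pm);                    /* loop-invariant code motion */
        LLVMAddLoopUnrollPass(pm);              /* unroll loops with known trip counts */
        LLVMAddCFGSimplificationPass(pm);       /* clean up the control flow graph */

        LLVMRunPassManager(pm, module);
        LLVMDisposePassManager(pm);
    }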
From: Luca B. <luc...@gm...> - 2010-04-03 19:09:17
|
As a further example that just came to mind: nv40 (GeForce 6-7 and the PS3 RSX) supports control flow in fragment shaders, but apparently does not support the "continue" keyword (since NV_fragment_program2, which maps almost directly to the hardware, does not have it either).

I implemented TGSI control flow in a private branch, but did not implement the "continue" keyword. Implementing "continue" requires transforming the code to generate and carry around "should continue" flags, or performing even less trivial transformations, including code duplication. Unfortunately, doing so requires non-local modifications, and thus something beyond just scanning the TGSI source code the way the nv30/nv40 driver currently does.

If there was a TGSI->LLVM->TGSI module, the LLVM->TGSI control flow reconstruction would already handle this, and it would be enough to tell it not to make use of the "continue" instruction: it would then automatically generate the proper if/endif structure, duplicating code and/or introducing flags as needed in a generic way.

As things stand now, I'm faced with either just hoping the GLSL programs don't use "continue", implementing a hack in the nv40 shader backend (where such a high-level optimization does not belong at all and can't be done cleanly), or writing the LLVM module myself before tackling this. With an LLVM-based infrastructure, there would be a clear and straightforward way to solve this, with all the supporting infrastructure already available and the ability to create an optimization pass reusable by other drivers that may face the same issue.

This is just an example, by the way: others can be found. |
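For reference, a small C sketch of the "should continue" flag idea mentioned above (purely illustrative; want_to_skip() and rest_of_body() are made-up placeholder names, not part of any driver): the continue is replaced by clearing a per-iteration flag, and the remainder of the loop body is predicated on that flag.

    /* Sketch: emulating "continue" on hardware that has if/endif but no
     * continue instruction.  want_to_skip() and rest_of_body() are
     * hypothetical placeholders for the real loop body. */
    int  want_to_skip(int i);
    void rest_of_body(int i);

    void loop_without_continue(int n)
    {
        for (int i = 0; i < n; i++) {
            int keep_going = 1;         /* cleared instead of executing "continue" */

            if (want_to_skip(i))
                keep_going = 0;         /* was: continue; */

            if (keep_going)
                rest_of_body(i);        /* the rest of the body is predicated */
        }
    }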
From: Luca B. <luc...@gm...> - 2010-04-03 19:39:43
|
By the way, if you want a simple, limited and temporary, but very effective, way to optimize shaders, here it is:

1. Trivially convert the TGSI to GLSL.
2. Feed the GLSL to the nVidia Cg compiler, telling it to produce optimized output in ARB_fragment_program format.
3. Ask the Mesa frontend/state tracker to parse the ARB_fragment_program and give you back TGSI.

This does actually optimize the program well and does all the nice control flow transformations desired. If your GPU supports predicates or condition codes, you can also ask the Cg compiler to give you NV_fragment_program_option, which will use them efficiently. If it also supports control flow, you can ask for NV_fragment_program2 and get control flow too, where appropriate.

Of course, if this does not happen to do exactly what you want, you are totally out of luck, since it is closed source.

With an ad-hoc TGSI optimizer, you can modify it, but that will often require rearchitecting the module, since it may be too primitive for the new feature you want, and implementing everything from scratch with no supporting tools to help you. With a real compiler framework, you already have the optimization ready for use, or you at least have a comprehensive conceptual framework and IR and a full set of analyses, frameworks and tools to use, not to mention a whole community of compiler developers who can at least tell you the best way of doing what you want (actually giving out competent advice), if they have not already done it or planned to do it themselves. |
From: Tom S. <tst...@gm...> - 2010-04-03 19:45:10
|
On Sat, Apr 03, 2010 at 08:37:39PM +0200, Luca Barbieri wrote:
> This is somewhat nice, but without using a real compiler, the result
> will still be just a toy, unless you employ hundreds of compiler
> experts working full time on the project.
>
> <SNIP - loop optimization techniques from Wikipedia>
>
> Good luck doing all this on TGSI (especially if the developer does not
> have serious experience writing production compilers).

I agree with you that doing these kinds of optimizations is a difficult task, but I am trying to focus my proposal on emulating branches and loops for older hardware that doesn't have branching instructions, rather than performing global optimizations on the TGSI code. I don't think most of the loop optimizations you listed are even possible on hardware without branching instructions.

> <SNIP - argument for reusing an existing compiler (LLVM, gcc, Open64)
> instead of an ad-hoc TGSI optimizer>
>
> The test for adequacy of a shader compiler is saying "yes, this code is
> really good: I can't easily come up with any way to improve it", looking
> at the generated code for any example you can find. Any ad-hoc compiler
> will most likely immediately fail such a test for complex examples.

I think that part of the advantage of my proposal is that the branch instruction translation is done on the TGSI code. So, even if the architecture of the GLSL compiler is changed to something like LLVM->TGSI->LLVM, these translations can still be applied by hardware that needs them.

> So, for a GSoC project, I'd kind of suggest:
> (1) Adapt the gallivm/llvmpipe TGSI->LLVM converter to also generate
> AoS code (i.e. RGBA vectors as opposed to RRRR, GGGG, etc.) if possible,
> or write one from scratch otherwise
> (2) Write a LLVM->TGSI backend, restricted to programs without any control flow
> <SNIP - items (3)-(12)>

I think (2) is probably the closest to what I am proposing, and it is something I can take a look at.

Thanks for your feedback.

-Tom Stellard |
From: Luca B. <luc...@gm...> - 2010-04-03 21:17:53
|
> I agree with you that doing these kinds of optimizations is a difficult
> task, but I am trying to focus my proposal on emulating branches and
> loops for older hardware that doesn't have branching instructions, rather
> than performing global optimizations on the TGSI code. I don't think
> most of the loop optimizations you listed are even possible on hardware
> without branching instructions.

Yes, that's possible. In fact, if you unroll loops, those optimizations can be done after loop unrolling. This does not necessarily change things, however: while you can e.g. avoid loop-invariant code motion, you still need common subexpression elimination to remove the multiple redundant copies of the loop-invariant code generated by unrolling. Also, even loop unrolling needs to find the number of iterations, which at the very least requires simple constant folding, and potentially a whole suite of complex optimizations to work in all possible cases. Some of the challenges of this were mentioned in a previous thread, as well as LLVM-related issues.

>> (2) Write a LLVM->TGSI backend, restricted to programs without any control flow
>
> I think (2) is probably the closest to what I am proposing, and it is
> something I can take a look at.

Note that this means an _input_ program without control flow, that is, a control flow graph with a single basic block. Once you have more than one basic block, you need to convert the CFG from an arbitrary graph to something made of structured loops and conditionals.

The problem here is that GPUs often use a "SIMT" approach. This means that the GPU internally works like an SSE CPU with vector registers (but often much wider, with up to 32 elements or even more). However, this is hidden from the programmer by putting the variables related to several pixels in the vector, making you think everything is a scalar or just a 4-component vector.

This works fine as long as there is no control flow; however, when you reach a conditional jump, some pixels may want to take one path and some others another path. The solution is to have an "execution mask" and not write to any pixels that are not in the execution mask. When an if/else/endif structure is encountered and the pixels all take the same path, things work as on a CPU; if that is not the case, both branches are executed with the appropriate execution masks, and things continue normally after the endif. (A small sketch of this masking behavior follows below.)

The problem is that this needs a structured if/else/endif formulation as opposed to arbitrary gotos. However, LLVM and most optimizers work on an arbitrary-goto formulation, which needs to be converted to a structured approach.

The above all applies to GPUs with hardware control flow. However, even without it, you have the same issue of reconstructing if/else/endif blocks, since you basically need to do the same thing in software, using the if conditional to choose between the results computed by the branches.

Converting a control flow graph to a structured program is always possible, but doing it well requires some thought. In particular, you need to be careful not to break DDX instructions, which operate on a 2x2 block of pixels and will thus behave differently if some of the other pixels in the block have diverged away due to control flow modifications. This may require making sure control flow optimizations do not duplicate them, and possibly dealing with other issues.
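A minimal C sketch of the execution-mask behavior described above (purely illustrative; real hardware does this per lane within a single SIMD instruction stream): every lane runs both sides of the if, and each side only writes the lanes selected by the mask.

    /* Sketch: SIMT-style execution masking for "if (cond) x *= 2; else x += 1;"
     * over a group of lanes.  WIDTH is an arbitrary illustrative lane count. */
    #define WIDTH 16

    void masked_if(float x[WIDTH], const int cond[WIDTH])
    {
        int mask[WIDTH];

        for (int lane = 0; lane < WIDTH; lane++)
            mask[lane] = cond[lane];        /* execution mask from the condition */

        for (int lane = 0; lane < WIDTH; lane++)
            if (mask[lane])
                x[lane] = x[lane] * 2.0f;   /* "then" side: only where the mask is set */

        for (int lane = 0; lane < WIDTH; lane++)
            if (!mask[lane])
                x[lane] = x[lane] + 1.0f;   /* "else" side: only where the mask is clear */
    }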
Using an ad-hoc optimizer does indeed sidestep the issue, but only as long as you don't try to do non-trivial control flow optimizations or changes. In that case, those may be best expressed on an arbitrary control flow graph (e.g. the issue of converting "continue" to if/endif), and at that point you would need to add that logic anyway.

At any rate, I'm not sure whether this is suitable for your GSoC project or not. My impression is that using an existing compiler would prove to be more widely useful and longer lasting, especially considering that we are moving towards applications and hardware with very complex shader support (consider CUDA/OpenCL shaders and the very generic GPU shading capabilities). An ad-hoc TGSI optimizer would probably prove unsuitable for efficient code generation for, say, scientific applications using OpenCL, and would need to be replaced later.

So my personal impression (which could be wrong) is that using an existing optimizer, while possibly requiring a higher initial investment, should have much better payoffs in the long run, by making everything beyond the initial TGSI->LLVM->TGSI work already done or easier to do.

From a coding perspective, you lose the "design and write everything myself from scratch" aspect, but you gain experience with a complex, real-world compiler, and you are able to write more complex optimizations and transforms thanks to having a well-developed infrastructure for expressing them easily. Furthermore, using a real compiler would hopefully result in seeing your work produce very good code in all cases, while an ad-hoc optimizer would improve the current situation, but most likely the resulting code would still be blatantly suboptimal. Another advantage would presumably be seeing the work used indefinitely and built upon for projects such as OpenCL/compute shader support. It may be more or less time consuming, depending on the level of sophistication of the ad-hoc optimizer.

By the way, it would be interesting to know what people who are working on related things think about this (CCed them). In particular, Zack Rusin has worked extensively with LLVM and, I think, a prototype OpenCL implementation. Also, PathScale is interested in GPU code generation and may contribute something based on Open64 and its IR, WHIRL. However, I'm not sure whether that could work as a general optimizing framework, or instead just as a backend code generator for some drivers (e.g. nv50). In particular, it may be possible to use LLVM to do architecture-independent optimizations and then convert to WHIRL if such a backend is available for the targeted GPU. BTW, LLVM seems to me superior to Open64 as an easy-to-use framework for flexibly running existing optimization passes and writing your own (due to the unified IR and its existing wide adoption for this purpose), so we may want to have it even if Open64-based GPU backends were to become available; however, I might be wrong on this.

The way I see it, this is a fundamental Mesa/Gallium issue, and should really be solved in a lasting way. See the previous thread for more detailed discussion of the technical issues of an LLVM-based implementation.

Again, I'm not sure whether this is appropriate for this GSoC project, but it seemed quite worthwhile to raise the issue, since if I'm correct, using an existing optimizer (LLVM is the default candidate here) could produce better results and avoid ad-hoc work that would be scrapped later. I may consider doing this myself, either as a GSoC proposal if that is still possible, or otherwise, if no one else does it before and time permits (the latter issue is the major problem here...) |
From: Zack R. <za...@vm...> - 2010-04-03 22:11:58
|
On Saturday 03 April 2010 17:17:46 Luca Barbieri wrote:
> >> (2) Write a LLVM->TGSI backend, restricted to programs without any
> >> control flow
> >
> > I think (2) is probably the closest to what I am proposing, and it is
> > something I can take a look at.
<snip>
> By the way, it would be interesting to know what people who are
> working on related things think about this (CCed them).
> In particular, Zack Rusin has worked extensively with LLVM and, I think,
> a prototype OpenCL implementation.

From the compute-support point of view, LLVM->TGSI translation isn't even about optimizations, it's about "working". Writing a full C/C++ compiler that generates TGSI is a lot less realistic than reusing Clang and writing a TGSI code-generator for it. So the LLVM code-generator for TGSI would be a very high impact project for Gallium. Obviously a code-generator that can handle control-flow (to be honest I'm really not sure why you want to restrict it to something without control-flow in the first place).

Having said that, I'm not sure whether this is something that's a good GSOC project. It's a fairly difficult piece of code to write, and one that, to do right, will depend on adding some features to TGSI (a good source of inspiration for those would be AMD's CAL and NVIDIA's PTX:
http://developer.amd.com/gpu_assets/ATI_Intermediate_Language_(IL)_Specification_v2b.pdf
http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf )

I thought the initial proposal was likely a lot more feasible for a GSOC (of course, one has to point out that Mesa's GLSL compiler already does unroll loops and in general simplifies control flow, so points #1 and #2 are largely no-ops, but surely there's enough work on Gallium Radeon drivers left to keep Tom busy). Otherwise, having a well-defined and reduced scope with clear deliverables would be rather necessary for LLVM->TGSI code, because that is not something that you could get rock solid over a summer.

z |
From: Corbin S. <mos...@gm...> - 2010-04-03 22:26:39
|
On Sat, Apr 3, 2010 at 3:10 PM, Zack Rusin <za...@vm...> wrote:
> On Saturday 03 April 2010 17:17:46 Luca Barbieri wrote:
>> <snipped walls of text>
>
> <snip>
>
> I thought the initial proposal was likely a lot more feasible for a GSOC (of
> course, one has to point out that Mesa's GLSL compiler already does unroll
> loops and in general simplifies control flow, so points #1 and #2 are largely
> no-ops, but surely there's enough work on Gallium Radeon drivers left to keep
> Tom busy). Otherwise, having a well-defined and reduced scope with clear
> deliverables would be rather necessary for LLVM->TGSI code, because that is
> not something that you could get rock solid over a summer.

Agreed. There are some things here that need to be kept in mind:

1) r300/r500 are not architectures powerful enough to merit general compilation, and they don't mesh well with LLVM. The hand-written optimizations we already have in place are fine for these chipsets.

2) We should leverage LLVM when possible, since we're going to be increasingly dependent on it anyway.

3) Common code goes up, specialized code goes down. That's the entire point of Gallium. Specialized compiler passes that operate on TGSI but are only consumed by one driver should move down into the driver.

I think that the first two parts of Tom's original proposal would be better spent on r300 only, taking nha's r300g-glsl work and cleaning and perfecting it. If we can pass all of the GLSL tests (save for the NOISE test) on r300, then we will be far better off than working on TGSI towards the same goal.

~ C.

--
When the facts change, I change my mind. What do you do, sir? ~ Keynes

Corbin Simpson
<Mos...@gm...> |
From: Marek O. <ma...@gm...> - 2010-04-03 22:58:44
|
On Sun, Apr 4, 2010 at 12:10 AM, Zack Rusin <za...@vm...> wrote:
> I thought the initial proposal was likely a lot more feasible for a GSOC (of
> course, one has to point out that Mesa's GLSL compiler already does unroll
> loops and in general simplifies control flow, so points #1 and #2 are largely
> no-ops, but surely there's enough work on Gallium Radeon drivers left to keep
> Tom busy). Otherwise, having a well-defined and reduced scope with clear
> deliverables would be rather necessary for LLVM->TGSI code, because that is
> not something that you could get rock solid over a summer.

It doesn't seem to simplify branches or unroll loops that much, if at all. It fails even for the simplest cases like this one:

if (gl_Vertex.x < 30.0)
    gl_FrontColor = vec4(1.0, 0.0, 0.0, 0.0);
else
    gl_FrontColor = vec4(0.0, 1.0, 0.0, 0.0);

This gets translated to TGSI "as is", which is fairly... you know what.

-Marek |
From: Zack R. <za...@vm...> - 2010-04-04 00:09:18
|
On Saturday 03 April 2010 18:58:36 Marek Olšák wrote:
> On Sun, Apr 4, 2010 at 12:10 AM, Zack Rusin <za...@vm...> wrote:
> > I thought the initial proposal was likely a lot more feasible for a GSOC (of
> > course, one has to point out that Mesa's GLSL compiler already does unroll
> > loops and in general simplifies control flow, so points #1 and #2 are largely
> > no-ops, but surely there's enough work on Gallium Radeon drivers left to keep
> > Tom busy). Otherwise, having a well-defined and reduced scope with clear
> > deliverables would be rather necessary for LLVM->TGSI code, because that is
> > not something that you could get rock solid over a summer.
>
> It doesn't seem to simplify branches or unroll loops that much, if at all.

It does for cases where the arguments are known.

> It fails even for the simplest cases like this one:
>
> if (gl_Vertex.x < 30.0)

which is unknown at compilation time.

z |
From: Luca B. <luc...@gm...> - 2010-04-03 23:08:06
|
> So the LLVM code-generator for TGSI would be a very high impact project for
> Gallium. Obviously a code-generator that can handle control-flow (to be honest
> I'm really not sure why you want to restrict it to something without
> control-flow in the first place).

The no-control-flow restriction was just for the first step, with a second step supporting everything.

> Having said that, I'm not sure whether this is something that's a good GSOC
> project. It's a fairly difficult piece of code to write, and one that, to do
> right, will depend on adding some features to TGSI (a good source of
> inspiration for those would be AMD's CAL and NVIDIA's PTX:
> http://developer.amd.com/gpu_assets/ATI_Intermediate_Language_(IL)_Specification_v2b.pdf
> http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf )

This would be required to handle arbitrary LLVM code (e.g. for clang/OpenCL use), but since GLSL shader code starts out as TGSI, it should be possible to convert it back without extending TGSI.

> I thought the initial proposal was likely a lot more feasible for a GSOC (of
> course, one has to point out that Mesa's GLSL compiler already does unroll
> loops and in general simplifies control flow, so points #1 and #2 are largely
> no-ops, but surely there's enough work on Gallium Radeon drivers left to keep
> Tom busy). Otherwise, having a well-defined and reduced scope with clear
> deliverables would be rather necessary for LLVM->TGSI code, because that is
> not something that you could get rock solid over a summer.

I'd say, as an initial step, restrict it to code produced by TGSI->LLVM (AoS) that can be expressed with no intrinsics, having a single basic block, with no optimization passes having been run on it. All four restrictions (from TGSI->LLVM, no intrinsics, single BB, and no optimizations) can then be lifted in successive iterations.

Of course, yes, this has a different scope than the original proposal. The problem I see is that since OpenCL will hopefully be done at some point, then as you say TGSI->LLVM will also be done, and that will probably make any other optimization work irrelevant. So basically the r300 optimization work looks doomed from the beginning to be eventually obsoleted. That said, you may want to do it anyway.

But if you really want a quick fix for r300, seriously, just use the nVidia Cg compiler. It's closed source, but being produced by the nVidia team, you can generally rely on it not sucking. It takes GLSL input and spits out optimized ARB_fragment_program (or optionally other languages), so it is trivial to interface with. It could even be useful to compare its output/performance with a more serious LLVM-based solution, to make sure we get the latter right.

For instance, I personally did work on the nv30/nv40 shader assembler (note the word "assembler" here), and haven't done anything more than simple local transforms, for exactly this reason. The only thing I've done towards LLVM->TGSI is trying to recover Stephane Marchesin's work on LLVM (forgot to CC him too), which was lost in a hard drive crash, but I failed to find anyone who had pulled it. |
From: Zack R. <za...@vm...> - 2010-04-04 00:42:34
|
On Saturday 03 April 2010 19:07:59 Luca Barbieri wrote:
> > Gallium. Obviously a code-generator that can handle control-flow (to be
> > honest I'm really not sure why you want to restrict it to something
> > without control-flow in the first place).
>
> The no-control-flow restriction was just for the first step, with a second
> step supporting everything.

k, that's good.

> > Having said that, I'm not sure whether this is something that's a good
> > GSOC project. It's a fairly difficult piece of code to write, and one that,
> > to do right, will depend on adding some features to TGSI (a good source of
> > inspiration for those would be AMD's CAL and NVIDIA's PTX:
> > http://developer.amd.com/gpu_assets/ATI_Intermediate_Language_(IL)_Specification_v2b.pdf
> > http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf )
>
> This would be required to handle arbitrary LLVM code (e.g. for
> clang/OpenCL use), but since GLSL shader code starts out as TGSI, it
> should be possible to convert it back without extending TGSI.

Which of course means you have to have that reduced scope and those well-defined constraints that I mentioned. Otherwise it's gonna be impossible to judge the success of the project.

> I'd say, as an initial step, restrict it to code produced by
> TGSI->LLVM (AoS) that can be expressed with no intrinsics, having a
> single basic block, with no optimization passes having been run on it.
> All four restrictions (from TGSI->LLVM, no intrinsics, single BB, and no
> optimizations) can then be lifted in successive iterations.

Yes, that's all fine; just like the above, it would simply have to be defined, e.g. no texture sampling (since for that stuff we'd obviously want our intrinsics) and whatever other features go with it.

> The problem I see is that since OpenCL will hopefully be done at some
> point, then as you say TGSI->LLVM will also be done, and that will
> probably make any other optimization work irrelevant.

OpenCL has no need for a TGSI->LLVM translation. It deals only with LLVM IR inside.

> So basically the r300 optimization work looks doomed from the
> beginning to be eventually obsoleted.

Well, if that were the attitude we'd never get anything done: in 10 years the work we're doing right now will be obsolete, in 50 Gallium in general will probably be obsolete, and in 100 we'll be dead (except me, I decided that I'll live forever and so far so good), so what's the point? Writing something simple well is still a lot better than writing something hard badly.

The point of GSOC is not to nail your first Nobel prize, it's to contribute to a Free Software project and ideally keep you interested so that you keep contributing. Picking insanely hard projects is counterproductive even if technically they do make sense. Just like for a GSOC on the Linux kernel you'd suggest someone improve Ext4 rather than write a whole new file system, even if long term you'll want something better than Ext4 anyway. Or at least that's what I'd suggest, but that's probably because, in general, I'm just not into sadism.

z |
From: Marek O. <ma...@gm...> - 2010-04-03 23:09:58
|
On Sat, Apr 3, 2010 at 9:31 AM, Tom Stellard <tst...@gm...> wrote:
> 1. Enable branch emulation for Gallium drivers:
> The goal of this task is to create an optional "optimization" pass
> over the TGSI code that translates branch instructions into instructions
> supported by cards without hardware branching.
> <snip - rest of the branch emulation strategy from the original proposal>

First, I really appreciate that you're looking into this. I'd like to propose something doable in the GSoC timeframe.

Since Nicolai has already implemented the branch emulation and some other optimizations, it would be nice to take over his work. I tried to use the branch emulation on vertex shaders and it did not work correctly; I guess it needs a little fixing. See this branch in his repo:
http://cgit.freedesktop.org/~nh/mesa/log/?h=r300g-glsl

Especially this commit, which implements exactly what you propose (see the comments in the code):
http://cgit.freedesktop.org/~nh/mesa/commit/?h=r300g-glsl&id=71c8d4c745da23b0d4f3974353b19fad89818d7f

Reusing this code for Gallium seems more reasonable to me than reinventing the wheel and doing basically the same thing elsewhere. I recommend implementing a TGSI backend in the r300 compiler, which will make it possible to use it with TGSI shaders. So basically a TGSI shader would be converted to the RC representation the way it's done in r300g right now, and the code for converting RC -> hw code would be replaced by a conversion RC -> TGSI. Both RC and TGSI are very similar, so it'll be pretty straightforward. With a TGSI backend, another step would be to make a nice hw-independent and configurable interface on top of it, which should go to util (a rough sketch of what such an interface could look like appears after this message). So far it's simple; now comes some real work: fixing the branch emulation and continuing from (2) in your list.

Then it'll be up to the developers of other drivers whether they want to implement their own hw-specific optimization passes and lowering transformations.
Even linking various shaders would be much easier done with the compiler (and more efficient, thanks to its elimination of dead code due to removed shader outputs/inputs); this is used in classic r300, and I recall Luca wanted such a feature in the nouveau drivers. There is also an emulation of shadow samplers, WPOS, and an emulation of various instructions, so this is a nice and handy tool. (I would do it but I have a lot of more important stuff to do.)

This may really help Gallium drivers until a real optimization framework emerges.

-Marek |
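For illustration only, here is a guess at what the hw-independent util interface Marek describes might look like in C. Nothing below exists in Mesa: the struct, the option names, and util_lower_control_flow() are all hypothetical, and the real design would be up to whoever implements it.

    /* Hypothetical sketch of a hw-independent "lowering" entry point built on
     * top of the r300 compiler's generic passes (TGSI -> RC -> transforms ->
     * TGSI).  None of these names exist in Mesa; they are illustrative only. */
    #include <stdbool.h>

    struct tgsi_token;   /* Gallium's TGSI token stream */

    struct util_lower_flow_options {
        bool emulate_branches;   /* flatten IF/ELSE/ENDIF into select-style code */
        bool unroll_loops;       /* unroll loops with a known iteration count */
    };

    /* Returns a newly allocated token stream that the caller owns, or NULL on
     * failure (e.g. a loop that cannot be unrolled). */
    const struct tgsi_token *
    util_lower_control_flow(const struct tgsi_token *tokens,
                            const struct util_lower_flow_options *options);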
From: Marek O. <ma...@gm...> - 2010-04-03 23:31:44
|
On Sun, Apr 4, 2010 at 1:07 AM, Luca Barbieri <luc...@gm...> wrote:
> So basically the r300 optimization work looks doomed from the
> beginning to be eventually obsoleted.

Please consider that there are hw-specific optimizations in place which I think no other compiler framework provides, and I believe this SSA thing will do an even better job for the superscalar r600. So yes, we need both: LLVM to do global optimizations, and RC to efficiently map the code to the hw.

-Marek |
From: Luca B. <luc...@gm...> - 2010-04-03 23:41:52
|
>> So basically the r300 optimization work looks doomed from the
>> beginning to be eventually obsoleted.
>
> Please consider that there are hw-specific optimizations in place which I think
> no other compiler framework provides, and I believe this SSA thing will do

Sure, but it seemed to me that all the optimizations proposed were hardware-independent and valid for any driver (other than having to know about generic capabilities, like whether control flow is available).

> an even better job for the superscalar r600. So yes, we need both: LLVM to do
> global optimizations, and RC to efficiently map the code to the hw.

LLVM also uses SSA form (actually, it is totally built around it), assuming that's what you meant.

There are doubts about whether the LLVM backend framework works well for GPUs or not (apparently because some GPUs are VLIW, and only IA-64 is VLIW among LLVM's targets, so LLVM support for it is either nonexistent or not necessarily a major focus), but using LLVM->TGSI makes this irrelevant, since the existing TGSI-accepting backends will still run. |
From: Tom S. <tst...@gm...> - 2010-04-04 04:16:29
|
On Sun, Apr 04, 2010 at 01:09:51AM +0200, Marek Olšák wrote: > > Since Nicolai has already implemented the branch emulation and some other > optimizations, it would be nice to take over his work. I tried to use the > branch emulation on vertex shaders and it did not work correctly; I guess it > needs a little fixing. See this branch in his repo: > http://cgit.freedesktop.org/~nh/mesa/log/?h=r300g-glsl > Especially this commit implements exactly what you propose (see the comments in > the code): > http://cgit.freedesktop.org/~nh/mesa/commit/?h=r300g-glsl&id=71c8d4c745da23b0d4f3974353b19fad89818d7f > > Reusing this code for Gallium seems more reasonable to me than reinventing > the wheel and doing basically the same thing elsewhere. I recommend > implementing a TGSI backend in the r300 compiler, which will make it possible > to use it with TGSI shaders. So basically a TGSI shader would be converted to > the RC representation the way it's done in r300g right now, and the code for > converting RC -> hw code would get replaced by a conversion RC -> TGSI. > RC and TGSI are very similar, so it'll be pretty straightforward. With a TGSI > backend, another step would be to make a nice hw-independent and > configurable interface on top of it, which should go to util. So far it's > simple; now comes the real work: fixing the branch emulation and continuing > from (2) in your list. I am not sure if I follow you here, so let me know if I am understanding this correctly. What you are suggesting is to take Nicolai's branch, which right now does TGSI -> RC -> Branch Emulation in RC -> hw code, and instead of converting from RC to hw code, convert from RC back into TGSI. Then, pull the TGSI -> RC -> Branch Emulation in RC -> TGSI path out of the r300 compiler and place it in gallium/auxiliary/util so it can be used by other Gallium drivers that want to emulate branches. Is this correct? -Tom |
From: Marek O. <ma...@gm...> - 2010-04-04 05:30:08
|
On Sun, Apr 4, 2010 at 6:14 AM, Tom Stellard <tst...@gm...> wrote: > On Sun, Apr 04, 2010 at 01:09:51AM +0200, Marek Olšák wrote: > > > > Since Nicolai has already implemented the branch emulation and some other > > optimizations, it would be nice to take over his work. I tried to use the > > branch emulation on vertex shaders and it did not work correctly; I guess it > > needs a little fixing. See this branch in his repo: > > http://cgit.freedesktop.org/~nh/mesa/log/?h=r300g-glsl > > Especially this commit implements exactly what you propose (see the comments in > > the code): > > http://cgit.freedesktop.org/~nh/mesa/commit/?h=r300g-glsl&id=71c8d4c745da23b0d4f3974353b19fad89818d7f > > > > Reusing this code for Gallium seems more reasonable to me than reinventing > > the wheel and doing basically the same thing elsewhere. I recommend > > implementing a TGSI backend in the r300 compiler, which will make it possible > > to use it with TGSI shaders. So basically a TGSI shader would be converted to > > the RC representation the way it's done in r300g right now, and the code for > > converting RC -> hw code would get replaced by a conversion RC -> TGSI. > > RC and TGSI are very similar, so it'll be pretty straightforward. With a TGSI > > backend, another step would be to make a nice hw-independent and > > configurable interface on top of it, which should go to util. So far it's > > simple; now comes the real work: fixing the branch emulation and continuing > > from (2) in your list. > > I am not sure if I follow you here, so let me know if I am understanding > this correctly. What you are suggesting is to take Nicolai's branch, > which right now does TGSI -> RC -> Branch Emulation in RC -> hw code, and > instead of converting from RC to hw code, convert from RC back into TGSI. > That's right. > Then, pull the TGSI -> RC -> Branch Emulation in RC -> TGSI path out of > the r300 compiler and place it in gallium/auxiliary/util so it can be used > by other Gallium drivers that want to emulate branches. Is this correct? > Sorry, I should have been clearer. The whole RC may stay in src/mesa/drivers/dri/r300/compiler as it is now. I think these are the parts that should go to util:
- TGSI -> RC conversion
- RC -> TGSI conversion
- A hw-independent interface to the compiler, i.e. one function (or more) which takes a TGSI shader and returns a TGSI shader. It should do both conversions above and use r300/compiler directly.
In the long term, the compiler should probably be moved to src/compiler or something like that (since both classic and gallium drivers may use it), but you don't need to care about that if you don't want to. -Marek |
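To make the interface Marek describes concrete, here is a minimal C sketch of a TGSI-in, TGSI-out util entry point. Every identifier is hypothetical (prefixed hyp_) and only mirrors the TGSI -> RC -> TGSI flow; it is not existing Mesa or r300 compiler API.

    /* opaque types: tgsi_token is the real Gallium token stream, while the
     * RC program type stands in for the r300 compiler's representation */
    struct tgsi_token;
    struct hyp_rc_program;

    /* hypothetical helpers for the two conversions plus one RC-level pass */
    struct hyp_rc_program *hyp_tgsi_to_rc(const struct tgsi_token *tokens);
    void hyp_rc_emulate_branches(struct hyp_rc_program *prog);
    const struct tgsi_token *hyp_rc_to_tgsi(struct hyp_rc_program *prog);

    /* hw-independent entry point: any Gallium driver could call this to get
     * a lowered shader back without knowing anything about RC internals */
    const struct tgsi_token *
    hyp_util_tgsi_lower_control_flow(const struct tgsi_token *in)
    {
        struct hyp_rc_program *prog = hyp_tgsi_to_rc(in);  /* TGSI -> RC   */
        hyp_rc_emulate_branches(prog);                     /* RC passes    */
        return hyp_rc_to_tgsi(prog);                       /* RC -> TGSI   */
    }

A configurable version would presumably take an options struct (which passes to run, which control-flow opcodes the hardware actually supports), but that is beyond this sketch.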
From: Nicolai H. <nha...@gm...> - 2010-04-04 21:51:34
|
Hi, On Sat, Apr 3, 2010 at 3:31 AM, Tom Stellard <tst...@gm...> wrote: > I have completed a first draft of my Google Summer of Code > proposal, and I would appreciate feedback from some of the > Mesa developers. I have included the project plan from my > proposal in this email, and you can also view my full proposal here: > http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/tstellar/t126997450856 > However, I think you will need a google login to view it. I like your proposal, and I believe it's quite workable as a GSoC project as far as scope is concerned. The LLVM points mentioned are valid, but since you seem to be interested in focusing on R300, I believe you are absolutely on the right track: if you complete the tasks outlined in your project, it will be a very tangible step forward *right now*. <snip> > Schedule / Deliverables: > 1. Enable branch emulation for Gallium drivers (4 weeks) > 2. Unroll loops for Gallium drivers (2 - 3 weeks) I believe your time estimate for loop unrolling should be longer, and the one for branch emulation probably shorter, particularly since there is existing work: branch emulation should go much quicker, and I don't think the part about re-translating to TGSI is too difficult. > Midterm Evaluation > 3. Loops and Conditionals for R500 fragment and vertex shaders (4 weeks) This part is in some ways the most interesting one, and even if we do go to LLVM eventually (though I doubt that's going to happen any time soon), it will still be needed. It would definitely be great to see this project come to fruition! cu, Nicolai |
From: Nicolai H. <nha...@gm...> - 2010-04-04 22:11:24
|
On Sat, Apr 3, 2010 at 2:37 PM, Luca Barbieri <luc...@gm...> wrote: > Note all "X compiler is bad for VLIW or whatever GPU architecture" > objections are irrelevant, since almost all optimizations are totally > architecture independent. Way back I actually looked into LLVM for R300. I was totally unconvinced by their vector support back then, but that may well have changed. In particular, I'm curious about how LLVM deals with writemasks. Writing to only a select subset of the components of a vector is something I've seen in a lot of shaders, but it doesn't seem to be too popular in CPU-bound SSE code, which is probably why LLVM didn't support it well. Has that improved? The trouble with writemasks is that it's not something you can just implement one module for. All your optimization passes, from a simple peephole to the smartest loop modifications, need to understand the meaning of writemasks. > This is obviously not achievable if Mesa/Gallium contributors are > supposed to write the compiler optimizations themselves, since clearly > there is not even enough manpower to support a relatively up-to-date > version of OpenGL or, say, to have drivers that can allocate and fence > GPU memory in a sensible and fast way, or implement hierarchical Z > buffers, or any of the other things expected from a decent driver, > that the Mesa drivers don't do. I agree, though if I were to start an LLVM-based compilation project, I would do it for R600+, not for R300. That would be a very different kind of project. > So, for a GSoC project, I'd kind of suggest: > (1) Adapt the gallivm/llvmpipe TGSI->LLVM converter to also generate > AoS code (i.e. RGBA vectors as opposed to RRRR, GGGG, etc.) if > possible or write one from scratch otherwise > (2) Write an LLVM->TGSI backend, restricted to programs without any control flow > (3) Make LLVM->TGSI always work (even with control flow and DDX/DDY) > (4) Hook up all useful LLVM optimizations An LLVM->TGSI conversion is not the best way to go because TGSI doesn't match the hardware all that well, at least in the Radeon family. R300-R500 fragment programs have the weird RGB/A split, and R600+ is yet another beast that looks quite different from TGSI. So at least for Radeon, I believe it would be best to generate hardware-level instructions directly from LLVM, possibly via some Radeon-family-specific intermediate representation. The thing is, a lot of the optimizations in the r300 compiler are actually there *because* TGSI (and Mesa instructions) are not a good match for what the hardware looks like. So replacing those optimizations with an LLVM pass whose effect is then lost in the drop back to TGSI seems a bit silly. In a way, this is rather frustrating when dealing with the assembly produced by the Mesa GLSL compiler. That compiler is rather well-meaning and tries to deal well with scalar values, but those "optimizations" are actually counterproductive for Radeon, because they end up, e.g., using instructions like RCP and RSQ on one of the RGB components, which happens to be a really bad idea. It would be nice if we could feed e.g. LLVM IR into the Gallium driver instead of TGSI, and let the Gallium driver worry about all optimizations. Anyway, I'm convinced that LLVM (or something like it) is necessary for the future. However, for this particular GSoC proposal, it's off the mark. cu, Nicolai |
From: Luca B. <luc...@gm...> - 2010-04-05 02:42:22
|
> Way back I actually looked into LLVM for R300. I was totally > unconvinced by their vector support back then, but that may well have > changed. In particular, I'm curious about how LLVM deals with > writemasks. Writing to only a select subset of the components of a vector > is something I've seen in a lot of shaders, but it doesn't seem to be > too popular in CPU-bound SSE code, which is probably why LLVM didn't > support it well. Has that improved? > > The trouble with writemasks is that it's not something you can just > implement one module for. All your optimization passes, from a simple > peephole to the smartest loop modifications, need to understand the > meaning of writemasks. You should be able to just use shufflevector/insertelement/extractelement to mix the newly computed values with the previous values in the vector register (as well as doing swizzles). There is also the option of immediately scalarizing, optimizing the scalar code, and then revectorizing. This risks pessimizing the input code, but might turn out to work well. > I agree, though if I were to start an LLVM-based compilation project, > I would do it for R600+, not for R300. That would be a very different > kind of project. > An LLVM->TGSI conversion is not the best way to go because TGSI doesn't > match the hardware all that well, at least in the Radeon family. > R300-R500 fragment programs have the weird RGB/A split, and R600+ is > yet another beast that looks quite different from TGSI. So at least > for Radeon, I believe it would be best to generate hardware-level > instructions directly from LLVM, possibly via some Radeon-family > specific intermediate representation. The advantage of LLVM->TGSI would be that it works with all drivers without any driver-specific code, so it probably makes sense as an initial step. nv30/nv40 fragment programs map almost directly to TGSI (with the addition of condition codes, half-float precision, and a few other things). Things that end up targeting an existing graphics API, like vmware svga, or that use the LLVM optimizer for game development, also need TGSI-like output. Thus, even if TGSI itself becomes irrelevant at some point, any nontrivial parts of the LLVM->TGSI code should be needed anyway for those cases. |
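A sketch of the masked write Luca describes, using the LLVM C API. The assumption (mine, not from the thread) is that each four-component register is kept as a single <4 x float> SSA value, and a writemask is realized by shuffling the freshly computed lanes into the previous value:

    #include <llvm-c/Core.h>

    /* Merge 'new_val' into 'old_dst' according to a TGSI-style 4-bit
     * writemask.  shufflevector indices 0..3 pick lanes of new_val and
     * 4..7 pick lanes of old_dst, so set mask bits take the new lane. */
    static LLVMValueRef
    emit_masked_write(LLVMBuilderRef b, LLVMValueRef old_dst,
                      LLVMValueRef new_val, unsigned writemask)
    {
        LLVMValueRef idx[4];
        unsigned i;

        for (i = 0; i < 4; i++)
            idx[i] = LLVMConstInt(LLVMInt32Type(),
                                  (writemask & (1u << i)) ? i : 4 + i, 0);

        return LLVMBuildShuffleVector(b, new_val, old_dst,
                                      LLVMConstVector(idx, 4), "masked_write");
    }

For example, a MOV dst.xz, src would call this with writemask 0x5; afterwards the optimizer only ever sees ordinary SSA vector values, so no pass has to know what a writemask is.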
From: Nicolai H. <nha...@gm...> - 2010-04-05 15:06:34
|
On Sun, Apr 4, 2010 at 10:42 PM, Luca Barbieri <luc...@gm...> wrote: >> Way back I actually looked into LLVM for R300. I was totally >> unconvinced by their vector support back then, but that may well have >> changed. In particular, I'm curious about how LLVM deals with >> writemasks. Writing to only a select subset of the components of a vector >> is something I've seen in a lot of shaders, but it doesn't seem to be >> too popular in CPU-bound SSE code, which is probably why LLVM didn't >> support it well. Has that improved? >> >> The trouble with writemasks is that it's not something you can just >> implement one module for. All your optimization passes, from a simple >> peephole to the smartest loop modifications, need to understand the >> meaning of writemasks. > > You should be able to just use > shufflevector/insertelement/extractelement to mix the newly computed > values with the previous values in the vector register (as well as > doing swizzles). Okay, that looks good. > There is also the option of immediately scalarizing, optimizing the > scalar code, and then revectorizing. > This risks pessimizing the input code, but might turn out to work well. This might depend on the target: R600+, for example, is quite scalar-oriented anyway (modulo a lot of subtle limitations), so just pretending that everything is scalar could work well there, since revectorizing is almost unnecessary. cu, Nicolai |
From: Luca B. <luc...@gm...> - 2010-04-05 15:50:49
|
> This might depend on the target: R600+, for example, is quite > scalar-oriented anyway (modulo a lot of subtle limitations), so just > pretending that everything is scalar could work well there, since > revectorizing is almost unnecessary. Interesting; nv50 is also almost fully scalar, and based on the Gallium driver source, i965 seems to be scalar too. So it seems it would really make sense to also have a scalar IR, whether LLVM IR or something else. Of course, "scalar" is usually actually SoA SIMD, but that's mostly hidden, except for things like barriers, join points, and nv50 "voting" instructions. |
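For readers unfamiliar with the AoS/SoA terms used in this thread, here is a small C illustration of the difference (the batch size of four fragments is just an assumption for the example, not a statement about any particular hardware):

    #define BATCH 4   /* fragments processed together in the SoA layout */

    /* AoS: one fragment's RGBA sits together in a 4-wide register, so one
     * operation touches all four channels of that single fragment */
    static void scale_aos(float rgba[4], float k)
    {
        int c;
        for (c = 0; c < 4; c++)
            rgba[c] *= k;
    }

    /* SoA: each register holds the same channel of a whole batch, so a
     * "scalar" operation really works on one channel of many fragments */
    static void scale_soa(float r[BATCH], float g[BATCH], float b[BATCH],
                          float a[BATCH], float k)
    {
        int i;
        for (i = 0; i < BATCH; i++) {
            r[i] *= k;
            g[i] *= k;
            b[i] *= k;
            a[i] *= k;
        }
    }

This is the sense in which "scalar" shader code is really SoA SIMD: the per-channel code looks scalar, but each instruction still operates on a whole batch of fragments at once.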