From: Ian R. <id...@us...> - 2005-07-22 06:33:48
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Roland Scheidegger wrote:

> I'm actually wondering how ATI solved that problem in their driver, I
> couldn't see an easy way out to avoid the fallback - even using the 2
> additional tex env stages or the second phase of the fragment pipeline
> isn't going to fix the issue I think. Maybe someone else has a good idea?

So, for any set of texture environments there is an ordering of
operations and an assignment of registers that will work. Once upon a
time I wrote a python script that implemented a simple algorithm to do
this. I'll have to see if I can dig it up.

The algorithm works in two passes. The first pass identifies any
texture stages and texture reads that do not contribute to the final
result. I'm going to use the notation T# for a texture read and P for
the previous result. If the texture environment is
{ {T0 + T1} {T3 - T2} }, then T0, T1, and the result of adding them
don't contribute to the final result. You can omit those stages
entirely and freely use those registers as temporaries.

The second pass assigns registers. Each T# gets assigned the next R#,
in order. If T0, T1, and T4 contribute to the final result, they get
assigned R0, R1, and R2. Next, each P gets assigned an available
register. A register is available if it's either unassigned or its
value will not be read again. At any point, there is *always* an
available register. I think this is mathematically provable, but it's
way beyond my patience to do so. :)

Here are a couple of examples. I have left out the operations for
clarity. I'm also going to simplify a bit. I assume 3 textures, 3
registers, 3 stages, and 2 reads per stage.
    Start:    {T0, T2}, {P , T0}, {T1, P }
    Pass 1:   {T0, T2}, {P , T0}, {T1, P }
    Pass 2.1: {R0, R2}, {P , R0}, {R1, P }
    Pass 2.2: {R0, R2}, {R2, R0}, {R1, R0}

    Start:    {T0, T2}, {T1, T0}, {T1, P }
    Pass 1:             {T1, T0}, {T1, P }
    Pass 2.1:           {R1, R0}, {R1, P }
    Pass 2.2:           {R1, R0}, {R1, R0}

Working through this, I noticed something that I hadn't noticed
before. This technique only works if each operation cannot access the
entire register set. I first did it with 3 reads per stage, and I very
quickly came up with some impossible examples. :) 3 reads w/6
registers will still work.

The nice thing about this algorithm is that it not only works, but it
eliminates "dead code" and unused textures. I don't know about the
former, but the latter can certainly improve the performance of
ill-written code.

In addition, this same algorithm could be used to optimize
ATI_fragment_program code. It should also make it possible to
implement NV_texture_env_combine4, which is used by a lot more
programs than ATI_texture_env_combine3. In both these cases you need
to expand the notation to have multiple P values.

Other optimizations are possible, but I never explored them. Most of
the ones that I could think of are probably unlikely in practice.
Things like replacing {T1 + T2}, {P + P}, {P + T3} with
{T1 + T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with
{T1 * T2 + T0} are possible, but probably not worth the effort.

I think the right way to actually implement this in the driver is to
convert texture env (be it ARB_texture_env_combine /
ATI_texture_env_combine3 or NV_texture_env_combine4) into an
ATI_fragment_program and optimize that. Doing it that way effectively
kills two birds with one stone. We can get away with that here because
the texture env will only ever require one pass. One nice thing about
doing it that way is you can write an application that converts
texture env scripts to ATI_fragment_programs.
You can compare the direct implementation of the texture env with the
generated ATI_fragment_program. That should be a *lot* easier to debug
than doing it in the driver code!

It's also worth noting that a similar technique can be applied in the
i830 driver to implement ATI_texture_env_combine3. The i830 implements
*most* of the required instructions. The unavailable instructions can
be implemented by simpler operations (e.g., {T0*T1-T2} becomes
{T0*T1} {P-T2}). Adding the optimization pass, especially if it *did*
the optimizations that I said were "probably not worth the effort",
would reduce the chances of needing a fallback. An env like
{T0*T1-T2} {P+T3} {P*C} {P+T0} would be optimized to
{T0*T1+T3} {P-T2} {P*C+T0}.

If you don't think you want to tackle this now, I'll gather up my
python script and all my notes on the subject and file an enhancement
bug. That way none of the information will get lost / forgotten.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFC4JM7X1gOwKyEAw8RAkGqAJkBbllTflRuCtOiV8PxwFtDiJMGLQCfaEhg
ra2W8jQQae1odI95g/RwQ5Y=
=cGeR
-----END PGP SIGNATURE-----
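
For reference, the two-pass scheme described in the message can be
sketched in a few lines of Python. This is not the original script
mentioned above, only a minimal illustration under the same simplifying
assumptions (operations left out, one P value, 2 reads per stage, 3
registers). The names prune_dead_stages, assign_registers and
num_registers are hypothetical, and picking the lowest-numbered
available register is an assumption chosen because it matches the two
worked examples.

```python
def prune_dead_stages(stages):
    """Pass 1: drop leading stages whose result never reaches the output.

    A stage contributes only if it is the last stage, or if the stage
    after it contributes and reads "P" (the previous result).
    """
    first_live = len(stages) - 1
    while first_live > 0 and "P" in stages[first_live]:
        first_live -= 1
    return stages[first_live:]


def assign_registers(stages, num_registers=3):
    """Pass 2: map each live T# and each P to a register name."""
    live = prune_dead_stages(stages)

    # Pass 2.1: live textures get R0, R1, ... in texture-number order.
    textures = sorted(
        {op for stage in live for op in stage if op != "P"},
        key=lambda name: int(name[1:]),
    )
    if len(textures) > num_registers:
        raise ValueError("more live textures than registers")
    tex_reg = {name: index for index, name in enumerate(textures)}

    # Pass 2.2: give each stage result a register that no later stage
    # still needs to read a texture value from.
    result = []
    prev_dest = None  # register holding the previous stage's result
    for i, stage in enumerate(live):
        row = []
        for op in stage:
            if op != "P":
                row.append(f"R{tex_reg[op]}")
            elif prev_dest is None:
                # Assumption: a P in the very first stage has no earlier
                # stage result, so it is left untouched.
                row.append("P")
            else:
                row.append(f"R{prev_dest}")
        result.append(tuple(row))

        if i + 1 < len(live):
            still_needed = {
                tex_reg[op]
                for later in live[i + 1:]
                for op in later
                if op != "P"
            }
            free = [r for r in range(num_registers) if r not in still_needed]
            if not free:
                raise ValueError("no available register for stage result")
            prev_dest = free[0]  # assumption: lowest-numbered one wins
    return result
```

Calling assign_registers([("T0", "T2"), ("P", "T0"), ("T1", "P")])
gives the Pass 2.2 row of the first example, and
assign_registers([("T0", "T2"), ("T1", "T0"), ("T1", "P")]) gives the
Pass 2.2 row of the second. Supporting multiple P values, as needed for
NV_texture_env_combine4, would require tracking more than one previous
result and is not attempted here.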